For decades we were spoiled by Moore’s law translating directly into exponential speed increases: CPU clock rates climbed exponentially until they hit 3 GHz in 2003, but for the last five years they have been stuck at that point. Instead, manufacturers now try to pack multiple cores onto a chip, and people have started looking for alternative ways to get faster computation (see the MRSC 2008 conference): Field Programmable Gate Arrays (FPGA), General Purpose computing on Graphics Processing Units (GPGPU) and, most recently, the Cell Broadband Engine (Cell/B.E.) from IBM-Sony-Toshiba.
Tony Williams over at the ChemConnector Blog has had a couple of people ask him which way to go and which option is best for a particular application. We’ve just invested two man-years of effort porting to the Cell/B.E., so not only do I have strong opinions, I also have enough “hands-on experience” to comment!
The April issue of Bio-IT World had an article about the use of GPUs for scientific computing. Then, at the Bio-IT World Expo, I chatted with Attila Berces (CEO of Chemistry Logic), an FPGA expert who had presented a similarity search system implemented on FPGA, while we presented our docking software running on the Cell/B.E. So, with all these angles fresh in my head, I have put together a comparative analysis.
Performance and capabilities
FPGA allows hardware-level wiring of decision logic. It excels at integer arithmetic, but floating point operations are difficult to encode and do not perform well compared to traditional CPUs, because CPUs run at several GHz while FPGAs have clock speeds of a few hundred MHz. Decision logic (branching) is bad for the deep pipelines of the CPU/GPU/Cell, but natural to the FPGA, as illustrated below. Parallelism can be very wide and massive, not limited by the architecture (128 or 256 bits for Cell and GPU). Therefore, FPGA shines at logic-intensive tasks that do not need floating point calculations, e.g. discrete math, graph algorithms, searching, matching and gene sequence alignment.
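To make the branching point concrete, here is a tiny, hypothetical C++ illustration (not from any of the code bases mentioned here): the same clamp written with branches, which risks pipeline flushes on a deep-pipeline processor but maps to cheap dedicated wiring on an FPGA, and written branch-free, which is how it is typically expressed for GPU/Cell-style SIMD hardware.

```cpp
#include <algorithm>

// Branchy version: each 'if' is a potential mispredicted branch and
// pipeline flush on a deep-pipeline CPU/GPU/Cell, but on an FPGA the
// comparisons become parallel wired logic with no penalty at all.
float clamp_branchy(float x, float lo, float hi) {
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

// Branch-free version: min/max compile down to single select-style
// instructions on SIMD hardware, so no branch prediction is involved.
float clamp_branchless(float x, float lo, float hi) {
    return std::min(std::max(x, lo), hi);
}
```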
GPU and Cell/B.E. are close cousins from a hardware architecture point of view. They both rely on Single Instruction Multiple Data (SIMD) parallelism, a.k.a. vector processing, they both run at high clock speeds (>3 GHz), and they implement floating point operations using RISC technology, achieving single cycle execution even for complex operations like reciprocal or square root estimates. These come in very handy for 3D transformations and distance calculations, used heavily both in 3D graphics and in scientific modeling. Both manage to pack over 200 GFlops (billions of floating point operations per second) into a single chip, which makes them excellent choices for applications like 3D molecular modeling, MM force field computations, docking, scoring, flexible ligand overlay and protein folding.

There are some subtle differences between the two. The Cell/B.E. supports double precision calculations while GPUs do not (there is some work being done in that direction at Nvidia, though), which makes the Cell/B.E. the only suitable choice for quantum chemistry calculations. There is a difference in memory handling too: GPUs rely on caching just like CPUs, while the Cell/B.E. puts complete control into the hands of the programmer via direct DMA programming. This allows developers to keep “feeding the beast” with data using double buffering techniques, without cache misses ever stalling the computation.

Another difference is that GPUs use wider, 256-bit registers, while the Cell/B.E. uses 128-bit registers with a double pipe that allows two operations to execute in a single cycle. The two approaches may sound equivalent at a cursory glance, but there is again a subtle difference. 128 bits house 4 floats, enough for a 3D transformation row or a point coordinate (typically extended to 4 components instead of 3 to handle perspective), so the Cell/B.E. can execute 2 different operations on them, while the GPU can only apply the same operation to more data. If the purpose is to apply one operation to a lot of data, that comes down to the same thing, but a more complex computation series on a single 3D matrix can run twice as fast on the Cell/B.E.

The 8 Synergistic Processor Units (SPUs) of the Cell/B.E. can transfer data between each other’s memory via a 192 GB/s bandwidth bus, while the fastest GPU (GeForce 8800 Ultra) has a bandwidth of 103.7 GB/s and all others fall well below 100 GB/s. High end GPUs have over 300 GFlops of theoretical throughput, but due to memory bus speed limitations and cache miss latency their practical throughput falls far short of that, while the Cell/B.E. has demonstrated benchmark results (e.g. for a real-time ray tracing application) far superior to those of the G80 GPU, despite its lower theoretical throughput.
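To make the register-width point concrete, here is a minimal sketch of a homogeneous point transform on the SPU, using the real spu_intrinsics.h API; transform_point is just an illustrative helper, not code from our software. Each 128-bit register holds exactly one matrix row or one 4-component point.

```cpp
#include <spu_intrinsics.h>

// Transform a homogeneous point p = (x, y, z, w) by a 4x4 matrix given
// as four row vectors: result = x*row0 + y*row1 + z*row2 + w*row3.
vector float transform_point(const vector float row[4], vector float p)
{
    // Broadcast each component of p across a full register, then
    // multiply-accumulate one matrix row per component. spu_madd is
    // fully pipelined, issuing one operation per cycle on the even pipe.
    vector float r;
    r = spu_mul (row[0], spu_splats(spu_extract(p, 0)));
    r = spu_madd(row[1], spu_splats(spu_extract(p, 1)), r);
    r = spu_madd(row[2], spu_splats(spu_extract(p, 2)), r);
    r = spu_madd(row[3], spu_splats(spu_extract(p, 3)), r);
    return r;
}
```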
A fair cost comparison requires measuring roughly equivalent processing power, which is difficult because FPGA is better at logic and integer computation, while GPU and Cell/B.E. are better at floating point computation, so which benchmark should one choose? I decided to use a chart from Attila Berces in which he compares an FPGA solution to 400 Intel CPU cores. Let’s use that as a reference performance point and see how many Cell/B.E. and GPU units we need to reach it. We also have to differentiate between theoretical throughput and practical sustained throughput (see above); I have chosen practical throughput as the basis of the comparison in the table below:
| Costs | 400 CPU cluster | FPGA | GPU | Cell B.E. |
The range in the cost of the Cell/B.E. solution is due to the very different price points of the various options: the cheapest is the Sony PS3 at $400, providing 6 usable SPE cores; the Mercury CAB is about $8,000 with 8 SPEs; while the IBM QS21 blade is about $10,000 with 16 SPEs. High end GPUs have price points around $1,000. FPGAs have a high entry price, which formed the basis of the table above.
Programming effort and compatibility
Last but not least, let me address the programming effort necessary to make use of these acceleration techniques. We have completed our first port from scalar Intel code to the Cell/B.E. for the eHiTS docking software. We ported to the SPUs about 10% of the code, responsible for over 98% of the CPU time spent, amounting to a bit over 21,500 lines of code. The total effort, including the learning curve, took about 2 man-years of work. That may seem a lot, but consider not only the learning aspect (the technology was completely new to us when we started), but also that we went down to the lowest assembly-level performance tuning, counting individual operation cycles and analyzing every single pipe stall in the tight loops until we got them perfectly streamlined to run at near-peak performance. The vectorization (SIMD data arrangement and operations) would have been just as necessary had we targeted GPUs.

Programming GPUs has traditionally been much more complicated, via fake OpenGL graphics calls. Recently both Nvidia and AMD have issued libraries with more convenient APIs for programming GPUs for general purpose computation. Nevertheless, you still need to transform the entire code and computation sequence into those API calls. In contrast, you can simply compile your existing C or C++ code for the Cell/B.E. SPU using a variant of the gcc compiler. Of course, if you only do that much, you will not reach very high performance: your code is still scalar, so all you gain is running on multiple cores (up to 8x performance, but due to branch penalties more likely around 4x). The advantage is that you can start out this way, getting a large application running about 4 times faster, and already on the SPUs, within a few weeks of work. Then you can profile where the bulk of the time is spent and focus your efforts on optimizing/vectorizing only the most important pieces of code, along the lines sketched below. In comparison, both GPU and FPGA require an all-or-nothing commitment and effort. The effort required for FPGA is far more significant (several orders of magnitude), because the code has to be taken way beyond the assembly coding level, all the way down to the microelectronics gate logic level.
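As a rough sketch of this incremental path (the function names and the distance-squared loop are illustrative, not eHiTS code): the scalar version compiles as-is with the SPU gcc variant and already runs on an SPU, and the vectorized version is what such a hot spot might be turned into after profiling.

```cpp
#include <spu_intrinsics.h>

// Scalar version: compiles unmodified for the SPU, but uses only one
// lane of the 4-wide registers.
float sum_dist2_scalar(const float* dx, const float* dy,
                       const float* dz, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += dx[i]*dx[i] + dy[i]*dy[i] + dz[i]*dz[i];
    return sum;
}

// Vectorized version: processes 4 distance terms per iteration.
// Assumes the data has been rearranged into 16-byte aligned vectors
// and that the count is a multiple of 4.
float sum_dist2_simd(const vector float* dx, const vector float* dy,
                     const vector float* dz, int n4)
{
    vector float acc = spu_splats(0.0f);
    for (int i = 0; i < n4; ++i) {
        acc = spu_madd(dx[i], dx[i], acc);
        acc = spu_madd(dy[i], dy[i], acc);
        acc = spu_madd(dz[i], dz[i], acc);
    }
    // Horizontal sum of the four partial accumulators.
    return spu_extract(acc, 0) + spu_extract(acc, 1)
         + spu_extract(acc, 2) + spu_extract(acc, 3);
}
```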
So, as I described in our white paper, while the Cell/B.E. requires a different kind of thinking and coding than a traditional CPU, the same is true for the GPU and the FPGA, and these latter ones require significantly more effort. Another important point is code compatibility and maintenance across multiple platforms. We have done all our vectorization and porting using C++ wrapper classes and functions for which we have two translations: one to the direct Cell/B.E. intrinsic API and another to simple scalar C code. This way we now have a single code base that runs both on the Cell/B.E. and on Intel/AMD platforms. In fact, the vectorization has slightly benefited the Intel code too: it runs about 10% faster than before the port. Of course, that is nothing compared to the 50-fold speedup we reached on the Cell/B.E. If you choose GPU or FPGA, you need to maintain very different code bases for those and for traditional CPUs.
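A minimal sketch of that single-source idea, assuming the __SPU__ macro that the SPU gcc variant defines (our actual wrappers are, of course, more elaborate, and the vec4 type here is purely illustrative):

```cpp
#ifdef __SPU__
#include <spu_intrinsics.h>
// SPU backend: one 128-bit register per vec4, real SIMD instructions.
struct vec4 {
    vector float v;
    vec4 operator+(vec4 o) const { vec4 r; r.v = spu_add(v, o.v); return r; }
    vec4 operator*(vec4 o) const { vec4 r; r.v = spu_mul(v, o.v); return r; }
};
#else
// Scalar fallback: plain C-style loops, compiles on any Intel/AMD box.
struct vec4 {
    float v[4];
    vec4 operator+(vec4 o) const {
        vec4 r;
        for (int i = 0; i < 4; ++i) r.v[i] = v[i] + o.v[i];
        return r;
    }
    vec4 operator*(vec4 o) const {
        vec4 r;
        for (int i = 0; i < 4; ++i) r.v[i] = v[i] * o.v[i];
        return r;
    }
};
#endif
```

Overloading the arithmetic operators keeps the algorithm code identical on both backends; only the vec4 definition changes per platform.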
So, I hope I have managed to provide a good overview of the differences between FPGAs, GPUs and the Cell. I’m clearly biased, but, I believe, rightly so!