The fast and the furious: compare Cell/B.E., GPU and FPGA

For decades we were spoiled by Moore’s law translating directly into exponential speed increases: CPU clock rates climbed steadily until they reached 3 GHz in 2003, but over the last five years they have been stuck at roughly that point. Instead, manufacturers now pack multiple cores into a chip, and people have started to look for alternative ways to get faster computation (see the MRSC 2008 conference): Field Programmable Gate Arrays (FPGA), General Purpose computing on Graphics Processing Units (GPGPU), and most recently the Cell Broadband Engine (Cell/B.E.) from IBM-Sony-Toshiba.

Tony Williams over at the ChemConnector Blog has had a couple of people ask him for comments about which way to go and which one is better for a particular application. We have just invested two man-years of effort porting to the Cell/B.E., so not only do I have strong opinions, I also have enough hands-on experience to comment!
The April issue of Bio-IT World had an article about the use of GPUs for scientific computing. Then, at the Bio-IT World Expo, I chatted with Attila Berces (CEO of Chemistry Logic), an FPGA expert who had presented a similarity search system implemented on FPGA, while we presented our docking software running on the Cell/B.E. With all these angles fresh in my head, I have put together a comparative analysis.

Performance and capabilities

An FPGA allows hardware-level wiring of decision logic and excels at integer arithmetic, but floating-point operations are difficult to encode and do not yield very good performance compared to traditional CPUs. The reason is that CPUs run at several GHz, while FPGAs have clock speeds of a few hundred MHz. Decision logic (branching) is bad for the deeply pipelined CPU/GPU/Cell, but natural to the FPGA. Parallelism on an FPGA can be very wide and massive, not limited by the architecture (128 or 256 bits for Cell and GPU). Therefore, the FPGA shines for logic-intensive tasks that do not need floating-point calculations, e.g. discrete-math graph algorithms, searching, matching, and gene sequence alignment.
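
To give a feel for what “branching is bad for a deep pipeline” means in practice, here is a small, purely illustrative sketch of how a data-dependent if/else is typically rewritten branch-free with the select-style SIMD intrinsics of the Cell SDK (spu_cmpgt, spu_sel); an FPGA, by contrast, simply wires such decision logic directly into gates.

    /* Branch-free rewrite of the scalar loop body:
     *     if (x[i] > limit) y[i] = x[i] * a; else y[i] = x[i] * b;
     * Hypothetical example, not taken from any real code base.
     * Build with spu-gcc (IBM Cell SDK).                              */
    #include <spu_intrinsics.h>

    /* n = number of 4-float vectors in x and y */
    void scale_clipped(vector float *y, const vector float *x,
                       int n, float limit, float a, float b)
    {
        vector float vlim = spu_splats(limit);
        vector float va   = spu_splats(a);
        vector float vb   = spu_splats(b);
        int i;
        for (i = 0; i < n; i++) {
            /* per-element mask: all ones where x > limit, zero otherwise */
            vector unsigned int gt = spu_cmpgt(x[i], vlim);
            /* choose the multiplier per element without any branch       */
            vector float mul = spu_sel(vb, va, gt);
            y[i] = spu_mul(x[i], mul);
        }
    }

Every element follows the same instruction path, so the deep pipeline never stalls on a branch.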

The GPU and the Cell/B.E. are close cousins from a hardware architecture point of view. Both rely on Single Instruction Multiple Data (SIMD) parallelism, a.k.a. vector processing, both run at high clock speeds (>3 GHz), and both implement floating-point operations with RISC-style single-cycle execution even for complex operations like reciprocal or square-root estimates. These come in very handy for 3D transformations and distance calculations, which are used heavily in both 3D graphics and scientific modeling. Both manage to pack over 200 GFlops (billions of floating-point operations per second) into a single chip, which makes them excellent choices for applications like 3D molecular modeling, MM force-field computations, docking, scoring, flexible ligand overlay, and protein folding.

There are some subtle differences between the two. The Cell/B.E. supports double-precision calculations while GPUs do not (there is some work being done in that direction at Nvidia, though), which makes the Cell/B.E. the only suitable choice of the two for quantum chemistry calculations. There is a difference in memory handling too: GPUs rely on caching just like CPUs, while the Cell/B.E. puts complete control into the hands of the programmer via direct DMA programming. This allows developers to keep “feeding the beast” with data using double-buffering techniques, without ever hitting a cache miss that stalls the computation. Another difference is register width: GPUs use wider, 256-bit registers, while the Cell/B.E. uses 128-bit registers but has a dual pipeline that allows two operations to execute in a single cycle. The two approaches may sound equivalent at a cursory glance, but there is a subtle difference: 128 bits hold 4 floats, enough for a 3D transformation row or a point coordinate (typically extended from 3 to 4 components to handle perspective), so the Cell/B.E. can execute two different operations on such vectors in a cycle, while the GPU can only apply the same operation to more data. If the purpose is to apply one operation to a lot of data, the two come down to the same thing, but a more complex series of computations on a single 3D matrix can be done up to twice as fast on the Cell/B.E.

The 8 Synergistic Processor Units of the Cell/B.E. can transfer data between each other’s local memory via a bus with 192 GB/s bandwidth, while the fastest GPU (GeForce 8800 Ultra) has a memory bandwidth of 103.7 GB/s and all others fall well below 100 GB/s. High-end GPUs have over 300 GFlops of theoretical throughput, but due to memory bus speed limitations and cache-miss latency their practical throughput falls far short of that, while the Cell/B.E. has demonstrated benchmark results (e.g. for a real-time ray-tracing application) far superior to those of the G80 GPU, despite its lower theoretical throughput.
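
To make the DMA double-buffering point more concrete, here is a minimal sketch of how an SPU kernel typically overlaps data transfer with computation. It uses the MFC and SIMD intrinsics from the IBM Cell SDK (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all, spu_madd); the kernel itself, a sum-of-squares pass over chunks of floats pulled from main memory, is a made-up example rather than anything from eHiTS.

    /* Double-buffered SPU kernel sketch (hypothetical example).
     * Assumes the source data in main memory is 128-byte aligned.
     * Build with spu-gcc from the IBM Cell SDK.                        */
    #include <stdint.h>
    #include <spu_intrinsics.h>
    #include <spu_mfcio.h>

    #define CHUNK 1024   /* floats per DMA (4 KB, below the 16 KB DMA limit) */

    /* Two local-store buffers so the next DMA can run while we compute. */
    static vector float buf[2][CHUNK / 4] __attribute__((aligned(128)));

    /* Start a DMA transfer of one chunk into local store, tagged 'tag'. */
    static void fetch(uint64_t ea, int which, unsigned int tag)
    {
        mfc_get((void *)buf[which], ea, CHUNK * sizeof(float), tag, 0, 0);
    }

    /* Block until the DMA with tag 'tag' has completed.                 */
    static void wait_for(unsigned int tag)
    {
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();
    }

    /* Add the sum of squares of one chunk to the SIMD accumulator.      */
    static vector float accumulate(const vector float *p, int n, vector float acc)
    {
        int i;
        for (i = 0; i < n; i++)
            acc = spu_madd(p[i], p[i], acc);   /* acc += p[i] * p[i] */
        return acc;
    }

    /* Sum of squares over 'nchunks' chunks starting at effective address 'ea'.
     * While chunk i is being processed, the DMA for chunk i+1 is in flight,
     * so the SPU never stalls waiting for data.                          */
    float sum_of_squares(uint64_t ea, int nchunks)
    {
        vector float acc = spu_splats(0.0f);
        int cur = 0, i;
        fetch(ea, cur, 0);                              /* prime the pipe */
        for (i = 0; i < nchunks; i++) {
            if (i + 1 < nchunks)                        /* start next DMA */
                fetch(ea + (uint64_t)(i + 1) * CHUNK * sizeof(float),
                      cur ^ 1, cur ^ 1);
            wait_for(cur);                              /* current chunk ready */
            acc = accumulate(buf[cur], CHUNK / 4, acc);
            cur ^= 1;                                   /* swap buffers   */
        }
        /* fold the 4 partial sums held in the SIMD register */
        return spu_extract(acc, 0) + spu_extract(acc, 1)
             + spu_extract(acc, 2) + spu_extract(acc, 3);
    }

A kernel that produces bulk output would stream its results back with mfc_put in the same tagged, double-buffered fashion.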

Cost comparison

A fair cost comparison requires the ability to measure roughly equivalent processing power, but that is difficult because the FPGA is better at logic and integer computation, while the GPU and Cell/B.E. are better at floating-point computation. So which benchmark to choose? I decided to use a chart from Attila Berces in which he compares an FPGA solution to 400 Intel CPU cores. Let us use that as a reference performance point and see how many Cell/B.E. and GPU units we need to reach it. We also have to differentiate theoretical throughput from practical sustained throughput (see above). I have chosen practical throughput as the basis of the comparison in the table below:

Costs                         400 CPU cluster   FPGA   GPU    Cell/B.E.
Hardware purchase             $200K-$400K       $60K   $30K   $4K-$40K
Electricity (power+cooling)   $180K-$360K       $6K    $18K   $3K
Total cost                    $380K-$760K       $66K   $48K   $7K-$43K

The range in the cost of the Cell/B.E. solution is due to the very different price points of the various options: the cheapest is the Sony PS3 at $400, providing 6 usable SPE cores; the Mercury CAB is about $8,000 with 8 SPEs; and the IBM QS21 blade is about $10,000 with 16 SPEs. High-end GPUs have price points around one thousand dollars. FPGAs have a high entry price, which formed the basis of the figures in the table above.

Programming effort, compatibility

Last but not least, let me address the programming effort needed to make use of these acceleration techniques. We have completed our first port from scalar Intel code to the Cell/B.E. for the eHiTS docking software. We ported to the SPUs the roughly 10% of the code that was responsible for over 98% of the CPU time, amounting to a bit over 21,500 lines of code. The total effort, including the learning curve, took about two man-years of work. That may seem like a lot, but consider not only the learning aspect (the technology was completely new to us when we started), but also that we went down to the lowest level of assembly performance tuning, counting individual operation cycles and analyzing every single pipe stall in the tight loops until the code was perfectly streamlined to run at near-peak performance. The vectorization (SIMD data arrangement and operations) would have been necessary even if we had targeted GPUs.

Programming GPUs has traditionally been much more complicated, requiring the computation to be disguised as OpenGL graphics calls. Recently, both Nvidia and AMD have released libraries with more convenient APIs for general-purpose computation on their GPUs. Nevertheless, you still need to transform the entire code and computation sequence into those API calls. In contrast, you can simply compile your existing C or C++ code for the Cell/B.E. SPU using a variant of the gcc compiler. Of course, if you only do that much you will not reach very high performance: the code is still scalar, so all you gain is running on multiple cores (up to 8x performance, but due to branch penalties more likely around 4x). The advantage is that you can start out this way, getting a large application running about four times faster, and already on the SPUs, with a few weeks of work. Then you can profile where the bulk of the time is spent and focus your efforts on optimizing and vectorizing only the most important pieces of code. In comparison, both GPU and FPGA require an all-or-nothing commitment and effort. The effort required for FPGA is far more significant (several orders of magnitude), because the code has to be taken way beyond the assembly level, all the way down to the microelectronics gate-logic level.
So, as I described in our white paper, while the Cell/B.E. requires a different kind of thinking and coding than a traditional CPU, the same is true for the GPU and the FPGA, and these latter two require significantly more effort. Another important point is code compatibility and maintenance across multiple platforms. We have done all our vectorization and porting using C++ wrapper classes and functions for which we have two translations: one to the direct Cell/B.E. intrinsic API, and another to simple scalar C code. This way we now have a single code base that runs on the Cell/B.E. and on Intel/AMD platforms too. In fact, the vectorization has slightly benefited the Intel code as well: it runs about 10% faster than before the port. Of course, that is nothing compared to the 50-fold speedup we reached on the Cell/B.E. If you choose GPU or FPGA, you need to maintain very different code bases for those and for traditional CPUs.
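
To illustrate the dual-translation wrapper idea, here is a minimal sketch (the Vec4 class below is hypothetical and far simpler than our actual wrappers): the same operator calls compile to SPU intrinsics under spu-gcc, which defines __SPU__, or to plain scalar C++ on Intel/AMD.

    // Minimal sketch of a dual-backend SIMD wrapper (illustrative only;
    // the names are hypothetical, not the actual SimBioSys classes).
    #ifdef __SPU__
    #include <spu_intrinsics.h>

    struct Vec4 {
        vector float v;
        Vec4(float x, float y, float z, float w) {
            float a[4] __attribute__((aligned(16))) = { x, y, z, w };
            v = *(vector float *)a;
        }
        explicit Vec4(vector float q) : v(q) {}
        Vec4 operator+(const Vec4 &o) const { return Vec4(spu_add(v, o.v)); }
        Vec4 operator*(const Vec4 &o) const { return Vec4(spu_mul(v, o.v)); }
        // fused multiply-add: this*a + b, a single SPU instruction
        Vec4 madd(const Vec4 &a, const Vec4 &b) const {
            return Vec4(spu_madd(v, a.v, b.v));
        }
    };

    #else  // scalar fallback for Intel/AMD builds

    struct Vec4 {
        float x, y, z, w;
        Vec4(float X, float Y, float Z, float W) : x(X), y(Y), z(Z), w(W) {}
        Vec4 operator+(const Vec4 &o) const {
            return Vec4(x + o.x, y + o.y, z + o.z, w + o.w);
        }
        Vec4 operator*(const Vec4 &o) const {
            return Vec4(x * o.x, y * o.y, z * o.z, w * o.w);
        }
        Vec4 madd(const Vec4 &a, const Vec4 &b) const {
            return Vec4(x * a.x + b.x, y * a.y + b.y,
                        z * a.z + b.z, w * a.w + b.w);
        }
    };

    #endif

    // Application code is written once against the wrapper, e.g. one
    // row of a 4x4 transformation applied as a multiply-add:
    inline Vec4 axpy(const Vec4 &a, const Vec4 &x, const Vec4 &y) {
        return a.madd(x, y);   // a*x + y on either platform
    }

The application code only ever sees the wrapper type, which is what lets a single code base serve both the Cell/B.E. and conventional CPUs.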

So, I hope I have managed to provide a good overview of the differences between FPGAs, GPUs and the Cell. I’m clearly biased but, I believe, rightly so!

ZZ

15 Responses to “The fast and the furious: compare Cell/B.E., GPU and FPGA”

  1. Jay Fenton Says:

    Great comparison, and good to see the Cell getting some more rightly deserved attention.

    Did you research (or use) any of the frameworks that are starting to appear for coding on Cell? (Mercury’s MCF, IBMs various options..) or just get down ‘n dirty with the intrinsics & asm?

    (My LinkedIn profile has a link to a Cell Developers group… would be delighted if you joined!)

  2. Jay Fenton Says:

    Ah ha, just found the link to allow others to join the group directly:

    Cell Broadband Engine™ Developers

    The Cell/B.E Developers group aims to bring together Cell developer specialists world-wide for the purposes of collaboration, networking and to ease resourcing for projects utilising this unique architecture.

    http://www.linkedin.com/e/gis/95035/00FFF57AD09D

  3. Zsolt Zsoldos Says:

    Thanks for the link Jay, I will definitely join. We will continue to port our existing software to Cell and support it for every new application we develop.

    Before starting the main porting project, we did some small pilots using some of the higher level libraries from the IBM SDK and also tried out the RapidMind solution for the Cell. However, the performance was significantly worse than what we could get with direct intrinsic and DMA coding. So, we decided to get down n’ dirty and used intrinsics, checked the asm code with the static analysis tool from the SDK to make sure we iron out any stalls in the tight loops.

    ZZ

  4. Mike Says:

    One other fairly important point I think ought to be made here is around the ability to rely on the computed answers - what’s the use of all that speed if you can’t trust it, or have to run several iterations to build confidence?

    That is to say, of the technologies (CPU, GPU, FPGA, and Cell/B.E.) only the CPUs and the Cell/B.E. have protection against Soft Errors (which are undetected corrupted data caused by things like cosmic rays). It’s the difference between a server heritage and a consumer low-cost target. Words like ECC, Parity, and CRC don’t appear in GPUs (or typically FPGAs - certainly they could be coded into FPGAs, but at the cost of gates and effort etc).

  5. SimBioSys Blog » Blog Archive » The future of HPC Says:

    […] Peter Murray Rust is asking on his blog: Where should we get our computing? The answer is: from the multi-core accelerator technologies, like GPGPU and Cell BE. His worries about hardware cost and management can be reduced by 50-100 fold using these accelerators. It is no accident that the RoadRunner supercomputer is built on Cell BE processors for the computing (with the communication and file I/O being handled by AMD Opterons), beating the previous fastest HPC system benchmark (held by IBM’s BlueGene) by over 4X. As for the GPGPU versus Cell BE angle, this symposium has reinforced my belief that the Cell BE is a general purpose accelerator suitable for any task (just like a CPU) while the GPUs from AMD and NVIDIA are highly specialized tools that can get great performance for a very specific subset of the problems. GPUs were designed for graphics, where the computation tasks are massively parallel (millions of 3D points and triangles to process) and completely independent (what needs to appear on each pixel is independent of the others, and so is the computation to be performed for different 3D points). Tasks that have these properties are suitable for GPGPU, e.g. image processing, some physics simulations (material science, plasma, laser, particles) and even some chemistry problems, like molecular dynamics simulation if one wants to compute the full atom pair matrix of forces. However, as soon as you want to be smart and compute only forces within a cut-off range and/or need dynamically changing data sizes or inter-dependencies (like an N-body problem or QM), then the GPU is not a good choice. There can be non-trivial performance hurdles even for seemingly fitting problems, like image processing. Michael Kinsner brought up an example in his talk, where he had to learn the hard way that processing image blocks of 16×4 was fast, but 8×8 was much slower due to a peculiar memory access pattern issue - the input data pattern of the code has to map directly to the underlying hardware architecture to get good performance on the GPU. […]

  6. RandomFeedFollower Says:

    “Mike Says:
    May 9th, 2008 at 1:10 pm

    One other fairly important point I think ought to be made here is around the ability to rely on the computed answers - what’s the use of all that speed if you can’t trust it, or have to run several iterations to build confidence?

    That is to say, of the technologies (CPU, GPU, FPGA, and Cell/B.E.) only the CPUs and the Cell/B.E. have protection against Soft Errors (which are undetected corrupted data caused by things like cosmic rays). It’s the difference between a server heritage and a consumer low-cost target. Words like ECC, Parity, and CRC don’t appear in GPUs (or typically FPGAs - certainly they could be coded into FPGAs, but at the cost of gates and effort etc).”

    What a load of FUD. Firstly, the cosmic rays issue: GPUs use a similar plastic ball-grid-array packaging to CPUs. In fact I’ve yet to see a GPU that doesn’t completely enclose the silicon in a cavity, unlike say the P3 FBGA. FPGAs? They come in a variety of package options, also plastic, some for commercial applications, some for industrial, and some that are military radiation hardened. Traditional general-purpose CPUs can come in rad-hard packaging too, but they’re typically surface mounted and are ~5x the cost of the commercial parts (for you Mike, the ones you buy at BestBuy when you’ve saved up lawn-mowing money).

    Accuracy? None of the GPU vendors are exact to the IEEE standard with their rounding schemes, but it’s also known that x86 isn’t 100% spot-on in all cases. There are methods to correct this precision with respect to floating-point unit size; in fact one was invented by Newton a few hundred years before electricity. With regard to FPGAs, the precision is your choice, or left up to your choice of floating-point core. There are dozens of companies that make off-the-shelf IP (Intellectual Property) cores that are IEEE compliant.

    ECC? Typically you don’t see this in on-chip memories, or at least I never have in 12 years of hardware development. As for off-chip memories, it depends on the component. You can buy non-ECC or ECC memories; the choice is usually made in the project requirements phase of any development project. Do GPUs use ECC? I’m sure vendors have and do; Nvidia’s chips support ECC (at least to my knowledge the GeForce 6, the last datasheet of which I’ve seen). FPGAs? Of course.

    Parity? GPUs? No, it would almost be stupid to waste IO pins on the overhead given the project constraints. Maybe some vendors do; I’ve never seen it. FPGAs? Definitely, on every block RAM or distributed RAM I’ve ever generated on a Xilinx Spartan or Virtex FPGA. At a cost of gates? No, at a cost of one bit of parity per eight bits of data.

    CRC? Why would one implement this in hardware other than for acceleration? It’s a bit heavyweight for routine hardware-to-hardware tests. However, it can easily be implemented on a GPU (why one would, I don’t know) and, as with most of the BS you’ve spewed, there are probably a dozen vendors that sell IP cores that do this.

    The only real difference between GPUs and CPUs is that GPUs are logic dense with small amounts of on-chip memory (typically a ratio of 70%-90% logic to 30%-10% memory), while almost 60% of a modern x86 CPU is on-chip memory (L1, on-die L2 cache, register file). With FPGAs it depends on the package selected, but they typically maintain a 90% logic to 10% memory ratio, at least with Xilinx’s line. As a result of this bias towards logic over on-chip memory, and a simple yet high-speed architecture, they can be applied to a variety of engineering solutions and be coupled with a wide range of peripheral components.

    -Burned

  7. Flexy Says:

    great post

  8. Petaflops and Cell Processors at The ChemConnector Blog by Antony Williams - Observations and Musings for the Chemistry Community By Antony Williams Says:

    […] With the fastest computer in the world using the Cell processor as part of its architecture, and with the processor now proving itself for docking, the question is whether we will see this processor become even more mainstream in the foreseeable future. It’s NOT easy to port…but it can be done. […]

  9. SimBioSys Blog » Blog Archive » Fastest supercomputer built on the Cell/BE Says:

    […] The last article in the above list highlights: “Roadrunner was built using 6,912 dual-core Opteron processors from Advanced Micro Devices, and 12,960 IBM Cell eDP accelerators. Early tests indicate that the Cell processors have reached 1.33 petaflops while the Opterons reached 49.8 teraflops”. So twice as many Cells produce 26.7 times more crunching power compared to the dual-core Opterons. In an earlier blog post, I have analyzed the advantages of the Cell BE over other acceleration technologies, like GPU and FPGA. ZZ […]

  10. SimBioSys Blog » Blog Archive » Bio-IT World article about eHiTS Lightning Says:

    […] Mike May wrote an article for Bio-IT World about eHiTS Lightning and our efforts to accelerate docking on the Cell BE processor. There are more details about the topic in our white paper, and some earlier blog posts here and here. Some people argue that speed is not the most important issue for docking; accuracy is far more crucial. They fail to realize that speed is what enables us to use much finer pose sampling and more sophisticated scoring terms and still run in a reasonable time frame. Think about it this way: it is well known that quantum chemistry based methods, e.g. free energy perturbation (FEP), can provide the most accurate binding energy estimation, yet nobody has ever considered using such a technique for scoring in docking or virtual screening, simply because it takes many CPU hours to compute the energy for a single ligand pose with FEP, while a single docking run requires many thousands, possibly millions of poses to be scored. If SimBioSys, as a software vendor, offered docking software with FEP scoring that required years of CPU time for a single docking run, nobody would buy such a product. But if we could do FEP-score based docking such that it runs in a few minutes per ligand, that would be a “killer application”. […]

  11. SimBioSys Blog » Blog Archive » IBM’s white paper on the Cell technology and Molecular Modeling Says:

    […] The fast and the furious: compare Cell/B.E., GPU and FPGA […]

  12. patrick Says:

    This is just a bit over my head, and with that I do have a question: with their strengths in different areas, would it be possible to use them together? Like in the past with a math co-processor? Or am I way off base here?

  13. Gregg J. Macdonald Says:

    re: FPGA - I recently announced very basic PC/embedded-PC functionality and am offering a developer board with access to IP for those interested in an FPGA solution. As soon as I get all the rest of the main pieces up and running, it becomes easier for everyone else to hook into it . . . and reuse it. The cost of FPGAs has good downward pressure.

  14. Cam Says:

    >> In contrast, you can simply compile your
    >> existing C or C++ code for the Cell/B.E.
    >> SPU using a variant of the gcc compiler.
    >> Of course, if you only do that much, then …
    >> …
    >> … already on the SPUs with a few weeks
    >> of work for a large application. Then you
    >> can start profiling where the bulk of the
    >> time is spent …

    It is “the same” for NVidia, ‘write your code
    and run it’; it won’t be fast but it will run.

    To “fix” it you go in and split up the loops into
    threads and work to avoid data transfers…

    One Video Card (from ATI or NVidia) offers
    about a TeraFLOP of computing power, of which
    you can _expect_ 10% and may sometimes get
    30% of that. An extra 300 GFLOPs of additional
    power for your 200 to 300 MFLOPs computer.

    Some motherboards are designed to accept FOUR
    Video Cards (and the Cards are also designed
    to allow pairing and quading) so a “home-user”
    can “easily” have a TeraFLOP on their Desk
    for under $4000.

    The GPU (or “a Video Card”) is on every (new)
    Desktop Computer, ready to lend its power to
    specially compiled programs available today.

    The “Cell/B.E.” or FPGA are not as good an
    investment for most people since they are
    not required (like a Video Card).

    Your powerful Video Card can be used for
    its intended purpose (displaying output FAST on
    your monitor) when it is not being used
    for computation.

    Your Sony PS3 (when not used for computation)
    can double as a Video Game. What one would
    do with an FPGA when it is not used for its
    intended purpose is anyone’s guess.

    If the Cell were cheaper in quantity and
    available on a PCI Card it would be much
    better, but that is not to be.

    The FPGA is likely to be the fastest solution
    but not the cheapest. Ultimately it will
    be the GPU (or 4 of them) that win this race.

    >> In an earlier blog post, I have analyzed
    >> the advantages of the Cell BE over other
    >> acceleration technologies, like GPU and FPGA. ZZ

    The GPU and FPGA (if on a PCI Card) can obtain
    their input Data and transfer the calculation
    result to the Host Computer faster than you
    can send it over a Network from your Sony PS3.

    Cam

  15. cirus Says:

    I’ve heard about OpenCL as a tool where different GPUs (of different vendors) can be put to work together; can the Cell and a GeForce be integrated in this way?
