BCR Portable Performance
The very design of CDS (and BCR, a specific dialect and implementation
of it) squeezes every ounce of bandwidth from a system, so many programs
will communicate much more efficiently than if written using interfaces
like MPI or PVM. There are three basic reasons for this:
-
Communication within MPI and PVM is based on moving the data from a user-specified
buffer in the sender to a user-specified buffer in the receiver, even
if the sender and receiver are on the same processor or share access to
common memory. In CDS, the user can often let CDS decide
where the buffer involved in the communication is stored, so in some cases
it can be placed in memory accessible to both the sender and the receiver,
thereby avoiding copying entirely (e.g. by using the same buffer for
both) and reducing both overall latency and
the consumption of memory bandwidth. (A minimal sketch of this difference
appears after this list.)
-
Communication within MPI and PVM often utilizes hidden "control" or "housekeeping"
messages to ensure that there is sufficient buffer space on the receiver,
etc., and these messages and their protocols eat up bandwidth and latency
in the network. In CDS, there is always a user-accessible buffer
waiting at the receiving end of any communication, so these messages can
often be avoided. The network latency and bandwidth (aside from a
tiny amount used by CDS-specific message headers) are delivered directly
to the user application.
-
Communication within MPI and PVM includes processing of the data through
a "type", adding to the latency (or "overhead") of communication.
In CDS, the user application can access the raw bytes being transferred,
and when dealing in a homogeneous environment, this is often all that is
required. The overhead of processing the type can thereby often be
omitted completely, and in some other cases performed concurrently with
other communication.
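To make the first point concrete, here is a minimal C sketch (purely illustrative;
it is not BCR or CDS source and does not use their API): a copy-based hand-off must
move every byte of the region, while a shared-buffer hand-off moves only a pointer,
so its cost does not grow with the region size.

    #include <stdio.h>
    #include <string.h>

    /* Illustrative only: contrasts a copy-based hand-off (data moved from a
     * sender buffer into a separate receiver buffer) with a shared-buffer
     * hand-off (sender and receiver agree on one buffer, so only a pointer
     * changes hands). */
    int main(void)
    {
        static char sender_buf[10000];
        static char receiver_buf[10000];

        /* Copy-based model: every byte moves, consuming memory bandwidth
         * in proportion to the region size. */
        memcpy(receiver_buf, sender_buf, sizeof(sender_buf));

        /* Shared-buffer model: the "receiver" is simply handed the sender's
         * region; no bytes move, so the cost is constant regardless of size. */
        char *received_region = sender_buf;

        printf("copy model moved %zu bytes; pointer model moved 0 (region at %p)\n",
               sizeof(sender_buf), (void *)received_region);
        return 0;
    }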
These are illustrated by the very simple "ring" program provided
in the BCR "test" directory. It just passes data regions
(MPI and PVM would call them "messages") of different sizes (10 bytes,
100 bytes, 1000 bytes, and 10000 bytes) around a ring of CCEs (effectively
"processes"). It starts by reading a file that tells it how many
CCEs (i.e. processes) to put on each processor, then starts those CCEs,
and every CCE then performs a very simple loop of alternating puts
and gets to neighboring CCEs (sketched below).
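As a sketch of the structure just described (it is not the actual BCR test source;
the threads, one-slot mailboxes, and put/get functions below are invented stand-ins
for CCEs and their region hand-offs), the following self-contained C program passes
a region around a ring of threads by handing off a pointer, and reports an effective
bandwidth in the same spirit as the formula used below:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Stand-in for the "ring" test on a single SMP: NCCE threads pass a
     * pointer to one region around a ring NLAPS times.  All names are
     * illustrative assumptions, not BCR's API. */
    #define NCCE         3
    #define NLAPS        100000
    #define REGION_BYTES 10000

    /* One-slot "mailbox" per CCE: a put is a pointer store, a get a load. */
    static struct mailbox {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        char           *region;          /* NULL means "empty" */
    } box[NCCE];

    static void put_region(int dst, char *region)
    {
        pthread_mutex_lock(&box[dst].lock);
        while (box[dst].region != NULL)
            pthread_cond_wait(&box[dst].cond, &box[dst].lock);
        box[dst].region = region;        /* the whole "transfer" is this store */
        pthread_cond_broadcast(&box[dst].cond);
        pthread_mutex_unlock(&box[dst].lock);
    }

    static char *get_region(int me)
    {
        pthread_mutex_lock(&box[me].lock);
        while (box[me].region == NULL)
            pthread_cond_wait(&box[me].cond, &box[me].lock);
        char *r = box[me].region;
        box[me].region = NULL;
        pthread_cond_broadcast(&box[me].cond);
        pthread_mutex_unlock(&box[me].lock);
        return r;
    }

    static void *cce(void *arg)
    {
        int me = (int)(long)arg, next = (me + 1) % NCCE;
        if (me == 0)                          /* CCE 0 injects the region */
            put_region(next, malloc(REGION_BYTES));
        for (int lap = 0; lap < NLAPS; lap++) {
            char *region = get_region(me);    /* "get" from the previous CCE */
            put_region(next, region);         /* "put" to the next CCE */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NCCE];
        struct timespec t0, t1;
        for (int i = 0; i < NCCE; i++) {
            pthread_mutex_init(&box[i].lock, NULL);
            pthread_cond_init(&box[i].cond, NULL);
        }
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < NCCE; i++)
            pthread_create(&t[i], NULL, cce, (void *)i);
        for (int i = 0; i < NCCE; i++)
            pthread_join(t[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs   = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        double passes = (double)NCCE * NLAPS;     /* total region hand-offs */
        printf("effective bandwidth: %.3g Bytes/sec\n",
               REGION_BYTES * passes / secs);
        return 0;
    }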
The platform for this test consisted of two off-the-shelf computers
connected with a 10BaseT ethernet hub. One computer is a 2-processor
Pentium III SMP PC running at 850 MHz per processor; the other is a Mac Powerbook
G3 ("Wallstreet") running at 266 MHz.
The program was run with four different configurations:
-
2 CCEs ("processes"), both on the 2-processor SMP PC
-
2 CCEs, one on the PC and one on the Mac.
-
3 CCEs, all on the 2-processor SMP PC.
-
3 CCEs, two on the SMP PC and one on the Mac.
Here is a graph of the "Effective Bandwidth" results obtained by averaging
3 runs in each configuration, with each result calculated as the region
size (in Bytes/region) times the number of regions passed (in regions)
divided by the elapsed time (in seconds) to yield bandwidth (Bytes/sec).
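That definition can be written out as a tiny C helper (a sketch mirroring the
formula above, with made-up example inputs rather than measured results):

    #include <stdio.h>

    /* Effective bandwidth exactly as defined above:
     * (Bytes/region) * (regions passed) / (seconds) = Bytes/sec. */
    static double effective_bandwidth(double bytes_per_region,
                                      double regions_passed,
                                      double elapsed_seconds)
    {
        return bytes_per_region * regions_passed / elapsed_seconds;
    }

    int main(void)
    {
        /* Example inputs only: 10000-byte regions passed 100000 times
         * in 0.28 seconds works out to about 3.6e9 Bytes/sec. */
        printf("%.3g Bytes/sec\n", effective_bandwidth(10000, 100000, 0.28));
        return 0;
    }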
Note that the red and green lines at the top are linear without bound,
parallel, and very steep. This is because, on an SMP (or single processor),
the time to "pass" a region is a constant (i.e. the time to pass a pointer),
and completely independent of the region size. For the red line,
this fixed latency is about 2.8 microsecs, and for the green line it is
about twice that (because two of the three CCEs are effectively competing
for one of the two processors at all times). The (non-logarithmic)
slopes of those lines are equal to the inverses of those times, since bandwidth
(y) equals bytes (x) divided by this time. As a result, for regions
of 10000 bytes, the effective bandwidths on this platform are enormously
high: about 3.7 GB/sec for the red line and 1.8 GB/sec for the green, and they
would be even higher for larger regions
or faster platforms. You will not see performance like that from
PVM or MPI in any configuration, even working through shared memory, because
they must move data to communicate.
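To see where numbers like those come from: at a fixed pass time of about 2.8
microseconds, a 10000-byte region yields roughly 10000 Bytes / 0.0000028 sec, or
about 3.6 billion Bytes/sec (i.e. on the order of the red line's 3.7 GB/sec), and
doubling the pass time halves that to roughly 1.8 GB/sec.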
The magenta line at the bottom represents simple transfer over the ethernet.
BCR internally utilizes UDP/IP communications, and dynamically tunes network
parameters (such as time-outs, window-size, and acks) to conform to current
network and application needs. In this case, it delivers almost the
entire bandwidth of the 10BaseT ethernet to the application, with region
sizes of just 1000 bytes each. This is superior to PVM and MPI in
similar circumstances (e.g. see links below). The blue line is a
hybrid configuration, communicating through shared memory (by just passing
pointers) when possible, and communicating over the ethernet when necessary,
all completely invisible to the program. (Note that, in this case,
one traversal of the ring consists of two traversals of the ethernet plus
one through fixed-latency shared memory.) Use of such a hybrid
approach with other programming methodologies (e.g. MPI + OpenMP) would
require significant integration of underlying software tools and extreme
exposure of complexity to the programmer.
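The following is a purely illustrative C sketch of the kind of decision such a
hybrid transport implies (it is not BCR source, and every name in it is invented):
a region bound for a CCE on the same node is delivered by handing over a pointer,
while a region bound for a remote node actually has to be transmitted.

    #include <stdio.h>

    /* Illustrative only: invented names, not BCR internals. */
    enum { MY_NODE = 0 };

    struct cce {
        int   node;     /* which machine this CCE lives on */
        void *inbox;    /* where a locally delivered region pointer lands */
    };

    /* Stand-in for a network transfer: its cost grows with len,
     * unlike the pointer hand-off below. */
    static void send_over_network(int node, const void *region, unsigned long len)
    {
        (void)region;   /* the region's bytes would be transmitted here */
        printf("would transmit %lu bytes to node %d\n", len, node);
    }

    static void deliver_region(struct cce *dst, void *region, unsigned long len)
    {
        if (dst->node == MY_NODE)
            dst->inbox = region;                        /* shared memory: pass a pointer */
        else
            send_over_network(dst->node, region, len);  /* remote: the bytes must move */
    }

    int main(void)
    {
        char region[1000];
        struct cce local  = { MY_NODE, NULL };
        struct cce remote = { 1, NULL };

        deliver_region(&local,  region, sizeof region);   /* constant-cost hand-off */
        deliver_region(&remote, region, sizeof region);   /* would go over the wire */
        return 0;
    }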
"Ring" is a very simple program, and may not be a good predictor of
the performance you will see when using BCR. For example, no conversion
or translation is performed in this program while passing regions around
the ring, but in general, BCR allows those operations to be used sparingly,
especially when working among homogeneous processors like those in an SMP.
Some similar (though not identical) statistics for MPI and PVM can be
found at http://www.hpcf.upr.edu/~humberto/documents/mpi-vs-pvm/.
Please be aware of potentially different metrics (e.g. Mbits/sec vs. MBytes/sec)
when comparing. (Please let us know
of links to other pertinent performance comparisons.)
Copyright © 2002 elepar. All rights reserved.