MP_Lite
A lightweight message-passing library
High-performance - Portable - User-friendly
(NOTE: This project is no longer supported.
Comments may be directed to DrDaveTurner@gmail.com)



MP_Lite is a lightweight message-passing library designed to deliver the maximum performance to applications in a portable and user-friendly manner. It is a small package that compiles in a few seconds, making it ideal for running simple MPI applications. MP_Lite is being used as a research tool to investigate methods to improve the performance, capabilities, and ease of use of all message-passing systems.

How MP_Lite works

MP_Lite supports a subset of the basic MPI functions including blocking and asynchronous sends and receives, 1-sided get and put calls, and common global operations such as broadcast, synchronization, gather, accumulate, sum, min, and max. It does not support more advanced operations like the use of communicators other than MPI_COMM_WORLD, derived datatypes, and advanced I/O functions.

Using MP_Lite, MPI applications can run unmodified on top of TCP or VIA (GigaNet hardware or M-VIA on Gigabit Ethernet) on PC/workstation clusters, use the high-performance native SHMEM library on Cray T3E and SGI Origin systems, or take advantage of shared-memory segments to provide great performance on SMP machines. MP_Lite can also be compiled in 'scalar' mode, which cleanly removes all the message passing calls so the code can be run on a scalar machine. An InfiniBand module is under development that ties into the Mellanox VAPI layer.
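As a rough illustration of what a 'scalar' build amounts to, the sketch below replaces message-passing calls with single-process stand-ins. The names `comm_nprocs`, `comm_myproc`, `comm_send`, and `comm_global_sum` are hypothetical and chosen only for this example; they are not MP_Lite's API, and the library's own scalar mode may be implemented differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single-process stand-ins (not MP_Lite's real API). */
int comm_nprocs(void) { return 1; }   /* only one process exists   */
int comm_myproc(void) { return 0; }   /* and it is always rank 0   */

/* With one process there is nobody to send to, so a send does nothing. */
void comm_send(const void *buf, size_t nbytes, int dest)
{
    (void)buf; (void)nbytes; (void)dest;
}

/* A global sum over a single process leaves the local values unchanged. */
void comm_global_sum(double *values, int count)
{
    assert(values != NULL && count >= 0);
}

/* Application code is written once and runs unchanged in scalar mode. */
double total_energy(const double *local, int n)
{
    double e = 0.0;
    for (int i = 0; i < n; i++)
        e += local[i];
    comm_global_sum(&e, 1);   /* no-op here; a real reduction in parallel */
    return e;
}
```

The point of the design is that the application never needs its own `#ifdef` blocks: the choice between the parallel and the scalar behaviour is made when the library is built.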

MP_Lite provides convenient timing and debug functions that go beyond what most MPI implementations offer. Its mprun launch script is likewise designed to be more flexible and user-friendly.

MP_Lite Performance

Performance is achieved by keeping everything simple and clean. Blocking and asynchronous sends and receives avoid buffering at all costs. The TCP module uses a unique SIGIO interrupt-driven approach that delivers all the raw TCP performance to applications while maintaining message progress at all times. The VIA module bypasses the operating system to reduce the latency and streamline the data transfer. The SHMEM version uses one-sided gets and puts to move data directly from user space on one node to user space on the other, creating a send buffer only when necessary to avoid lockup conditions. The SMP module under development uses shared-memory segments for communications between processors within the same machine. Work is underway on a Linux kernel_copy.c module that will allow for a 1-copy SMP message-passing system.
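The general idea of SIGIO-driven progress can be sketched with standard POSIX/Linux calls: the socket is put into asynchronous, non-blocking mode, and a signal handler drains incoming data whenever the kernel reports activity. This is a minimal sketch, not MP_Lite's TCP module; the names `sigio_watch`, `on_sigio`, and `inbox` are assumptions made for the example, and it handles a single descriptor with a small fixed buffer.

```c
#define _GNU_SOURCE            /* exposes O_ASYNC on glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

static int watched_fd = -1;                 /* descriptor drained by the handler */
static char inbox[256];                     /* bytes received so far             */
static volatile sig_atomic_t inbox_len = 0; /* number of valid bytes in inbox    */

/* SIGIO handler: pull in whatever has arrived without blocking. */
static void on_sigio(int signo)
{
    int saved_errno = errno;                /* recv() may overwrite errno */
    (void)signo;
    for (;;) {
        size_t room = sizeof inbox - (size_t)inbox_len;
        ssize_t n = recv(watched_fd, inbox + inbox_len, room, 0);
        if (n <= 0)
            break;                          /* drained, buffer full, or error */
        inbox_len += (int)n;
    }
    errno = saved_errno;
}

/* Ask the kernel to raise SIGIO for this process when fd has activity. */
int sigio_watch(int fd)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigio;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGIO, &sa, NULL) < 0)
        return -1;

    watched_fd = fd;
    if (fcntl(fd, F_SETOWN, getpid()) < 0)  /* deliver SIGIO to this process */
        return -1;
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_ASYNC | O_NONBLOCK) < 0 ? -1 : 0;
}
```

Once `sigio_watch()` has been called, the main program can keep computing; data that arrives on the socket is moved into `inbox` by the handler, so message progress does not depend on the application calling a receive function.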

Native capabilities (bandwidth & latency) compared with MP_Lite and MPI, by system:

SMP, 1.7 GHz dual-Xeon
    Native (memcpy rate): 40,000 Mbps peak; 8000 Mbps large message
    MP_Lite: 2 µs latency; 9000 Mbps peak; 4000 Mbps large message
    MPI: 1-48 µs latency; 3000-7000 Mbps peak; 2500-3000 Mbps large message

Channel-Bonded Gigabit Ethernet
    Native (Linux kernel bonding.c): 2 GigE cards is worse than 1
    MP_Lite (sockets): 1875 Mbps with 2 cards; 2300 Mbps with 3 cards
    MPI: NA

Channel-Bonded Fast Ethernet
    Native (Linux kernel bonding only): 180 Mbps with 2 cards
    MP_Lite (M-VIA): 180 Mbps with 2 cards; 270 Mbps with 3 cards; 315 Mbps with 4 cards
    MPI (on Linux kernel bonding.c): ~180 Mbps with 2 cards

Cray T3E
    Native: 2740 Mbps; 2-3 µs
    MP_Lite: 2570 Mbps; 9 µs
    MPI: 2400 Mbps; 5 µs

Channel Bonding

The TCP module can stripe data across multiple network interface cards. Channel bonding two Fast Ethernet cards in a PC cluster doubles the performance for a negligible increase in price. Channel bonding at the Linux kernel level provides the same performance, but only for Linux systems. The M-VIA module has been used to channel bond up to 4 Fast Ethernet cards, providing a 3.5 times speedup due to the lower latency of M-VIA.
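Striping a message across several interfaces comes down to deciding which byte range travels over which channel. The sketch below shows one simple way to do that, assuming an even split into contiguous stripes; the names `stripe_t` and `stripe_plan` are hypothetical, and the actual TCP module may divide messages differently (for example by packet, or weighted by link speed).

```c
#include <assert.h>
#include <stddef.h>

/* One contiguous piece of a message assigned to one channel. */
typedef struct {
    size_t offset;   /* where the stripe starts within the message */
    size_t length;   /* how many bytes the channel carries         */
} stripe_t;

/* Split nbytes into nchannels contiguous stripes whose lengths
 * differ by at most one byte. out must hold nchannels entries. */
void stripe_plan(size_t nbytes, int nchannels, stripe_t *out)
{
    assert(nchannels > 0 && out != NULL);
    size_t base = nbytes / (size_t)nchannels;
    size_t extra = nbytes % (size_t)nchannels;   /* leftover bytes */
    size_t offset = 0;

    for (int c = 0; c < nchannels; c++) {
        /* The first 'extra' channels each take one leftover byte. */
        size_t len = base + ((size_t)c < extra ? 1 : 0);
        out[c].offset = offset;
        out[c].length = len;
        offset += len;
    }
}
```

A sender would then write stripe `c` to the socket bound to card `c`, and the receiver would read each stripe into the same offset of its destination buffer, so no reassembly copy is needed.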

Gigabit Ethernet provides even better performance at a low cost, so channel bonding multiple Fast Ethernet cards is no longer important. Channel bonding 2 Gigabit Ethernet cards through the MP_Lite sockets module provides an ideal doubling of the throughput (1875 Mbps). Adding a 3rd GigE card raises this only to 2300 Mbps, about 45% of the ideal gain from the extra card. Channel bonding multiple GigE cards using the M-VIA module has not been tried yet. Channel bonding 2 GigE cards using the Linux kernel bonding.c module produces poorer results than using just a single GigE card, so other MPI implementations do not currently take advantage of channel bonding.
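The bonding figures quoted here can be checked with a little arithmetic. The helpers below are a sketch with hypothetical names (`bonding_speedup`, `marginal_efficiency`). The 90 Mbps single-card Fast Ethernet rate used in the usage note is an assumption, inferred as half of the 180 Mbps measured with 2 cards; it is not a figure stated in the text.

```c
#include <assert.h>

/* Speedup of a bonded configuration over a single card. */
double bonding_speedup(double bonded_mbps, double single_card_mbps)
{
    assert(single_card_mbps > 0.0);
    return bonded_mbps / single_card_mbps;
}

/* Fraction of the ideal gain achieved by adding one more card.
 * The ideal gain is one card's share of the previous throughput. */
double marginal_efficiency(double prev_mbps, int prev_cards, double new_mbps)
{
    assert(prev_cards > 0 && prev_mbps > 0.0);
    double ideal_gain = prev_mbps / prev_cards;
    return (new_mbps - prev_mbps) / ideal_gain;
}
```

With the table's Gigabit Ethernet numbers, `marginal_efficiency(1875.0, 2, 2300.0)` gives 425 / 937.5, or about 0.45, matching the 45% figure. For Fast Ethernet over M-VIA, `bonding_speedup(315.0, 90.0)` gives 3.5 under the 90 Mbps assumption.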

User-Friendly Approach

MP_Lite is small, and compiling it takes under a minute, so it is easy to take to any system. There are several debugging and trace options that can also be compiled in to help track down message passing problems and tune performance.

The MP_Enter() and MP_Leave() functions provide an easy way to time critical sections of code. A call to MP_Time_Report() at the end dumps a breakdown of how much time was spent in each section, and it can be used to report MFlops rates or communication rates for each section if desired.
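To show how enter/leave style section timing can work in general, here is a small self-contained sketch. It is not MP_Lite's implementation, and the names `section_enter`, `section_leave`, and `section_report` are hypothetical stand-ins rather than the library's signatures. It assumes section names are string literals (the pointer is stored, not copied) and reports inclusive times, so a nested section's time also counts toward its parent.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define MAX_SECTIONS 32
#define MAX_DEPTH    16

typedef struct {
    const char *name;   /* section label (pointer is kept, not copied) */
    double seconds;     /* accumulated wall-clock time                 */
    double flops;       /* accumulated floating-point operation count  */
    long calls;         /* number of completed enter/leave pairs       */
} section_t;

static section_t sections[MAX_SECTIONS];
static int nsections = 0;
static int open_ids[MAX_DEPTH];       /* stack of currently open sections */
static double open_start[MAX_DEPTH];  /* start time of each open section  */
static int depth = 0;

static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);      /* C11 wall-clock time */
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Start timing a named section, creating its record on first use. */
void section_enter(const char *name)
{
    int id;
    for (id = 0; id < nsections; id++)
        if (strcmp(sections[id].name, name) == 0)
            break;
    if (id == nsections) {
        assert(nsections < MAX_SECTIONS);
        sections[nsections++].name = name;
    }
    assert(depth < MAX_DEPTH);
    open_ids[depth] = id;
    open_start[depth] = now_seconds();
    depth++;
}

/* Stop timing the most recently entered section and credit it
 * with the given number of floating-point operations. */
void section_leave(double flops)
{
    assert(depth > 0);
    depth--;
    section_t *s = &sections[open_ids[depth]];
    s->seconds += now_seconds() - open_start[depth];
    s->flops += flops;
    s->calls++;
}

/* Print one line per section with its time and MFlops rate. */
void section_report(FILE *out)
{
    for (int i = 0; i < nsections; i++) {
        const section_t *s = &sections[i];
        double mflops = s->seconds > 0.0 ? s->flops / s->seconds / 1e6 : 0.0;
        fprintf(out, "%-16s calls=%ld time=%.6f s rate=%.1f MFlops\n",
                s->name, s->calls, s->seconds, mflops);
    }
}
```

Typical use wraps each critical region in a `section_enter("solver")` / `section_leave(nflops)` pair and calls `section_report(stdout)` once at the end of the run.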

Most message-passing systems do not tell you what is wrong if a communication buffer overflows or a node is waiting for a message that never gets sent. MP_Lite operates with minimal buffering and warns of any potential problems. When possible, MP_Lite will dump warnings to a log file and eventually time out when a lock-up occurs.
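One common way to get this kind of warn-then-give-up behaviour is to wait on the descriptor with a timeout instead of blocking forever. The sketch below uses the POSIX `poll()` call for that. It is an illustration only: `recv_or_timeout`, `RECV_TIMED_OUT`, and the warning text are hypothetical, and MP_Lite's own lock-up detection may work differently.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

#define RECV_TIMED_OUT (-2)   /* hypothetical code: gave up waiting */

/* Wait for data on fd, logging a warning every warn_ms milliseconds
 * and giving up after max_warnings warnings. Returns the number of
 * bytes read, 0 on end-of-file, -1 on error, or RECV_TIMED_OUT. */
ssize_t recv_or_timeout(int fd, void *buf, size_t len,
                        int warn_ms, int max_warnings, FILE *log)
{
    assert(buf != NULL && warn_ms > 0 && max_warnings > 0);
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };

    for (int i = 1; i <= max_warnings; i++) {
        int rc = poll(&pfd, 1, warn_ms);
        if (rc > 0)
            return read(fd, buf, len);   /* data or hang-up is pending */
        if (rc < 0 && errno != EINTR)
            return -1;                   /* real error from poll()     */
        if (rc == 0 && log != NULL)
            fprintf(log, "warning: still waiting on fd %d after %d ms\n",
                    fd, i * warn_ms);
    }
    return RECV_TIMED_OUT;               /* likely lock-up: stop waiting */
}
```

A caller that gets `RECV_TIMED_OUT` can report which message it was waiting for and exit, rather than leaving every node hung with no indication of the cause.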

Current Status

MP_Lite has been tested on a variety of Unix machines from both C and Fortran programs, and most modules are pretty stable. The via.c module has not been used in a while, and the ib.c module is still being tuned. Performance under AIX is very poor due to the reliance on the SIGIO interrupt, which takes 50 ms to propagate under AIX. Version 2.7 is now distributed under a GPL license and fixes some warnings reported by the PGI compilers.

Download a copy, I dare you!

MP_Lite 2.7

Bugs or Comments?

Contact Dave Turner through email at DrDaveTurner@gmail.com.

MP_Lite Documentation

Future Work

MP_Lite is a research project in itself. Work is continuing on investigating better methods for doing SMP message passing, including lock-free methods that scale better than current semaphore-based mechanisms. For Linux systems, a kernel_copy.c module will allow for 1-copy message-passing on SMP systems. That work will progress toward allowing 0-copy exchanges for some messages.

The techniques learned from developing the InfiniBand module are being used to help the LAM MPI development team to build an InfiniBand module for their full MPI implementation.