MP_Lite
A lightweight message-passing library
High-performance - Portable - User-friendly
(NOTE: This project is no longer supported.
Comments may be directed to DrDaveTurner@gmail.com)
MP_Lite is a lightweight message-passing
library designed to deliver the maximum performance to applications
in a portable and user-friendly manner. It is a small package that
compiles in a few seconds, making it ideal for running simple MPI
applications. MP_Lite is being used as a research tool to investigate
methods to improve the performance, capabilities, and ease of use of
all message-passing systems.
How MP_Lite works
MP_Lite supports a subset of the basic MPI functions including
blocking and asynchronous sends and receives,
1-sided get and put calls, and common global operations
such as broadcast, synchronization, gather, accumulate, sum, min, and max.
It does not support more advanced features such as communicators
other than MPI_COMM_WORLD, derived datatypes, or advanced I/O functions.
Using MP_Lite, MPI applications can run unmodified on top of TCP or
VIA (GigaNet hardware or M-VIA on Gigabit Ethernet) on PC/workstation clusters,
use the high-performance native SHMEM library on Cray T3E
and SGI Origin systems, or
take advantage of shared-memory segments to provide great performance
on SMP machines.
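As an illustration of the supported subset, a minimal MPI program like the one below uses only basic calls (init, rank/size, blocking send/receive, and a global sum over MPI_COMM_WORLD). The code is standard MPI; that MP_Lite accepts it unmodified is an inference from the feature list above, and it needs an MPI-style runtime to actually run.

```c
/* ring.c - stays within the basic-MPI subset described above */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* pass a token around a ring with blocking send/recv */
    int token = 0;
    int next = (rank + 1) % nprocs, prev = (rank + nprocs - 1) % nprocs;
    if (rank == 0) {
        token = 42;
        MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
    }

    /* global sum - MPI_COMM_WORLD is the only communicator supported */
    int sum;
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) printf("token %d, rank sum %d\n", token, sum);

    MPI_Finalize();
    return 0;
}
```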
MP_Lite can also be compiled in 'scalar' mode, which cleanly removes all
the message passing calls so the code can be run on a scalar machine.
An InfiniBand module is under development that ties into
the Mellanox VAPI layer.
MP_Lite has convenient timing and debug functions that are
much better than what most MPI implementations provide. The mprun launch
script is designed to be much more flexible and user-friendly.
MP_Lite Performance
Performance is achieved by keeping everything simple and clean.
Blocking and asynchronous sends and receives avoid buffering whenever possible.
The TCP module uses a unique SIGIO interrupt driven approach that
delivers all the raw TCP performance to applications while
maintaining message progress at all times.
The VIA module bypasses the operating system to reduce the latency
and streamline the data transfer.
The SHMEM version uses one-sided gets and puts to move data
directly from user-space on one node to user-space on the other,
creating a send buffer only when necessary to avoid lockup conditions.
The SMP module under development uses shared-memory segments for
communications between processors within the same machine.
Work is underway on a Linux kernel_copy.c module that will allow for
a 1-copy SMP message passing system.
Native capabilities (bandwidth & latency) vs. MP_Lite vs. MPI:

SMP, 1.7 GHz dual-Xeon
  Native:  memcpy rate, 40,000 Mbps peak, 8000 Mbps large message
  MP_Lite: 2 µs latency, 9000 Mbps peak, 4000 Mbps large message
  MPI:     1-48 µs latency, 3000-7000 Mbps peak, 2500-3000 Mbps large message

Channel-bonded Gigabit Ethernet
  Native:  Linux kernel bonding.c (2 GigE cards perform worse than 1)
  MP_Lite: sockets, 1875 Mbps with 2 cards, 2300 Mbps with 3 cards
  MPI:     NA

Channel-bonded Fast Ethernet
  Native:  Linux kernel bonding only, 180 Mbps with 2 cards
  MP_Lite: M-VIA, 180 Mbps with 2 cards, 270 Mbps with 3 cards, 315 Mbps with 4 cards
  MPI:     on Linux kernel bonding.c, ~180 Mbps with 2 cards

Cray T3E
  Native:  2740 Mbps, 2-3 µs
  MP_Lite: 2570 Mbps, 9 µs
  MPI:     2400 Mbps, 5 µs
Channel Bonding
The TCP module can stripe data across multiple network interface cards.
Channel bonding two Fast Ethernet cards in a PC cluster
doubles the performance for a negligible increase in price.
Channel bonding at the Linux kernel level provides the same performance,
but only for Linux systems.
The M-VIA module has been used to channel bond up to 4 Fast Ethernet cards,
providing a 3.5 times speedup due to the lower latency of M-VIA.
Gigabit Ethernet provides even better performance at a low cost, so
channel bonding multiple Fast Ethernet cards is no longer important.
Channel bonding 2 Gigabit Ethernet cards provides an ideal doubling of
the throughput. Adding a 3rd GigE card improves the throughput
by only 45% of ideal.
Channel bonding multiple GigE cards using the M-VIA module has not been
tried yet.
Channel bonding 2 GigE cards using the Linux kernel bonding.c module
produces poorer results than using just a single GigE card, so
other MPI implementations do not currently take advantage of channel bonding.
User-Friendly Approach
MP_Lite is small, and compiling it takes under a minute, so it is easy
to take to any system.
There are several debugging and trace options that can also be compiled in
to help track down message passing problems and tune performance.
The MP_Enter() and MP_Leave() functions provide an easy way to time
critical sections of code.
A call to MP_Time_Report() at the end dumps a breakdown of how
much time was spent in each section, and it can be used to report
MFlops rates or communication rates for each section if desired.
Most message-passing systems do not
tell you what is wrong if a communication buffer overflows or a node is
waiting for a message that never gets sent. MP_Lite operates with
minimal buffering, and warns if there are any potential problems.
When possible, MP_Lite dumps warnings to a log file and
eventually times out when a lock-up occurs.
Current Status
MP_Lite has been tested on a variety of Unix machines from both C and Fortran
programs, and most modules are pretty stable. The via.c module has not been
used in a while, and the ib.c module is still being tuned.
Performance under AIX is very poor due to the reliance on the SIGIO interrupt,
which takes 50 ms to propagate under AIX. Version 2.7 is now distributed
under a GPL license, and fixes some warnings reported by the PGI compilers.
Download a copy, I dare you!
MP_Lite 2.7
Bugs or Comments?
Contact Dave Turner through email at
DrDaveTurner@gmail.com.
MP_Lite Documentation
Future Work
MP_Lite is a research project in itself.
Work is continuing on investigating better
methods for doing SMP message passing, including lock-free methods that
scale better than current semaphore-based mechanisms.
For Linux systems, a kernel_copy.c module will allow for 1-copy message
passing on SMP systems. That work will progress toward allowing 0-copy
exchanges for some messages.
The techniques learned from developing the InfiniBand module are
being used to help the LAM MPI development team to build an InfiniBand
module for their full MPI implementation.