Message passing in computer clusters

Technicians working on a cluster consisting of many computers working together by sending messages over a network

supercomputers in the world, rely on message passing to coordinate the activities of the many nodes they encompass.^[1]^[2] Message passing in computer clusters built with commodity servers and switches is used by virtually every internet service.^[1]

Recently, the use of computer clusters with more than one thousand nodes has been spreading. As the number of nodes in a cluster increases, the rapid growth in the complexity of the communication subsystem makes message passing delays over the
parallel programs.^[3]

Specific tools may be used to simulate, visualize and understand the performance of message passing on computer clusters. Before a large computer cluster is assembled, a
log files and simulates the performance of the messaging subsystem when many more messages are exchanged between a much larger number of nodes.^[4]^[5]

Messages and computations

Approaches to message passing

Historically, the two typical approaches to communication between cluster nodes have been PVM, the Parallel Virtual Machine and MPI, the Message Passing Interface.^[6] However, MPI has now emerged as the de facto standard for message passing on computer clusters.^[7]
PVM predates MPI and was developed at the
C, C++, or Fortran, etc.^[6]^[8]

Unlike PVM, which has a concrete implementation, MPI is a specification rather than a specific set of libraries. The specification emerged in the early 1990s out of discussions among 40 organizations, the initial effort having been supported by
TCP/IP and socket connections.^[6] MPI is now a widely available communications model that enables parallel programs to be written in languages such as C, Fortran, Python, etc.^[8] The MPI specification has been implemented in systems such as MPICH and Open MPI.^[8]^[9]

Testing, evaluation and optimization

Computer clusters use a number of strategies for dealing with the distribution of processing over multiple nodes and the resulting communication overhead. Some computer clusters such as
FeiTeng-1000 processors to enhance the operation of its proprietary message passing system, while computations are performed by Xeon and Nvidia Tesla processors.^[10]^[11]

One approach to reducing communication overhead is the use of local neighborhoods (also called locales) for specific tasks. Here computational tasks are assigned to specific "neighborhoods" in the cluster, to increase efficiency by using processors which are closer to each other.^[3] However, given that in many cases the actual topology of the computer cluster nodes and their interconnections may not be known to application developers, attempting to fine-tune performance at the application program level is quite difficult.^[3]

Given that MPI has now emerged as the de facto standard on computer clusters, the increase in the number of cluster nodes has resulted in continued research to improve the efficiency and scalability of MPI libraries. These efforts have included research to reduce the memory footprint of MPI libraries.^[7]

From the earliest days MPI provided facilities for performance profiling via the PMPI "profiling system".^[12] The use of the PMPI_ prefix allows for the observation of the entry and exit points for messages. However, given the high-level nature of this profile, this type of information only provides a glimpse of the real behavior of the communication system. The need for more information resulted in the development of the MPI-Peruse system. Peruse provides a more detailed profile by enabling applications to gain access to state changes within the MPI library. This is achieved by registering callbacks with Peruse, which are then invoked as triggers when message events take place.^[13] Peruse can work with the PARAVER visualization system. PARAVER has two components: a trace component and a visual component for analyzing the traces, the statistics related to specific events, etc.^[14] PARAVER may use trace formats from other systems, or perform its own tracing. It operates at the task level, thread level, and in a hybrid format. Traces often include so much information that they are overwhelming. Thus PARAVER summarizes them to allow users to visualize and analyze them.^[13]^[14]^[15]
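The difference between the two profiling styles described above can be illustrated with a small sketch. It is an analogy written in Python, not MPI code: `ToyMessagingLibrary`, its event names, `register_callback` and `with_entry_exit_log` are all hypothetical names invented for this illustration, and neither PMPI nor Peruse exposes this interface. The wrapper records only the entry and exit of a call, while the registered callbacks are told about state changes inside the library.

```python
import time
from collections import defaultdict


class ToyMessagingLibrary:
    """Hypothetical stand-in for a message passing library (not MPI)."""

    def __init__(self):
        self._callbacks = defaultdict(list)

    def register_callback(self, event, callback):
        """Callback style: ask to be notified of an internal state change."""
        self._callbacks[event].append(callback)

    def _fire(self, event, **info):
        for callback in self._callbacks[event]:
            callback(event, info)

    def send(self, dest, payload):
        # Internal state changes that are invisible from outside the call.
        self._fire("queued", dest=dest, size=len(payload))
        self._fire("transfer_started", dest=dest)
        self._fire("completed", dest=dest)
        return len(payload)


def with_entry_exit_log(fn, log):
    """Wrapper style: record only the entry and exit of a call."""
    def wrapper(*args, **kwargs):
        log.append((fn.__name__, "enter", time.perf_counter()))
        try:
            return fn(*args, **kwargs)
        finally:
            log.append((fn.__name__, "exit", time.perf_counter()))
    return wrapper


library = ToyMessagingLibrary()
boundary_log = []   # what an entry/exit profiler sees
internal_log = []   # what registered callbacks see

for event in ("queued", "transfer_started", "completed"):
    library.register_callback(event, lambda name, info: internal_log.append(name))

profiled_send = with_entry_exit_log(library.send, boundary_log)
profiled_send(dest=1, payload=b"hello")

print([kind for _, kind, _ in boundary_log])
print(internal_log)
```

In this toy run the wrapper log contains two records for the call, while the callback log contains one record per internal state change, which is the extra detail the callback approach is meant to provide.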

Performance analysis

When a large-scale, often supercomputer-level, parallel system is being developed, it is essential to be able to experiment with multiple configurations and simulate performance. There are a number of approaches to modeling message passing efficiency in this scenario, ranging from analytical models to trace-based simulation. Some approaches rely on test environments based on "artificial communications" to perform synthetic tests of message passing performance.^[3] Systems such as BIGSIM provide these facilities by allowing the simulation of performance on various node topologies, message passing and scheduling strategies.^[4]

Analytical approaches

At the analytical level, it is necessary to model the communication time T in terms of a set of subcomponents such as the startup
point to point communication, using T = L + (M / R) where M is the message size, L is the startup latency and R is the asymptotic bandwidth in MB/s.^[16]
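As a worked illustration of the formula T = L + (M / R), the following minimal sketch evaluates it for a few message sizes. The parameter values are made up for illustration and are not measurements of any real interconnect; the function names and the choice of seconds and MB as units are assumptions of this sketch.

```python
def hockney_time(message_mb, latency_s, bandwidth_mb_s):
    """Point-to-point time T = L + M / R, in seconds."""
    if bandwidth_mb_s <= 0:
        raise ValueError("bandwidth must be positive")
    return latency_s + message_mb / bandwidth_mb_s


def crossover_size_mb(latency_s, bandwidth_mb_s):
    """Message size at which the latency term L equals the transfer term M / R."""
    return latency_s * bandwidth_mb_s


# Made-up illustrative parameters, not measurements of any real interconnect.
LATENCY_S = 50e-6        # L: 50 microseconds
BANDWIDTH_MB_S = 1000.0  # R: 1000 MB/s

for size_mb in (0.001, 0.05, 1.0, 100.0):
    t = hockney_time(size_mb, LATENCY_S, BANDWIDTH_MB_S)
    print(f"M = {size_mb:>7} MB  ->  T = {t * 1e3:.3f} ms")
```

For messages much smaller than L × R the startup latency dominates T, and for much larger messages the M / R term dominates, which is why the two parameters are usually reported together.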

Xu and Hwang generalized Hockney's model to include the number of processors, so that both the latency and the asymptotic bandwidth are functions of the number of processors.^[16]^[17] Gunawan and Cai then generalized this further by introducing cache size, and separated the messages based on their sizes, obtaining two separate models, one for messages below cache size, and one for those above.^[16]
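The structure of these two generalizations can be sketched as follows. The text above does not give the functional forms used by the cited authors, so the sketch takes them as caller-supplied functions. The names `processor_aware_time` and `size_split_time`, the placeholder functions, and the treatment of a message exactly at the cache size are all assumptions made for illustration.

```python
def processor_aware_time(message_mb, n_procs, latency_of, bandwidth_of):
    """T = L(n) + M / R(n): latency and bandwidth depend on the processor count n."""
    return latency_of(n_procs) + message_mb / bandwidth_of(n_procs)


def size_split_time(message_mb, cache_mb, below_cache_model, above_cache_model):
    """Pick one of two separate models depending on message size versus cache size.

    Assumption of this sketch: a message exactly at the cache size uses the
    "above" model; the source text does not specify the boundary case.
    """
    if message_mb < cache_mb:
        return below_cache_model(message_mb)
    return above_cache_model(message_mb)


# Placeholder functional forms, invented purely so the sketch can run.
def example_latency(n_procs):
    return 10e-6 * n_procs


def example_bandwidth(n_procs):
    return 1000.0 / n_procs


print(processor_aware_time(2.0, 4, example_latency, example_bandwidth))
print(size_split_time(0.5, 1.0,
                      lambda m: processor_aware_time(m, 4, example_latency, example_bandwidth),
                      lambda m: processor_aware_time(m, 4, example_latency, lambda n: 100.0)))
```

Passing the models in as functions keeps the sketch independent of any particular fitted formula; fitted forms from measurements would replace the placeholders.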

Performance simulation

IBM Roadrunner
cluster supercomputer

Specific tools may be used to simulate and understand the performance of message passing on computer clusters. For instance, CLUSTERSIM uses a Java-based visual environment for
job scheduling algorithms to be proposed and experimented with. The communication overhead for MPI message passing can thus be simulated and better understood in the context of large-scale parallel job execution.^[18]

Other simulation tools include MPI-Sim and BIGSIM.^[19] MPI-Sim is an execution-driven simulator that requires C or C++ programs to operate.^[18]^[19] ClusterSim, on the other hand, uses a hybrid higher-level modeling system independent of the programming language used for program execution.^[18]

Unlike MPI-Sim, BIGSIM is a trace-driven system that simulates based on the logs of executions saved in files by a separate emulator program.^[5]^[19] BIGSIM includes an emulator and a simulator. The emulator executes applications on a small number of nodes and stores the results, so the simulator can use them and simulate activities on a much larger number of nodes.^[5] The emulator stores information about sequential execution blocks (SEBs) for multiple processors in log files, with each SEB recording the messages sent, their sources and destinations, dependencies, timings, etc. The simulator reads the log files and simulates them, and may start additional messages which are then also stored as SEBs.^[4]^[5] The simulator can thus provide a view of the performance of very large applications, based on the execution traces provided by the emulator on a much smaller number of nodes, before the entire machine is available or configured.^[5]
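The general idea of trace-driven simulation can be sketched with a toy replay loop. This is not BIGSIM's log format or algorithm: the `Block` record, its fields, the `simulate` function and the simple latency-plus-bandwidth network model are assumptions chosen for illustration, and the trace and network parameters are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Toy record loosely inspired by an SEB; not BIGSIM's actual log format."""
    block_id: str
    node: int
    compute_s: float
    # Each entry is (id of the block that sent a message, message size in MB).
    waits_for: list = field(default_factory=list)


def simulate(blocks, latency_s, bandwidth_mb_s):
    """Replay a trace under a given network model and return finish times.

    Assumes the trace lists every block after the blocks it waits for.
    """
    finish = {}
    node_free = {}
    for block in blocks:
        ready = node_free.get(block.node, 0.0)
        for sender_id, message_mb in block.waits_for:
            arrival = finish[sender_id] + latency_s + message_mb / bandwidth_mb_s
            ready = max(ready, arrival)
        finish[block.block_id] = ready + block.compute_s
        node_free[block.node] = finish[block.block_id]
    return finish


# Made-up trace: a on node 0, then b on node 1 after a's message, then c on node 0.
trace = [
    Block("a", node=0, compute_s=1.0),
    Block("b", node=1, compute_s=2.0, waits_for=[("a", 10.0)]),
    Block("c", node=0, compute_s=0.5, waits_for=[("b", 10.0)]),
]

fast = simulate(trace, latency_s=0.001, bandwidth_mb_s=1000.0)
slow = simulate(trace, latency_s=0.1, bandwidth_mb_s=100.0)
print("fast network:", fast)
print("slow network:", slow)
```

The same recorded trace is replayed twice with different network parameters, which mirrors the appeal of the trace-driven approach: the application is executed once, and the target machine is varied only in the simulator.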

See also

High-availability cluster

Performance simulation

Supercomputing

References

[1] ISBN 0123747503, page 641

[2] ISBN 0262692759, MIT Press, pages 7–9

[3] ISBN 3642244483, pages 160–162

[4] ISBN 1584889098, page 435

[5] ISBN 3642195946, pages 202–203

[6] Distributed services with OpenAFS: for enterprise and education by Franco Milicchio, Wolfgang Alexander Gehrke, 2007, pages 339–341

[7] ISBN 3642037690, page 231

[8] ISBN 8120334280, pages 109–112

[9] doi:10.1016/0167-8191(96)00024-5

[10] The TianHe-1A Supercomputer: Its Hardware and Software by Xue-Jun Yang, Xiang-Ke Liao, et al., Journal of Computer Science and Technology, Volume 26, Number 3, May 2011, pages 344–351. Archived from the original on 2011-06-21. Retrieved 2012-02-08.

[11] U.S. says China building 'entirely indigenous' supercomputer, by Patrick Thibodeau, Computerworld, November 4, 2010

[12] What is PMPI?

[13] ISBN 354039110X, page 347

[14] PARAVER: A Tool to Visualize and Analyze Parallel Code by Vincent Pillet et al., Proceedings of the conference on Transputer and Occam Developments, 1995, pages 17–31

[15] ISBN 3540401970, page 183

[16] ISBN 3540338098, pages 299–307

[17] ISBN 3540644431, page 935

[18] ISBN 0387240489, pages 59–63

[19] ISBN 3642233236, page 16

