Parallel Computation - using the VAMPIR tool on Data Star - Part II

Introduction

The speed of a message-passing parallel code depends on the performance of both the local hosts and the message-passing environment. Optimization of parallel code is usually carried out as an iterative process involving several tools to investigate performance issues. Many of the computational optimizations are no different from those needed for a serial code.

The main prerequisite for constructing a parallel algorithm is identifying a significant fraction of the computation that can be done by tasks that execute concurrently. These tasks should be able to proceed through a nontrivial piece of work without needing to exchange data with other tasks. In other words, the data used by a task for such a piece of the calculation must be locally available.

Sometimes this can be achieved by carrying out loops over different data structures in different tasks, but more commonly a single loop is spread over multiple tasks using a strategy called "domain decomposition", in which the data on which the calculation is to be performed are divided into compact groups and one group is assigned to each task. The latter is an example of a data-parallel strategy.
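
As a minimal sketch of this idea (not taken from the etch module; the array xloc and the loop body are placeholders), a 1-D block decomposition in Fortran with MPI might look like this:

c     Minimal sketch of a 1-D block ("domain") decomposition.
c     Each MPI task owns a contiguous block of the n iterations and
c     stores only the data for that block.
      program decomp
      implicit none
      include 'mpif.h'
      integer n
      parameter (n = 1000)
      integer ierr, rank, ntasks, ilo, ihi, nloc, i
c     Declared at the maximum size for simplicity; only the first
c     nloc entries are actually used on each task.
      double precision xloc(n)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, ntasks, ierr)

c     Task 'rank' handles global indices ilo..ihi.
      ilo = rank * n / ntasks + 1
      ihi = (rank + 1) * n / ntasks
      nloc = ihi - ilo + 1

c     The work on this block needs no communication at all.
      do i = 1, nloc
         xloc(i) = dble(ilo + i - 1) ** 2
      end do

      call MPI_FINALIZE(ierr)
      end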

Whatever the parallelization strategy, there is not usually a need to maintain a copy of the full problem data on each node. Ideally, most of the data stored on each node are used only in the computations that are performed there. This is what we mean by "locality of reference". If the quantity of data that must be shared with other nodes is not too large and does not have to be updated too frequently, the code is said to be "scalable" in the sense that a larger problem can be solved in the same time as a smaller one if the number of nodes is increased in such a way that the amount of data and computation on each node stays constant. Another advantage of storing only locally needed data on each node is better memory performance. If one is using only sections of a large array, it is very likely that unused data in the array will be drawn into the memory cache along with the elements that are to be used; this wastes cache space and results in more cache misses.
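
For example, in a 1-D decomposition each task may need just one boundary value from each neighbour per update. A hedged sketch of such a "halo" exchange (again illustrative, not code from etch):

c     Sketch of a boundary ("halo") exchange: each task communicates
c     a single cell with each neighbour, so the shared data stay
c     small while the bulk of the array remains local.
      program halo
      implicit none
      include 'mpif.h'
      integer nloc
      parameter (nloc = 100)
      integer ierr, rank, ntasks, left, right, i
      integer status(MPI_STATUS_SIZE)
c     u(0) and u(nloc+1) are ghost cells filled by the neighbours.
      double precision u(0:nloc+1)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, ntasks, ierr)

      left = rank - 1
      right = rank + 1
      if (left .lt. 0) left = MPI_PROC_NULL
      if (right .ge. ntasks) right = MPI_PROC_NULL

      do i = 0, nloc + 1
         u(i) = dble(rank)
      end do

c     Pass my last interior cell to the right while receiving the
c     left neighbour's into u(0), then the mirror-image exchange.
      call MPI_SENDRECV(u(nloc), 1, MPI_DOUBLE_PRECISION, right, 0,
     &     u(0), 1, MPI_DOUBLE_PRECISION, left, 0,
     &     MPI_COMM_WORLD, status, ierr)
      call MPI_SENDRECV(u(1), 1, MPI_DOUBLE_PRECISION, left, 1,
     &     u(nloc+1), 1, MPI_DOUBLE_PRECISION, right, 1,
     &     MPI_COMM_WORLD, status, ierr)

      call MPI_FINALIZE(ierr)
      end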

If little data needs to be replicated across multiple nodes, it is often possible to deal with problems that are larger than would fit on a single node. The number of nodes needed for solving some problems is determined less by a need for CPU speed than by their aggregate memory capacity. For example, a 2048 x 2048 x 2048 double-precision array occupies 64 GB, but spread over 32 nodes it requires only 2 GB of memory on each.

Load Balancing

If an algorithm is truly data parallel, load balancing may be trivial: each node takes essentially the same time to process its share of the data. However, this is not always the case. The parallelization strategy may involve having some nodes do different tasks from others, and then the performance of a parallel code is often determined by how well the issue of load balancing has been addressed.

Even though nodes may be doing different things, if their requirements are comparable, the load balance can still be good. If the program can be divided into many small tasks, it may be possible to use a flexible strategy in which a master parcels out work to its workers each time one completes a task. This strategy runs counter to the concept of data locality, so it is useful only for problems that can be divided into tasks that do not require much data.
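
A minimal sketch of such a master/worker scheme, assuming at least two MPI tasks (the job count njobs and the squaring "work" are placeholders, not part of the etch module):

c     Master/worker sketch: rank 0 answers each request with a job
c     number; a worker reports back each time it finishes, so faster
c     workers automatically receive more jobs.  Needs >= 2 tasks.
      program farm
      implicit none
      include 'mpif.h'
      integer njobs
      parameter (njobs = 100)
      integer ierr, rank, ntasks, i, w, task
      integer status(MPI_STATUS_SIZE)
      double precision res

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, ntasks, ierr)

      if (rank .eq. 0) then
c        Master: hand out a job per request, then a stop signal (-1)
c        once all njobs jobs have been assigned.
         do i = 1, njobs + ntasks - 1
            call MPI_RECV(res, 1, MPI_DOUBLE_PRECISION,
     &           MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, status, ierr)
            w = status(MPI_SOURCE)
            task = -1
            if (i .le. njobs) task = i
            call MPI_SEND(task, 1, MPI_INTEGER, w, 1,
     &           MPI_COMM_WORLD, ierr)
         end do
      else
c        Worker: report in (the first "result" is a dummy), then
c        keep asking for work until the stop signal arrives.
         res = 0.0d0
   10    call MPI_SEND(res, 1, MPI_DOUBLE_PRECISION, 0, 0,
     &        MPI_COMM_WORLD, ierr)
         call MPI_RECV(task, 1, MPI_INTEGER, 0, 1,
     &        MPI_COMM_WORLD, status, ierr)
         if (task .gt. 0) then
            res = dble(task) ** 2
            go to 10
         end if
      end if

      call MPI_FINALIZE(ierr)
      end
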
In many cases, however, the time to process equal amounts of local data can vary widely. It may then be necessary to periodically redistribute the data to equalize the load on the nodes.
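
One possible shape for such a rebalancing step, sketched here under assumptions rather than taken from etch (the routine name rebalance is hypothetical, and the actual movement of the data, e.g. with MPI_ALLTOALLV, is omitted):

c     Sketch of one rebalancing step: each task times its own work,
c     the timings are shared, and new block sizes are chosen in
c     proportion to each task's measured speed.
      subroutine rebalance(rank, ntasks, myn, twork, newn)
      implicit none
      include 'mpif.h'
      integer rank, ntasks, myn, newn, ierr, j, ntot
      double precision twork, total
      integer alln(ntasks)
      double precision speed(ntasks), tall(ntasks)

c     Everybody learns everybody's block size and elapsed work time.
      call MPI_ALLGATHER(myn, 1, MPI_INTEGER,
     &     alln, 1, MPI_INTEGER, MPI_COMM_WORLD, ierr)
      call MPI_ALLGATHER(twork, 1, MPI_DOUBLE_PRECISION,
     &     tall, 1, MPI_DOUBLE_PRECISION, MPI_COMM_WORLD, ierr)

c     Points processed per second is a task's effective speed.
      total = 0.0d0
      ntot = 0
      do j = 1, ntasks
         speed(j) = dble(alln(j)) / tall(j)
         total = total + speed(j)
         ntot = ntot + alln(j)
      end do

c     New share is proportional to speed.  A real code must adjust
c     the rounded counts so that they still sum to ntot.
      newn = nint(dble(ntot) * speed(rank+1) / total)
      end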

The Global Activity Chart display of the VAMPIR utility shows, individually for each process defined in the tracefile, statistics on the time spent in each activity. With the default pie-chart display you can recognise load imbalance in the traced program at a glance by comparing the time consumed by each activity (in our example, MPI) across all processes. VAMPIR can assist the user by visualising the trace data in different chart modes. Depending on the user's preference, the activity chart display can be switched to the so-called "Histogram" mode [right-click inside the activity chart, select "Mode" and then "Histogram"].

We will use the Global Activity Chart to analyse the load balance by running the etch module with 4 and 8 MPI tasks. The load-balancing information is presented in Table I.

Table I: Load Balancing Information

                    MPI tasks:      4        8       16
        Bottom points           12880    12880     6440
        Top points              30174    12840     6420
So, with 4 tasks the top points (30174) represent more than twice the work of the bottom points (12880), a significant load imbalance; with 8 tasks the two counts (12840 and 12880) are nearly equal.

VAMPIR Quick Operation

Log in to the Data Star from a Linux or UNIX environment using the secure-shell command:

        % ssh -lUserID dslogin.sdsc.edu 

Step 1: Transfer the files from the HPSS [High Performance Storage System] to the Data Star

To transfer the 'etch' modules and auxiliary files to the Data Star, you have to use 'pftp', an ftp-like
interface to HPSS. The commands below show how to use the 'pftp' utility:

        % pftp
        pftp> cd ..
        pftp> cd u4078
        pftp> get etch-ldvamp.tar
        pftp> quit
        %

Step 2: Compiling FORTRAN programs on the Data Star

In order to compile and run the etch module, you have to untar the file etch-ldvamp.tar:

        % tar xvf etch-ldvamp.tar
In the directory etch-ldvamp, the following files will be present:
        - Makefile
        - etch.f
        - etchh.f
        - etchn.f
        - input4
        - input8
In order to compile the etch module, change to the etch-ldvamp directory and run make:
        % cd etch-ldvamp
        % make  

Step 3: Running programs on the Data Star

To run interactively on the dspoe nodes, you must first log in as follows:

      % ssh -lUserID dspoe.sdsc.edu

To run the 'etch' module with the VAMPIR tool, set the paths to the VAMPIR tool (C-shell syntax, matching the setenv used in Step 4):

        % setenv PAL_ROOT /usr/local/apps/vamp
        % setenv PAL_LICENSEFILE /usr/local/apps/vamp/license.dat
        % set path = ($path /usr/local/apps/vamp/bin)
and then invoke the parallel program using the POE command-line flags:
      % poe etch -nodes 1 -tasks_per_node 4 -rmpool 1 -euilib ip -euidevice en0
The file "input4" must be in the directory in which you are running 'etch'. The program executes, generating a trace file called etch.bpv.
Rename etch.bpv to etch4.bpv:
      % mv etch.bpv etch4.bpv

To run etch with 8 MPI tasks, change -tasks_per_node 4 to -tasks_per_node 8 in the poe command above. Also modify etchh.f so that input8 is read instead of input4 and out8 is written instead of out4 (lines 47 and 48), then run make and poe again. Rename the new trace file to etch8.bpv:

      % mv etch.bpv etch8.bpv

Step 4: Starting the VAMPIR session

After the program has finished executing, start a VAMPIR session:

         Set the DISPLAY variable to the name of the machine you are
         logged in on [for example "linux9"]:
              % setenv DISPLAY linux9.engr.ucsb.edu:0.0
         Enter:   
              % vampir etch4.bpv
The VAMPIR Main window and the Global Timeline Display window open.
          Open the Global Activity Chart window.  
          Analyse the load balance.
          Repeat with etch8.bpv.

Step 5: Exit the VAMPIR session

When your analysis is complete, close the VAMPIR windows to end the session.