Notes & Examples - TRITON
Hardware Overview
Software Overview

Login to TRITON:

 ssh -l your_user_name tscc-login.sdsc.edu 

How to generate SSH keys:

 Windows: 
    http://kb.site5.com/shell-access-ssh/how-to-generate-ssh-keys-and-connect-to-your-account-with-putty/
    http://wiki.joyent.com/wiki/display/jpc2/Manually+Generating+Your+SSH+Key+in+Windows

 MAC OS X: 
    http://wiki.joyent.com/wiki/display/jpc2/Manually+Generating+your+SSH+Key+in+Mac+OS+X

After generating the SSH keys, please send the PUBLIC KEY to stefan@ucsb.edu
 

X Window System Server for Windows:

 http://sourceforge.net/projects/xming/

Xming is the leading X Window System Server for Microsoft Windows 8/7/Vista/XP.
It is fully featured, small and fast, simple to install, and, because it is standalone
native Microsoft Windows, easily made portable (not needing a machine-specific installation).

WinSCP:

 http://winscp.net/eng/index.php

WinSCP is an open source free SFTP client, FTP client, WebDAV client and SCP client 
for Windows. Its main function is file transfer between a local and a remote computer. 
Beyond this, WinSCP offers scripting and basic file manager functionality.

Accounting on TRITON:

   gbalance -u username  

File transfer to/from TRITON:

To copy the file "pi.c" from an ENGR machine to TRITON:
 
          scp pi.c your_SDSC_username@tscc-login.sdsc.edu:pi.c 

To copy the file pi.c from TRITON to the ENGR domain machines: 

          scp pi.c your_ENGR_username@linux.engr.ucsb.edu:pi.c

	

Modules

Here are some common module commands and their descriptions:

    module list - List the modules that are currently loaded
    module avail - List the modules that are available
    module display "module name" - Show the environment variables used by
                   "module name" and how they are affected
    module unload "module name" - Remove "module name" from the environment
    module load "module name" - Load "module name" into the environment
    module switch "module 1 name" "module 2 name" - Replace "module 1 name"
                 with "module 2 name" in the environment

Compiling

In general, the login node should be used only to edit and compile software and to submit jobs to the scheduler.
NEVER RUN A JOB ON THE LOGIN NODE. Jobs should be run only on Triton's compute nodes.

Serial program
 Compile your programs with pgcc, pgf77, and pgf90 (Portland Group
compilers), icc and ifort (Intel compilers), or gcc, g77, and gfortran
(GNU compilers).

      icc [options] file.c 	C
      ifort [options] file.f 	Fortran 

Example:
      % 
      % icc -o serial serial.c
      %
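
For reference, a minimal serial.c along the lines of the compile example above might look like
the following sketch (the actual file is not shown in these notes; the contents are illustrative only):

/* serial.c - minimal serial example (illustrative sketch only) */
#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    int i;
    for (i = 1; i <= 100; i++)
        sum += 1.0 / i;              /* partial harmonic sum H(100) */
    printf("H(100) = %f\n", sum);
    return 0;
}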

MPI program
  MPI source codes should be recompiled for the Triton system with the
following compiler commands:

      mpicc [options] file.c	C & C++ [myrinet/mx switch & Portland Compiler] 
      mpif77 [options] file.f	Fortran 77 [myrinet/mx switch & Portland Compiler]
      mpif90 [options] file.f90 Fortran 90 [myrinet/mx switch & Portland Compiler]
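
A minimal MPI program of the kind these wrappers compile might look like the sketch below
(the actual mpi_hello.c used later in these notes may differ):

/* mpi_hello.c - minimal MPI example (illustrative sketch only) */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}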

OPENMP program
  OPENMP source codes should be recompiled for the Triton system with the
following compiler commands:

      module purge
      module load intel
      module load openmpi_mx
      icpc -openmp -o execfile file.c 

  To run:
     export OMP_NUM_THREADS=8
     ./execfile
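
As a sketch, a minimal OpenMP program compiled and run this way could be the following
(the file name and messages are illustrative only); OMP_NUM_THREADS controls how many
threads the parallel region creates:

/* omp_hello.c - minimal OpenMP example (illustrative sketch only) */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* each thread prints its id and the team size */
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}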

OPENMP-HYBRID program
  Hybrid MPI+OpenMP source codes should be recompiled for the Triton system with the
following compiler commands:

      module purge
      module load intel
      module load openmpi_mx
      mpicc -openmp -o execfile file.c 
 
  To run:
     export OMP_NUM_THREADS=8
     mpirun -machinefile $PBS_NODEFILE -np 2 ./execfile
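
A minimal hybrid MPI+OpenMP program of this kind might look like the sketch below (names are
illustrative only); with the run line above, each of the 2 MPI ranks spawns OMP_NUM_THREADS=8 threads:

/* hybrid_hello.c - minimal MPI+OpenMP example (illustrative sketch only) */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        /* each OpenMP thread reports its MPI rank and thread id */
        printf("Rank %d of %d, thread %d of %d\n",
               rank, size, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}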

CILK
by Veronica Strnadova - Computer Science Department


The Intel Cilk Plus SDK, which provides the Cilk screen race detector and Cilk view 
scalability analyzer, can be downloaded from the Intel Cilk Plus SDK page.

Here is an example Cilk program:  simplecilkprogram.cpp 
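
The contents of simplecilkprogram.cpp are not reproduced here, but a Cilk Plus program that
would produce the output shown below (result=832040, which is fib(30)) could look roughly like
this sketch:

/* Illustrative sketch only; the actual simplecilkprogram.cpp may differ. */
#include <stdio.h>
#include <cilk/cilk.h>

long fib(int n)
{
    if (n < 2)
        return n;
    long x = cilk_spawn fib(n - 1);   /* compute fib(n-1) in parallel */
    long y = fib(n - 2);
    cilk_sync;                        /* wait for the spawned call */
    return x + y;
}

int main()
{
    long result = fib(30);            /* fib(30) = 832040 */
    printf("result=%ld\n", result);
    return 0;
}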

To compile, use icc, like this:
 icc simplecilkprogram.cpp -o simplecilkprogram 

Then, to run, type:
 ./simplecilkprogram 

You should see output that looks like this:
 result=832040

Now, to run the Cilk screen race detector, you just need to know where the cilkscreen 
executable is, and run it with "simplecilkprogram" as an argument. If cilkscreen is 
under: /opt/cilkutil/bin/, then we can run cilkscreen like this:
/opt/cilkutil/bin/cilkscreen simplecilkprogram 

We get the following output:

Cilkscreen Race Detector V2.0.0, Build 3229
result=832040
No errors found by Cilkscreen


Similarly, we can run the Cilk view scalability analyzer like this:
/opt/cilkutil/bin/cilkview simplecilkprogram 

And we get output that starts with:

Cilkview: Generating scalability data
Cilkview Scalability Analyzer V2.0.0, Build 3229
result=832040

The output goes on to report a "Parallelism Profile" and a "Speedup Estimate" for both the 
program as a whole and the "parallel region" of the program. 
   
Here is a link to a short summary of the cilkscreen and cilkview tools: Cilk Tools Tutorial

And here is a link to the Cilk++ SDK Programmer's Guide, although it is for Cilk++ and 
not Cilk Plus. I haven't been able to find an equivalent programmer's guide for 
Cilk Plus, but I'll keep looking: Intel Cilk++ Programmers Guide

Finally, I think this e-book is very helpful as an introduction to Cilk for anyone who wants 
to read it (from Prof. John Gilbert's web page):  CilkBook 

Running

When you have a job running, you are allocated the nodes
requested. At that time, a PBS prologue script runs that
allows you direct ssh access to your nodes.
At the conclusion of your job, that privilege is removed.

Interactive
You can use "qsub -I" to get exclusive access to a set of nodes, 
where you can perform interactive analyses. 
If you need one processor:  
  qsub -I -l walltime=00:10:00

Examples:

To run an interactive job with a wall clock limit of 30 minutes, 
using two nodes and two processors per node:

$ qsub -I -l walltime=00:30:00 -l nodes=2:ppn=2
qsub: waiting for job 75.tscc-mgr.local to start
qsub: job 75.tscc-mgr.local ready

$ echo $PBS_NODEFILE
/var/spool/torque/aux//1083840.tscc-mgr.local

Then you can use "more" or an editor such as "vi"
to see the information contained in the file. For example,
in this particular case, four processors were allocated as 
requested: two of them on node tscc-0-39 and the other
two on node tscc-0-36.

$ more /var/spool/torque/aux//1083840.tscc-mgr.local
tscc-0-39
tscc-0-39
tscc-0-36
tscc-0-36

To run a job:
$ mpirun -machinefile $PBS_NODEFILE -np 4 execfile

Batch

See: Running Batch Jobs
http://idi.ucsd.edu/computing/jobs/index.html 

Example: Script file for the HOTEL queue:

#!/bin/csh
#PBS -q hotel 
#PBS -N hello
#PBS -l nodes=1:ppn=4
#PBS -l walltime=0:05:00
#PBS -o hello-out
#PBS -e hello-err
#PBS -V
cd /home/u4078/cs140/compile-run 
mpirun -v -machinefile $PBS_NODEFILE -np 4 mpi_hello > h-out

Numerical Libraries & Performance Tools


NUMERICAL LIBRARIES


The Portland Group compilers come with the Optimized ACML library (LAPACK/BLAS/FFT).

ACML user guide is in the following location:
/opt/pgi/linux86-64/8.0-6/doc/acml.pdf

Example BLAS, LAPACK, FFT codes in:
/home/diag/examples/ACML

Compile and link as follows:
pgf90 dzfft_example.f -L/opt/pgi/linux86-64/8.0-6/lib -lacml
pgcc -L/opt/pgi/linux86-64/8.0-6/lib lapack_dgesdd.c -lacml -lm -lpgftnrtl -lrt
pgcc -L/opt/pgi/linux86-64/8.0-6/lib blas_cdotu.c -lacml -lm -lpgftnrtl -lrt


Intel has developed the Math Kernel Library (MKL), which contains many linear algebra, FFT and 
other useful numerical routines:

    * Basic linear algebra subprograms (BLAS) with additional sparse routines
    * Fast Fourier Transforms (FFT) in 1 and 2 dimensions, complex and real
    * The linear algebra package, LAPACK
    * A C interface to BLAS
    * Vector Math Library (VML)
    * Vector Statistical Library (VSL)
    * Multi-dimensional Discrete Fourier Transforms (DFTs)

To link the MKL libraries, please refer to the Intel MKL Link Line Advisor Web page. This tool 
accepts inputs for several variables based on your environment and automatically generates a 
link line for you. When using the output generated by this site, substitute the Triton path of 
the Intel MKL for the value $MKLPATH in the generated script. That value is ${MKL_ROOT}/lib/em64t.

Examples are in the following directory:
/home/diag/examples/MKL

LAPACK example using MKL

Compile as follows:
export MKLPATH=/opt/intel/Compiler/11.1/072/mkl
ifort dgebrdx.f -I$MKLPATH/include $MKLPATH/lib/em64t/libmkl_solver_lp64_sequential.a -Wl,--start-group $MKLPATH/lib/em64t/libmkl_intel_lp64.a $MKLPATH/lib/em64t/libmkl_sequential.a $MKLPATH/lib/em64t/libmkl_core.a -Wl,--end-group libaux_em64t_intel.a -lpthread

Output:
./a.out < dgebrdx.d

ScaLAPACK example using MKL

Sample test case (from the MKL examples) is in:
/home/diag/examples/scalapack

The makefile is set up to compile all the tests. Procedure:
module purge
module load intel
module load openmpi_mx
make libem64t compiler=intel mpi=openmpi LIBdir=/opt/intel/Compiler/11.1/072/mkl/lib/em64t

Sample link line (to illustrate how to link for ScaLAPACK):
/opt/openmpi/bin/mpicc -o mm_pblas mm_pblas.c -L/opt/intel/Compiler/11.1/072/mkl/lib/em64t /opt/intel/Compiler/11.1/072/mkl/lib/em64t/libmkl_scalapack_lp64.a /opt/intel/Compiler/11.1/072/mkl/lib/em64t/libmkl_blacs_openmpi_lp64.a -L/opt/intel/Compiler/11.1/072/mkl/lib/em64t /opt/intel/Compiler/11.1/072/mkl/lib/em64t/libmkl_intel_lp64.a -Wl,--start-group /opt/intel/Compiler/11.1/072/mkl/lib/em64t/libmkl_sequential.a /opt/intel/Compiler/11.1/072/mkl/lib/em64t/libmkl_core.a -Wl,--end-group -lpthread

mm_pblas.c
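
To illustrate calling MKL from C, here is a minimal sketch (not one of the examples in
/home/diag/examples/MKL; names are illustrative only) that multiplies two 2x2 matrices with
cblas_dgemm. Take the link flags from the Link Line Advisor as described above.

/* mkl_dgemm_example.c - illustrative sketch of MKL's CBLAS interface */
#include <stdio.h>
#include <mkl.h>

int main(void)
{
    double A[4] = {1.0, 2.0, 3.0, 4.0};   /* 2x2, row-major */
    double B[4] = {5.0, 6.0, 7.0, 8.0};
    double C[4] = {0.0, 0.0, 0.0, 0.0};

    /* C = 1.0 * A * B + 0.0 * C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

    printf("C = [ %g %g ; %g %g ]\n", C[0], C[1], C[2], C[3]);
    return 0;
}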

GPROF


GPROF is the GNU Project PROFiler.   

Requires recompilation of the code.

Compiler options and libraries provide wrappers for each routine call and periodic sampling of the program. 

A default gmon.out file is produced with the function call information.

GPROF links the symbol list in the executable with the data in gmon.out. 

Types of Profiles
Flat Profile
CPU time spent in each function (self and cumulative) 
Number of times a function is called
Useful to identify most expensive routines

Call Graph
Number of times a function was called by other functions
Number of times a function called other functions
Useful to identify function relations
Suggests places where function calls could be eliminated

Use the -pg flag during compilation:
% gcc  -g -pg ./srcFile.c
% icc  -g -p  ./srcFile.c
% pgcc -g -pg ./srcFile.c

Run the executable. An output file gmon.out will be generated with the profiling information.

Execute gprof and redirect the output to a file:
% gprof    ./exeFile gmon.out > profile.txt
% gprof -l ./exeFile gmon.out > profile_line.txt
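
As an illustrative sketch (the function names here are made up for the example), a small program
like the one below gives both profile types something to report: the flat profile shows most of
the time in busy_loop, and the call graph shows that driver calls it 100 times.

/* gprof_example.c - illustrative sketch for profiling with gprof */
#include <stdio.h>

static double busy_loop(long n)
{
    double s = 0.0;
    long i;
    for (i = 0; i < n; i++)
        s += (double)i * 0.5;
    return s;
}

static double driver(void)
{
    double total = 0.0;
    int k;
    for (k = 0; k < 100; k++)
        total += busy_loop(1000000L);   /* driver calls busy_loop 100 times */
    return total;
}

int main(void)
{
    printf("total = %f\n", driver());
    return 0;
}

Compiling it with one of the -pg lines above, running it, and then running gprof as shown
produces a profile dominated by busy_loop.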


IPM


IPM is a portable profiling infrastructure for parallel codes. It provides a 
low-overhead profile of the performance and resource utilization of a parallel 
program.

On TRITON the library is located in:
   /opt/ipm 

To run: 

module unload openmpi_ib
module load mvapich2_ib
module load ipm
module load papi

qsub -I -l walltime=00:30:00 -l nodes=1:ppn=2
mpicc mpi_hello.c -L$IPMHOME/lib -L$PAPIHOME/lib -lipm -lpapi
mpirun_rsh -np 4 -hostfile $PBS_NODEFILE ./a.out

FPMPI


FPMPI is a simple MPI profiling library. It is intended as a first step towards
understanding the nature of the communication patterns and potential bottlenecks
in existing applications.

Applications linked with FPMPI will generate an output file, fpmpi_profile.txt, when run.
This file contains:

    * description: A brief description of fpmpi_profile.txt format.
    * synchronization data: A listing of the synchronizing routines used and some related 
      profile data.
    * asynchronous communication data: A listing of the asynchronous communication routines 
      used and some related profile data.
    * topology data: A brief output of the communication topology.

On TRITON the library is located in:
   /opt/openmpi/intel/ib/lib   

To run, just relink with the library. For example:
   /opt/openmpi/intel/ib/bin/mpicc -o trap-fpmpi trap.c -L/opt/openmpi/intel/ib/lib -lfpmpi

   qsub -I -l walltime=00:20:00 -l nodes=1:ppn=4

   mpirun -machinefile $PBS_NODEFILE -np 4 trap-fpmpi

fpmpi_profile.txt
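
The trap.c used above is not reproduced in these notes. As an illustrative sketch (everything
here except the trap.c name is made up), an MPI trapezoidal-rule program of that kind, whose
single MPI_Reduce FPMPI would record, might look like this:

/* trap_sketch.c - illustrative MPI trapezoidal-rule example (not the trap.c above) */
#include <stdio.h>
#include <mpi.h>

static double f(double x) { return x * x; }   /* integrand */

int main(int argc, char *argv[])
{
    int rank, size;
    long n = 1000000;                  /* total number of trapezoids */
    double a = 0.0, b = 1.0;           /* integration interval */
    double h, local_a, local_sum, total;
    long local_n, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    h = (b - a) / n;
    local_n = n / size;                /* assumes n is divisible by size */
    local_a = a + rank * local_n * h;

    local_sum = (f(local_a) + f(local_a + local_n * h)) / 2.0;
    for (i = 1; i < local_n; i++)
        local_sum += f(local_a + i * h);
    local_sum *= h;

    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Integral of x^2 on [%g,%g] ~= %f\n", a, b, total);

    MPI_Finalize();
    return 0;
}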


TAU/PDT


TAU Performance System is a portable profiling and tracing toolkit for performance analysis 
of parallel programs written in Fortran, C, C++, Java, and Python. 
TAU's profile visualization tool, paraprof, provides graphical displays of all the performance 
analysis results, in aggregate and single node/context/thread forms. New users
may find this TAU workshop tutorial helpful; it also includes the following lab exercises.
 
TAU location: /opt/tau/
PAPI location: /opt/papi/

PAPI - Performance Application Programming Interface provides 
the tool designer and application engineer with a consistent interface and methodology for use of the 
performance counter hardware found in most major microprocessors. PAPI enables software engineers to 
see, in near real time, the relation between software performance and processor events.
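
As a small sketch of PAPI's low-level API (the event choice and the work loop are illustrative
only), the following counts the total instructions executed around a loop:

/* papi_sketch.c - illustrative use of PAPI's low-level counting API */
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int event_set = PAPI_NULL;
    long long count = 0;
    volatile double s = 0.0;
    int i;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI initialization failed\n");
        return 1;
    }
    PAPI_create_eventset(&event_set);
    PAPI_add_event(event_set, PAPI_TOT_INS);   /* total instructions executed */

    PAPI_start(event_set);
    for (i = 0; i < 1000000; i++)
        s += i * 0.5;                          /* work to be measured */
    PAPI_stop(event_set, &count);

    printf("Total instructions: %lld (s = %f)\n", count, (double)s);
    return 0;
}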

Load the TAU environment:
 module load tau
 module load papi

 export PATH=/opt/tau/intel/openmpi/x86_64/bin:$PATH
 export LD_LIBRARY_PATH=/opt/tau/intel/openmpi/x86_64/lib:$LD_LIBRARY_PATH

Select the appropriate TAU Makefile based on your choices. For example:
  /opt/tau/intel/openmpi_ib/x86_64/lib/Makefile.tau-icpc-mpi-pdt

So, we set it up:
% export TAU_MAKEFILE=/opt/tau/intel/openmpi_ib/x86_64/lib/Makefile.tau-icpc-mpi-pdt

And we compile using the wrapper provided by tau:
% tau_cc.sh trap.c
or, for Makefiles, edit the Makefile and replace mpif90/mpicc with tau_f90.sh/tau_cc.sh.

Run the job through the queue normally. We obtain the following profile files [on 4 processors]:
     profile.0.0.0, profile.1.0.0, profile.2.0.0 & profile.3.0.0  

Analyze performance data:

 pprof - for text based display - output of PPROF

 paraprof - for GUI

 jumpshot - for GUI - Using Jumpshot-4

GUI environment:
a. On PC systems [PuTTY]: select X11 forwarding.
   On Linux & Mac OS: ssh -X ...

b. On TRITON, 
  -  Connect to the compute nodes, with X forwarding:
       qsub -I -X -l walltime=00:20:00 -l nodes=1:ppn=4

  -  Go to the directory where the "profile.0.0.0, etc." are stored.

  - Set the TAU path:
    module load tau
    module load papi

    export PATH=/opt/tau/intel/openmpi_ib/x86_64/bin:$PATH
    export LD_LIBRARY_PATH=/opt/tau/intel/openmpi_ib/x86_64/lib:$LD_LIBRARY_PATH

Use 'paraprof', to analyze performance data:
        paraprof

To use the trace option:
   - after compiling, set the environmental variable:
       export TAU_TRACE=1

   - run the code:
        mpirun -machinefile $PBS_NODEFILE -np 4 ./execfile

   - execute the following commands:
        tau_treemerge.pl
        tau2slog2 tau.trc tau.edf -o app.slog2
   
   - run jumpshot:
        jumpshot app.slog2

HADOOP on Triton


Here is some more info on what the scripts do and the setup involved before the run:

#1 Setup

In persistent mode, myHadoop's configure script needs a base location to use, with subdirectories 
named by number. For example, in my test above I chose the following location:

/oasis/triton/scratch/diag/hadoop/data

and made the following 4 directories in this location:

[diag@tcc-3-43 data]$ mkdir 1 2 3 4
[diag@tcc-3-43 data]$ ls
1  2  3  4

#2 First run of the persistent setup (myhadoop_persistent_setup.cmd). This looks just like the 
normal example, except the configure.sh script is given the persistent option:

$MY_HADOOP_HOME/bin/configure.sh -n 4 -c $HADOOP_CONF_DIR -p -d /oasis/triton/scratch/diag/hadoop/data

(it points to the base location we created above)

In the example we copy in the .bashrc file, which we will look for in the second run to make sure 
we still have the data from the first Hadoop run. In my example the job ran on the following nodes:

  tcc-3-45 tcc-3-51 tcc-3-52 tcc-3-53

#3 The above job completed, and now we check whether we can spin up the same Hadoop cluster using a 
second job and a potentially different set of compute nodes (myhadoop_persistent_restart.cmd). The 
changes we make in the script:

(a) *Do not* format the HDFS; we have this line commented out. This enables us to keep the 
data from the previous run.
(b) After cluster startup, move DFS out of safe mode (to allow writes during the second run):

$HADOOP_HOME/bin/hadoop dfsadmin -safemode leave

Note that we are still running the configure script because the new compute nodes need to be in 
the configuration files for hadoop. In this example I list the contents of the test HDFS directory 
to verify that the old data is still there and then copy in another file. Sample output of dfs ls:

Found 1 items
-rw-r--r--   3 diag supergroup        878 2013-02-27 03:16 /user/diag/Test/.bashrc
Found 2 items
-rw-r--r--   3 diag supergroup      16450 2013-02-27 03:26 /user/diag/Test/.bash_history
-rw-r--r--   3 diag supergroup        878 2013-02-27 03:16 /user/diag/Test/.bashrc

(The first ls shows we still have the .bashrc from the first run; the second shows the newly 
copied .bash_history file.)

#4 If you check the directories in /oasis/triton/scratch/diag/hadoop/data you can see the data from 
HDFS in the numbered directories (1,2,3,4). For example:

$ ls /oasis/triton/scratch/diag/hadoop/data/2
dfs  mapred

The only note of caution is to restrict this to small job sizes (4 nodes is fine), as Lustre has 
metadata limitations that might show up if you try a large Hadoop job.



Notes & Examples - Stampede
Stampede User's Guide



Sample Programs

Serial
MPI
OPENMP
OPENMP-MPI Hybrid
CILK
CILKPLUS
CUDA
PAPI and TAU