PARADOX-IV User Guide

Introduction

PARADOX Cluster at the Scientific Computing Laboratory of Institute of Physics Belgrade consists of 106 compute nodes (2 x 8 core Sandy Bridge Xeon 2.6GHz processors with 32GB of RAM + NVIDIA® Tesla™ M2090) interconnected by the QDR InfiniBand network.

PARADOX

System configuration

Hardware description

PARADOX is an HP Proliant SL250s based cluster with the following components:

Compute nodes: HP Proliant SL250s
Processors Type: Intel® Xeon® Processor E5-2670 (Sandy Bridge, 8 Core, 20M Cache, 2.60 GHz)
Number of nodes: 106
Number of CPU cores: 1696
Number of GPUs: 106 NVIDIA® Tesla™ M2090 (5375MB of RAM, 512 CUDA cores at 1.3GHz, Compute capability 2.0)
RAM: 32 GB/node (4x8GB) DRR3 1600MHz
Network infrastructure: InfiniBand QDR

Operating system: The operating system on PARADOX cluster is Scientific Linux 6.4.

Filesystems

There is one Lustre file system on PARADOX which is mounted on /home. It is shared between worker nodes and used both for long term storage and cluster job submission. This file system has directories /home/<USERNAME> created for each user.

Additionally, there is another local file system available on each worker node: /scratch. This file system should be used only for temporary storage of running jobs (each compute node has 500GB local hard disk).

Logging in

The primary way of access to PARADOX-IV cluster is via secure shell (ssh). Users on Linux and MacOS already have a client accessible via system terminal. On Windows there are several options available, the most popular is Putty, and we also recommend SmarTTY which has integrated X11 forwarding and file upload/download interface.

Depending on network you are accessing from, there are two ways to connect to the cluster. If you are connecting from within the Institute of Physics Belgrade local network, the cluster head node is directly accessible on paradox.ipb.ac.rs.

Example: connecting to PARADOX-IV from the Institute local network:

$ ssh username@paradox.ipb.ac.rs

However, if you are connecting from outside network (internet access), you must connect to the Gateway machine first on gw.ipb.ac.rs, and selecting option #1 on the menu to connect to PARADOX-IV cluster.

Example: connecting to PARADOX-IV over the Internet:

$ ssh username@gw.ipb.ac.rs

The gateway will offer a menu with following choice:

Welcome to PARADOX Cluster. This is the gateway machine for remote 
PARADOX access.


Would you like to:

        1. Connect to PARADOX-IV cluster
        2. Connect to PARADOX-III cluster
        3. Connect to PARADOX Hadoop
        4. Continue working in gateway (this machine)

Please select one of the above (1-4):

After entering 1 and pressing enter you will be logged into paradox.ipb.ac.rs which is the PARADOX-IV head node.

Also, if you would like to use GUI applications, you can forward their interface to your machine by adding -X flag to your ssh command, i.e. ssh username@gw.ipb.ac.rs -X if you are connecting from outside IPB, or ssh username@paradox.ipb.ac.rs -X from local IPB network.

Copying files to and from the cluster

The easiest way to copy a file to or from PARADOX-IV cluster is to use the scp command on Unix systems, and on Windows we’ve already mentioned that SmartTTY has interface for file transfer, but you can also try WinSCP.

The scp command functions like copy (cp) command as it takes two arguments – source and destination. Either source or destination or both can refer to a local file or a file on a remote machine. Local file paths are specified in the standard manner, but the syntax for remote paths is: username@paradox:path/on/the/remote/host. The path on the remote host can be absolute (i.e. if it starts with /) or relative in which case it is relative to user’s home directory.

Example: Transfering a file from your pc to PARADOX-IV user’s home folder:

$ scp  my/path/to/file  username@paradox.ipb.ac.rs:

Note: If you want to transfer files from a machine outside the IPB’s local network, you should execute the scp command from the paradox head node like the following:

user@paradox:~$ scp user@my.outside.machine:/file/path path/on/paradox

If the remote machine is not running ssh server, then you will need to transfer the files first to gw.ipb.ac.rs and then from there to paradox.

Change password

After connecting to the gateway(gw.ipb.ac.rs) and choosing option #4 to stay on gateway, you can change your password by issuing command passwd.

Logging out

To logout from paradox.ipb.ac.rs or gw.ipb.ac.rs, use the Ctrl-d command, or type exit.

User environment and programming

Environment modules

To manage the use of multiple versions of libraries and tools, Paradox cluster uses Enviroment Modules^🔗.

Paradox cluster uses Environment Modules to set up user environment variables for various development, debugging and profiling scenarios. The modules are divided into applications, environment, compilers, libraries and tools categories. Available modules can be listed with following command:

$ module avail

The list of currently loaded modules can be brought up by tying:

$ module list

Each module can be loaded by executing:

$ module load module_name

Specific modules can be unloaded by calling:

$ module unload module_name

All modules can be unloaded by executing:

$ module purge

The following modules are available:

Applications: cp2k, Gromacs, Grace, NAMD, Quantum-Espresso, TRIQS
Environment: PCPE, Python 2.7, tcl
Compilers: Intel, GNU, PGI
Libraries: Boost, OpenBLAS, MKL, FFTW 2 & 3, NFFT, GSL, hdf5, netcdf, SPRNG
Parallel: OpenMPI, IntelMPI, CUDA (versions 5.5, 6.0, 6.5, 7.5)
Tools: cmake, Scalasca, GDB, cube, git, gnuplot, Intel Advisor, Intel clck, Intel Inspector, Intel Prallel Studio XE, pdtoolkit, scorep, TAU, TotalView, VMD, xcrysden.

Batch system

Batch system pools the various distributed resources of a cluster computer and presents them as a single coherent system which simplifies execution of user jobs.

Job submissions, resources allocations and the jobs launching over the cluster are managed by the batch system (torque+maui scheduler). From paradox.ipb.ac.rs jobs can be submitted using the command qsub.

To submit a batch job, you first have to write a shell script which contains:

A set of directives. These directives are lines beginning with #PBS which describe needed resources for your job,
Lines necessary to execute your code.

Then your job can be launched by submitting this script to batch system. The job will enter into a batch queue and, when resources are available, job will be launched over allocated nodes. Batch system provides monitoring of all submitted jobs.

Queue standard is available for user’s job submission.

Frequently used PBS commands for getting the status of the system, queues, or jobs are:

Command	Description
`qstat`	list information about queues and jobs
`qstat –q`	list all queues on system
`qstat –Q`	list queue limits for all queues
`qstat –a`	list all jobs on system
`qstat –au userID`	list all jobs owned by user userID
`qstat –s`	list all jobs with status comments
`qstat –r`	list all running jobs
`qstat –f jobID`	list all information known about specified job
`qstat –n`	in addition to the basic information, nodes allocated to a job are listed
`qstat –Qf <queue>`	list all information about specified queue
`qstat –B`	list summary information about the PBS server
`qdel jobID`	delete the batch job with jobID
`qalter`	alter a batch job
`qsub`	submit a job

Some common PBS directives are the following:

Directive	Description
`-l`	Specifies (and limits) the resources required to execute the job. If omitted, the PBS scheduler will use default values.
`-N`	Defines name for the job.
`-o`	Redirects the standard output of the job to the given path.
`-e`	Redirects the error output of the job to the given path.
`-p`	Defines the priority of the job, the value should be between -1024 and 1023.

The -l directive for specifying resource allocation can be given the following parameters:

Parameter	Description
`nodes`	Specifies the number of nodes required and their parameters. Nodes are specified in the format nodes=:ppn=.
`walltime`	Specifies the maximum wallclock time for the job in HH:MM:SS format.
`cput`	Specifies the maximum CPU time for the job in HH:MM:SS format.
`mem`	Specifies the maximum amount of RAM to be used by the job. Positive integer followed by b, kb, mb, gb for bytes, kilobytes megabytes and gigabytes respectively. If the suffix is omitted, bytes are assumed.

Some common environment variables:

Variable	Description
`PBS_JOBID`	A unique job identifier assigned by the system.
`PBS_JOBNAME`	The job name specified by the user.
`PBS_O_HOST`	Hostname upon which the `qsub` command was ran.
`PBS_O_WORKDIR`	Absolute path of the current working directory for the `qsub` command.

Sequential job submission

Here is a sample sequential job PBS script:

#!/bin/bash
#PBS -q standard
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00
#PBS -e ${PBS_JOBID}.err
#PBS -o ${PBS_JOBID}.out
cd $PBS_O_WORKDIR
chmod +x job.sh
./job.sh

#!/bin/bash - Specifies the shell to be used when executing the command portion of the script.
#PBS -q <queue> - Directs the job to the specified queue. Queue standard should be used.
#PBS -o <name> - Writes standard output to (in this case it is ${PBS_JOBID}.out) instead of .o$PBS_JOBID. $PBS_JOBID is an environment variable created by PBS that contains the PBS job identifier.
#PBS -e <name> - Writes standard error to (in this case it is ${PBS_JOBID}.err) instead of .e$PBS_JOBID.
#PBS -l walltime=<time> - Maximum wall-clock time. is in the format HH:MM:SS.
cd \$PBS_O_WORKDIR - Change to the initial working directory.
#PBS -l nodes=1:ppn=1 - Number of nodes (nodes) to be reserved for exclusive use by the job and number of virtual processors per node (ppn) requested for this job. For sequential job one CPU on a node will be sufficient. We would have the same effect if this line was left out from the PBS script.
job.sh contains a simple script to get date, host name and current directory on the node that executes it.

Example job.sh script:

#!/bin/bash
date
hostname
pwd
sleep 10

This job can be submitted by issuing following command:

$ qsub job.pbs

The qsub command will return а result of the type:

<JOB_ID>.paradox.ipb.ac.rs

Where <ЈОВ_ID> is а unique integer used to identify the given job. To check the status of your job use the following command:

$ qstat <ЈОВ_ID>

This will return an output similar to:

Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
<JOB_ID>.paradox             job.pbs         <username>    16:30:58 R standard

Alternatively you can check the status of all your jobs using the following syntax of the qstat command:

$ qstat -u <user_name>

To get detailed information about your job use the following command:

$ qstat -f <JOB_ID>

When your job is finished, files to which standard output and standard error of a job was redirected will appear in your work directory. If, for some reason, you want to cancel a job following command should be executed:

$ qdel <JOB_ID>

If qstat <ЈОВ_ID> returns the following line:

qstat: Unknown Job Id <JOB_ID>.paradox

This most likely means that your job has finished.

MPI Job submission

Here is an example of an MPI job submission script:

#!/bin/bash
#PBS -q standard
#PBS -l nodes=2:ppn=16
#PBS -l walltime=10:00:00
#PBS -e ${PBS_JOBID}.err
#PBS -o ${PBS_JOBID}.out
cd $PBS_O_WORKDIR
chmod +x prog
module load openmpi/1.8.2
mpirun ./prog

MPI launcher, together with the batch system will take care of proper launching of a parallel job, i.e. no need to specify number of MPI instances to be launched or machine file in the command line. All these information launcher will obtain from the batch system. All stated PBS directives are same as for the sequential job except the resource allocation line which is, in this case:

#PBS -l nodes=2:ppn=16

In this statement we are demanding 2 nodes with 16 cores each (2 full nodes, as PARADOX worker nodes are 16 cores machines), all together 32 MPI instances.

Job can be submitted by issuing following command:

$ qsub job.pbs

By using the qstat command we can view the resources allocated for our parallel job.

$ qstat -n <JOB_ID>

paradox.ipb.ac.rs:
                                                                                  Req'd    Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
JOB_ID.paradox.ipb.ac.   username   standard job.pbs            4397     1     16    --  256000:00 R  16:43:29
Gn007+Gn007+Gn007+Gn007+Gn007+Gn007+Gn007+Gn007+Gn007+Gn007+Gn007+Gn007+Gn007+Gn007+Gn007+Gn007

Job monitoring and canceling is no different than for sequential job and it is already described in sequential job submission section.

OpenMP job submission

Here is an example of an openMP job submission script:

#!/bin/bash
#PBS -q standard
#PBS -l nodes=1:ppn=16
#PBS -l walltime=00:10:00
#PBS -e ${PBS_JOBID}.err
#PBS -o ${PBS_JOBID}.out
cd $PBS_O_WORKDIR
chmod +x prog
export OMP_NUM_THREADS=16
./prog

Executable job is compiled with OpenMP (see 4.3.2. Compiling OpenMP programs section). For the execution of the OpenMP jobs you shouldn’t use more than one node, as it is specified in PBS script:

#PBS -l nodes=1:ppn=16

OpenMP is a shared memory parallel computational library and as such the processes cannot be forked among various machines. Thus, an OpenMP job on the PARADOX Cluster can at most consume 16 CPUs in parallel since the largest SMP on the cluster has 16 cores. OMP_NUM_THREADS environment variable should be specified, especially in the case when user is not allocating whole node for its job (not using ppn=16). If this is the case and the number of threads is not specified in program, OpenMP executable will use 16 threads (worker nodes at PARADOX have 16 CPU cores) and potentially compete for CPU time with the other jobs running at the same node. Job submitting, monitoring and canceling is the same as for other previously described types of jobs.

If the job binary was compiled with intel compiler, than the appropriate intel modules should be loaded, e.g:

module load intel/14.0.1

Hybrid job submission

The following example shows a typical hybrid job submission script:

#!/bin/bash
#PBS -q standard
#PBS -l nodes=4:ppn=16
#PBS -l walltime=10:00:00
#PBS -e ${PBS_JOBID}.err
#PBS -o ${PBS_JOBID}.out
module load openmpi/1.8.2
export OMP_NUM_THREADS=16

cd $PBS_O_WORKDIR
chmod +x prog

mpirun -np 4 -npernode 1 --bind-to none ./prog

It is important to take note of the third line in this script which specifies how the processes should be laid out across compute nodes. It is important to adjust the ppn parameter in accordance with OMP_NUM_THREADS environment setting to avoid oversubscribing a node. The mpirun parameter npernode also controls the actual number of processes that will be assigned to a node, because the PBS line only serves for the resource allocation.

Another detail is the –bind-to none parameter which is needed for OpenMPI versions from 1.8 on, which allows threads to spread across the cores. For more information about process and thread mapping, please see mpirun documentation and the –map-by parameter.

CUDA single node job submission

CUDA jobs are very similar to previous types, except that cuda module should be loaded and since login node has no GPU, these programs can only run on compute nodes. Here is an example of an CUDA job submission script:

#!/bin/bash
#PBS -q standard
#PBS -l nodes=1:ppn=1
#PBS -l walltime=10:00:00
#PBS -e ${PBS_JOBID}.err
#PBS -o ${PBS_JOBID}.out
cd $PBS_O_WORKDIR
chmod +x prog
module load cuda/7.5
./prog

CUDA MPI job submission

#!/bin/bash
#PBS -q standard
#PBS -l nodes=4:ppn=1
#PBS -l walltime=10:00:00
#PBS -e ${PBS_JOBID}.err
#PBS -o ${PBS_JOBID}.out
cd $PBS_O_WORKDIR
chmod +x prog
module load openmpi/1.8.2
module load cuda/7.5
mpirun ./prog

Compiling

Compiler flags

C/C++

Intel compilers: icc and icpc. Compilation options are the same, except for the the C language behavior. icpc manages all the source files as C++ files whereas icc makes a difference between both of them. Basic flags:

-o exe_file : names the executable exe_file
-c : generates the correspondent object file. Does not create an executable.
-g : compiles in a debugging mode
-I dir_name : specifies the path where include files are located.
-L dir_name : specifies the path where libraries are located.
-l<lib_name> : asks to link against the lib library

Optimizations:

-O0, -O1, -O2, -O3 : optimization levels - default is -O2
-opt_report: generates a report which describes the optimization in stderr (-O3 required)
-ip, -ipo: inter-procedural optimizations (mono and multi files)
-fast: default high optimization level (-O3 -ipo -static).
-ftz: considers all the denormalized numbers (like INF or NAN) as zeros at runtime.
-fp-relaxed: mathematical optimization functions. Leads to a small loss of accuracy.

Preprocessor:

-E: preprocess the files and sends the result to the standard output
-P: preprocess the files and sends the result in file.i
-Dname=<value>: defines the “name” variable
-M: creates a list of dependence

Practical:

-p: profiling with gprof (needed at the compilation)
-mp, -mp1: IEEE arithmetic, mp1 is a compromise between time and accuracy

Fortran

Intel compiler: ifort (Fortran compiler).

Basic flags :

-o exe_file : names the executable exe_file
-c : generates the correspondent object file does not create an executable.
-g : compiles in debugging mode - R.E. ‘Debugging’
-I dir_name : specifies the path where include files are located
-L dir_name : specifies the path where libraries are located
-l<libname> : asks to link against the lib library

Optimizations:

-O0, -O1, -O2, -O3 : optimization levels - default : -O2
-opt_report : generates a report which describes the optimization in stderr (-O3 required)
-ip, -ipo : inter-procedural optimizations (mono and multi files)
-fast : default high optimization level (-O3 -ipo -static).
-ftz : considers all the INF and NAN numbers as zeros
-fp-relaxed : mathematical optimization functions. Leads to a small loss of accuracy
-align all: fills the memory up to get a natural alignment of the data
-pad: makes the modification of the memory positions operational

Run-time check:

-C or -check : generates a code which ends up in ‘run time error’ (ex : segmentation fault)

Preprocessor:

-E: preprocess the files and sends the result to the standard output
-P: preprocess the files and sends the result in file.i
-Dname=<value>: defines the “name” variable
-M: creates a list of dependences
-fpp: preprocess the files and compiles

Practical:

-p : profiling with gprof (needed at the compilation)
-mp, -mp1 : IEEE arithmetic, mpl is a compromise between time and accuracy
-i8 : promotes integers on 64 bytes by default
-r8 : promotes reals on 64 bytes by default
-module <dir> : send/read the files *.mod in the dir directory
-fp-model strict : Tells the compiler to strictly adhere to value-safe optimizations when implementing floating-point calculations and enables floating-point exception semantics. It might slow down your program.

Please refer to the ‘man pages’ of the compilers for more information.

Compiling MPI programs

Here is an example of a MPI program:

#include <stdio.h>
#include <mpi.h>
main(int argc, char **argv)
{
    int num_procs, my_id;
    int len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);

    // find out process ID, and how many rocesses were started.
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
    MPI_Get_processor_name(name, &len);

    printf("Hello, world.I'm process %d of %d on s\n", my_id,
           num_procs, name);

    MPI_Finalize();
}

MPI implementations are using mpicc, mpic++, mpif77 and mpif90 wrappers for compiling and linking MPI programs:

$ mpicc -o test test.c

Compiling OpenMP programs

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
    int nthreads, tid;
    // Fork a team of threads giving them their own copies of
    // variables
    #pragma omp parallel private(nthreads, tid)
    {
        // Obtain thread number
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);
        // Only master thread does this
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    }   // All threads join master thread and disband
}

Intel compilers flag: -openmp

$ icc -openmp -o prog prog.c

It is recommended to compile OpenMP programs with static linking of intel libraries at paradox.ipb.ac.rs machine before submission:

$ icc -openmp -static-intel -o prog prog.c

GNU compilers flag: -fopenmp

$ gcc -fopenmp -o prog prog.c

Compiling hybrid programs

Hybrid programs are supported by combination of any of the installed mpi libraries and compiler suites. The appropriate openmp flag should be passed to the compiler of choice, as described in the previous section. The following is the source code for a simple mpi-openmp hybrid application which just prints out process and thread ids:

#include <omp.h>            // OpenMP Library
#include <stdio.h>          // printf()
#include <stdlib.h>         // EXIT_SUCCESS
int main (int argc, char *argv[]) {
    system("hostname");
    // Parameters of OpenMP.
    int O_P;                                 // number of OpenMP processors
    int O_T;                                 // number of OpenMP threads
    int O_ID;                                // OpenMP thread ID
    // Get a few OpenMP parameters.
    O_P  = omp_get_num_procs();              // get number of OpenMP processors
    O_T  = omp_get_num_threads();            // get number of OpenMP threads
    O_ID = omp_get_thread_num();             // get OpenMP thread ID
    printf("O_ID:%d  O_P:%d  O_T:%d\n", O_ID,O_P,O_T);
    // PARALLEL REGION
    // Thread IDs range from 0 through omp_get_num_threads()-1.
    // We execute identical code in all threads (data parallelization).
    #pragma omp parallel private(O_T,O_ID)
    {
        O_T  = omp_get_num_threads();            // get number of OpenMP threads
        O_ID = omp_get_thread_num();             // get OpenMP thread ID
        printf("parallel region:           O_ID=%d  O_T=%d\n", O_ID,O_T);
    }
    // Exit master thread.
    printf("O_ID:%d   Exits\n", O_ID);
    return EXIT_SUCCESS;
}

To compile the hybrid code, the following lines should be executed:

$ module load gnu                                # or intel
$ module load openmpi
$ mpicc -fopenmp hybrid_example.c -o prog        #or -openmp (for intel)

Compiling CUDA programs

CUDA 6.5 and 7.5 toolkits are available, however, the login node does not have a GPU so the compiled binary is runnable only on compute nodes.

The following is a hello world cuda program:


// This is the REAL "hello world" for CUDA!
// It takes the string "Hello ", prints it, then passes it
// to CUDA with an array
// of offsets. Then the offsets are added in parallel to produce
// the string "World!"
// By Ingemar Ragnemalm 2010

#include <stdio.h>

const int N = 16;
const int blocksize = 16;

__global__
void hello(char *a, int *b)
{
    a[threadIdx.x] += b[threadIdx.x];
}

int main()
{
    char a[N] = "Hello \0\0\0\0\0\0";
    int b[N] = {15, 10, 6, 0, -11, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
    char *ad;
    int *bd;
    const int csize = N*sizeof(char);
    const int isize = N*sizeof(int);

    printf("%s", a);

    cudaMalloc( (void**)&ad, csize );
    cudaMalloc( (void**)&bd, isize );
    cudaMemcpy( ad, a, csize, cudaMemcpyHostToDevice );
    cudaMemcpy( bd, b, isize, cudaMemcpyHostToDevice );

    dim3 dimBlock( blocksize, 1 );
    dim3 dimGrid( 1, 1 );
    hello<<<dimGrid, dimBlock>>>(ad, bd);
    cudaMemcpy( a, ad, csize, cudaMemcpyDeviceToHost );
    cudaFree( ad );
    cudaFree( bd );

    printf("%s\n", a);
    return EXIT_SUCCESS;
}

The code can be compiled with following commands:

$ module load cuda/7.5
$ nvcc hello.cu -o hello

Compiling CUDA-MPI programs

As a tutorial we will go through building and launching of a simple application that multiplies to randomly generated vectors of numbers in parallel, using mpi and cuda. This is the source for the cuda kernel, which has been saved into multiply.cu file:

#include <cuda.h>
#include <cuda_runtime.h>
#include <math.h>

__global__
void kmultiply(const float* a, float* b, int n) {
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    if (i < n)
        b[i] *= a[i];
}

extern "C"
void launch_multiply(const float* a, float* b, int n) {
    float* dA;
    float* dB;
    int cerr;

    cerr = cudaMalloc((void**)&dA, n*sizeof(float));
    cerr = cudaMalloc((void**)&dB, n*sizeof(float));
    cerr = cudaMemcpy(dA, a, n*sizeof(float), cudaMemcpyHostToDevice);
    cerr = cudaMemcpy(dB, b, n*sizeof(float), cudaMemcpyHostToDevice);

    kmultiply<<<ceil((float)n/256), 256>>>(dA, dB, n);

    cerr = cudaThreadSynchronize();

    cerr = cudaMemcpy(b, dB, n*sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
}

The main program, saved as main.c contains the following source code:

#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>
#include <math.h>

void launch_multiply(const float* a, float* b, int n);

int main(int argc, char** argv) {
    int rank, nprocs;
    int n = 1000000;
    int chunk;
    float *A, *B;
    float *pA, *pB;

    int num_procs, my_id;
    int len;

    char name[MPI_MAX_PROCESSOR_NAME];

    if (argc > 1) {
        n = atoi(argv[1]);
    }

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    chunk = ceil((float)n/nprocs);

    A = (float*) malloc(n*sizeof(float));
    B = (float*) malloc(n*sizeof(float));
    pA = (float*) malloc(chunk*sizeof(float));
    pB = (float*) malloc(chunk*sizeof(float));

    MPI_Get_processor_name(name, &len);

    printf("process %d of %d on %s\n", rank, nprocs, name);

    if (rank == 0) {
        //prepare arrays...
        for (int i = 0; i < n; i++) {
            A[i] = ((float)rand()/RAND_MAX);
            B[i] = ((float)rand()/RAND_MAX);
        }
    }

    MPI_Scatter(A, chunk, MPI_FLOAT, pA, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Scatter(B, chunk, MPI_FLOAT, pB, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

    launch_multiply(pA, pB, chunk);

    MPI_Gather(pB, chunk, MPI_FLOAT, B, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

    free(A);
    free(B);
    free(pA);
    free(pB);

    MPI_Finalize();

    return 0;
}

For the compilation of the tutorial code, the following should be executed:

$ module load cuda/7.5
$ module load openmpi/1.8.2
$ nvcc -c multiply.cu -o multiply.o
$ mpicc -std=c99 -o prog main.c multiply.o -L/usr/local/cuda-7.5/lib64 -lcudart -lstdc++

The job submission script has the same layout as the script given in section “CUDA MPI job submission”, with one difference in the last line that launches the program:

#!/bin/bash
#PBS -q standard
#PBS -l nodes=4:ppn=1
#PBS -l walltime=10:00:00
#PBS -e ${PBS_JOBID}.err
#PBS -o ${PBS_JOBID}.out
cd $PBS_O_WORKDIR
chmod +x prog
module load openmpi/1.8.2
module load cuda/7.5
mpirun ./prog 1000000

The number is passed as argument to the program which sets the required level of precision which can be used to make the execution time longer or shorter.

PGI Compilation example

PGI compilers and libraries are available on Paradox in modules named pgi and pgi64, the later being preferred one. Along with standard C/C++ and Fortran compilation (pgcc, pgcpp, pgfortran), PGI compilers support accelerator card programming in CUDA for C/C++ and Fortran, and also support OpenACC directives. The following example was taken from Nvidia’s Parallel Forall blog and it demonstrates use of OpenACC.

/*
*  Copyright 2012 NVIDIA Corporation
*
*  Licensed under the Apache License, Version 2.0 (the "License");
*  you may not use this file except in compliance with the License.
*  You may obtain a copy of the License at
*
*      http://www.apache.org/licenses/LICENSE-2.0
*
*  Unless required by applicable law or agreed to in writing, software
*  distributed under the License is distributed on an "AS IS" BASIS,
*  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or  implied.
*  See the License for the specific language governing permissions and
*  limitations under the License.
*/

#include <math.h>
#include <string.h>
#include "timer.h"

int main(int argc, char** argv) {
    int n = 4096;
    int m = 4096;
    int iter_max = 1000;

    const float pi  = 2.0f * asinf(1.0f);
    const float tol = 1.0e-5f;
    float error     = 1.0f;

    float A[n][m];
    float Anew[n][m];
    float y0[n];

    memset(A, 0, n * m * sizeof(float));

    // set boundary conditions
    for (int i = 0; i < m; i++) {
        A[0][i]   = 0.f;
        A[n-1][i] = 0.f;
    }

    for (int j = 0; j < n; j++) {
        y0[j] = sinf(pi * j / (n-1));
        A[j][0] = y0[j];
        A[j][m-1] = y0[j]*expf(-pi);
    }

    printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);

    StartTimer();
    int iter = 0;

 #pragma omp parallel for shared(Anew)
    for (int i = 1; i < m; i++) {
        Anew[0][i]   = 0.f;
        Anew[n-1][i] = 0.f;
    }
 #pragma omp parallel for shared(Anew)
    for (int j = 1; j < n; j++) {
        Anew[j][0]   = y0[j];
        Anew[j][m-1] = y0[j]*expf(-pi);
    }
     while ( error > tol && iter < iter_max ) {
        error = 0.f;

 #pragma omp parallel for shared(m, n, Anew, A)
 #pragma acc kernels
        for( int j = 1; j < n-1; j++) {
            for( int i = 1; i < m-1; i++ ) {
                Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
                error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
            }
        }

 #pragma omp parallel for shared(m, n, Anew, A)
 #pragma acc kernels
        for( int j = 1; j < n-1; j++) {
            for( int i = 1; i < m-1; i++ ) {
                A[j][i] = Anew[j][i];
            }
        }

        if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
        iter++;
    }

    double runtime = GetTimer();
    printf(" total: %f s\n", runtime / 1000.f);
}

(timer.h can be found in the github repository at the following address: https://github.com/parallel-forall/code-samples.git) The most straightforward way to compile this example is to use the following commands:

$ module load pgi64
$ pgcc -I../common -acc -ta=nvidia,time -Minfo=accel laplace2d.c -o laplace2d_acc

This creates openACC version. Since the paradox login node does not have an accelerator card, this example must be submitted to PBS for execution on compute nodes. The following script could be used for the example above:

#!/bin/bash
#PBS -q standard
#PBS -l nodes=1:ppn=16
#PBS -l walltime=10:00:00
#PBS -e ${PBS_JOBID}.err
#PBS -o ${PBS_JOBID}.out
#PBS -A example

cd $PBS_O_WORKDIR

module load pgi64
./laplace2d_acc

If static linking is preferred, there are two options: to link everything statically or to static link only pgi libraries. The first can be achieved with compiler flag -Bstatic and the second with -Bstatic_pgi. With static linking one can avoid having to load pgi module for execution on compute nodes.

Editors

Console editors

Code can be edited directly from the console using nano, emacs or vim. Nano is the most simple editor and requires almost no previous experience, while emacs and vim require getting used to.

As a simple, but more graphical alternative, gedit can be used when ssh is accessed with the -X parameter.

To edit a file, type the editor name followed by the filename, e.g. gedit main.c.

IDEs

Eclipse

A commonly used IDE such as eclipse can be used to edit remote cluster files. From the Window tab, go to Perspective, Open Perspective and find Remote System Explorer. On the left side, click on the dropdown arrow and select New Connection. Select SSH Only, fill in paradox as hostname and proceed without changing anything else. Right click the newly created connection to the left, select Connect and fill out credentials to access the cluster files.

Atom

For Atom, install the remote-ftp package in settings. Then from the Packages tab, select Remote FTP and then Create SFTP config file. The fields you need to fill in are:

user - your username.
promptForPass - set to true to be asked for password on connection. Alternatively, fill in the pass field, but this is not recommended, as it exposes your password.
remote - should be /home/<your_username>. So, if your username was i_love_cats, the field would be /home/i_love_cats.

After that, select the connect option from Packages tab, Remote FTP and provide password to connect. If the remote project directory doesn’t appear, select the Toggle option in Remote FTP.

- PARADOX-IV User Guide