## SCL Seminar by Vladimir Lončar

The SCL seminar of the Center for the Study of Complex Systems will be held on Friday, 25 November 2016 at 14:00 in the library reading room “Dr. Dragan Popović” of the Institute of Physics Belgrade. The talk entitled

**"Parallel algorithms for solving dipolar Gross-Pitaevskii equation"** will be given by Vladimir Lončar (Scientific Computing Laboratory, Center for the Study of Complex Systems, Institute of Physics Belgrade).

**Abstract of the talk:** We will present the parallelization of a semi-implicit split-step Crank-Nicolson algorithm for solving the dipolar Gross-Pitaevskii equation. Four parallel algorithms will be presented: a C implementation parallelized with OpenMP, targeting a single shared-memory system [1]; a CUDA implementation targeting a single Nvidia GPU [2]; and extensions of both to distributed-memory systems using MPI [3]. We will first give an overview of the split-step Crank-Nicolson method and describe how the dipolar term is computed using FFT, which forms the basis of all the presented algorithms. We will then describe the concepts used in each of the parallel implementations, and finally present a performance evaluation of each algorithm. In our tests, the OpenMP implementation demonstrates a speedup of 12 on a 16-core workstation and the CUDA version a speedup of up to 25, while the MPI parallelization yields a further speedup of 16 for the OpenMP/MPI version and of 10 for the CUDA/MPI version.

[1] D. Vudragović et al., Comput. Phys. Commun. 183, 2021 (2012).

[2] V. Lončar et al., Comput. Phys. Commun. 200, 406 (2016).

[3] V. Lončar et al., Comput. Phys. Commun. 209, 190 (2016).