EASC 2018 - The Exascale Applications and Software Conference

2018-04-17 - 2018-04-19

Tuesday 17th April, 15:30-16:20 - Efficient Gather-Scatter Operations in Nek5000 Using PGAS

Niclas Jansson, Nick Johnson, and Michael Bareford

Gather-scatter operations are among the key communication kernels used in the computational fluid dynamics (CFD) application Nek5000, both for fetching data dependencies (gather) and for spreading results to other nodes (scatter). The current implementation used in Nek5000 is the Gather-Scatter library, GS, which utilises different communication strategies (nearest neighbour exchange, message aggregation, and collectives) to perform communication efficiently on a given platform. GS is implemented using non-blocking, two-sided message passing via MPI, and the library has proven to scale well to hundreds of thousands of cores. However, the necessity to match sending and receiving messages in the two-sided communication abstraction can quickly increase latency and synchronisation costs for very fine-grained parallelism, in particular for the irregular communication patterns created by unstructured CFD problems.
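
As a rough illustration of the operation itself, the serial sketch below sums all local values that share a global identifier (gather) and then writes the combined value back to every local copy (scatter). It is not the GS or ExaGS interface: the function name, the summation operator and the sample data are assumptions chosen for illustration, and all communication between processes is left out.

```python
def gather_scatter_sum(values, global_ids):
    """Hypothetical serial sketch: combine entries sharing a global id."""
    totals = {}
    # Gather: accumulate every local value into the slot of its global id.
    for value, gid in zip(values, global_ids):
        totals[gid] = totals.get(gid, 0.0) + value
    # Scatter: every local entry receives the combined value for its id.
    return [totals[gid] for gid in global_ids]


# Local entries 1 and 2 refer to the same global node (id 1).
local_values = [1.0, 2.0, 3.0, 4.0]
global_ids = [0, 1, 1, 2]
result = gather_scatter_sum(local_values, global_ids)
print(result)
```

In a distributed setting, entries that share a global identifier may live on different processes, which is where the choice between two-sided and one-sided communication matters.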

ExaGS is a re-implementation of the Gather-Scatter library, with the intent to use the best available programming model for a given architecture. We present our current implementation of ExaGS, based on the one-sided programming model provided by the Partitioned Global Address Space (PGAS) abstraction, using Unified Parallel C (UPC). Using a lock-free design with efficient point-to-point synchronisation primitives, ExaGS is able to reduce communication latency compared to the current two-sided MPI implementation. A detailed description of the library and the implemented algorithms is given, together with a performance study of ExaGS when used with Nek5000 and its co-design benchmarking application Nekbone.

Wednesday 18th April, 11:00-12:15 - OpenACC accelerator for the PN–PN-2 algorithm in Nek5000

Evelyn Otero, Jing Gong, Misun Min, Paul Fischer, Philipp Schlatter and Erwin Laure

Nek5000 is an open-source code for the simulation of incompressible flows. Nek5000 is widely used in a broad range of applications, including the study of thermal hydraulics in nuclear reactor cores, the modeling of ocean currents, and the study of stability, transition and turbulence on airplane wings. Exascale HPC architectures are increasingly prevalent in the Top500 list, with CPU-based nodes enhanced by accelerators or co-processors optimized for floating-point calculations. We have previously presented a series of case studies on partially porting Nek5000/Nekbone to parallel GPU-accelerated systems, see [1–3]. In this paper, we expand on our previous work and take advantage of the optimized results to port the full version of Nek5000 to GPU-accelerated systems, especially regarding the PN–PN-2 algorithm. This algorithm is a way to decouple the momentum equations from the pressure equation that does not lead to spurious pressure modes. It is more efficient than other methods, but it involves different approximation spaces for velocity (order N) and pressure (order N-2). The paper focuses on a technology watch of heterogeneous modelling and its impact on exascale architectures (e.g. GPU-accelerated systems). GPU accelerators can strongly speed up the most time-consuming parts of the code, running efficiently in parallel on thousands of cores. The goal of this work is to investigate whether the PN–PN-2 algorithm can take advantage of hybrid architectures and be used in Nek5000 to improve its scalability to exascale. In this talk, we describe the GPU implementation of the PN–PN-2 algorithm in Nek5000, namely:

  • The use of GPU-direct to communicate directly between GPU memory spaces without involving the CPU memory. For this work, we use an OpenACC-accelerated version of Nek5000 which is already implemented in the MPI communication library gs [3].
  • The initial profiling of the code and the assessment of its suitability for porting, focusing on the most time-consuming subroutines.
  • The implementation of the OpenACC version for the multigrid solver.

In addition, we present the initial performance results of the OpenACC version of the PN–PN-2 algorithm for a typical production problem. Finally, we discuss the experience and the challenges we faced during this work.

Wednesday 18th April, 16:00-17:40 - Wavelet-Based Compression Algorithm

Patrick Vogler, Ulrich Rist

The steady increase of available computer resources has enabled engineers and scientists to use progressively more complex models to simulate a myriad of fluid flow problems. Yet, whereas modern high performance computers (HPC) have seen a steady growth in computing power, the same trend has not been mirrored by a significant gain in data transfer rates. Current systems are capable of producing and processing large amounts of data quickly, while the overall performance is oftentimes hampered by how fast a system can transfer and store the computed data. Considering that CFD researchers invariably seek to study simulations with increasingly higher spatial and temporal resolution, the imminent move to exascale computing will only exacerbate this problem [1]. By using otherwise wasted compute cycles to create a more compact form of a numerical dataset, one could alleviate the I/O bottleneck by exploiting its inherent statistical redundancies. Since effective data storage is a pervasive problem in information technology, much effort has already been spent on adapting existing compression algorithms for floating-point arrays.

In this context, Loddoch and Schmalzl [1] have extended the Joint Photographic Experts Group (JPEG) standard for volumetric floating-point arrays by applying the one-dimensional real-to-real discrete cosine transform (DCT) along the axis of each spatial dimension, using a variable-length code to encode the resulting DCT coefficients. Lindstrom [2], on the other hand, uses a lifting-based integer-to-integer implementation of the discrete cosine transform, followed by an embedded block coding algorithm based on group testing. While these compression algorithms are simple and efficient in exploiting the low-frequency nature of most numerical datasets, their major disadvantage lies in the non-locality of the basis functions of the discrete cosine transform. Thus, if a DCT coefficient is quantized, the effect of a lossy compression stage will be felt throughout the entire flow field [3]. To alleviate this, the numerical field is typically divided into small blocks and the discrete cosine transform is applied to each block in turn. While partitioning the flow field also facilitates random-access read and write operations, this approach gives rise to the block boundary artifacts that are synonymous with the JPEG compression standard.
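
The non-locality argument can be made concrete with a small sketch. It applies an orthonormal one-dimensional DCT to eight samples, perturbs a single coefficient as a stand-in for quantization error, and transforms back. The function names, the signal and the size of the perturbation are assumptions for illustration and are not taken from the cited codecs.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a list of floats (direct O(n^2) evaluation)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        out.append(scale * acc)
    return out


def idct(c):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        acc = 0.0
        for k in range(n):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            acc += scale * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(acc)
    return out


signal = [float(i) for i in range(8)]
coeffs = dct(signal)

# Assumed stand-in for quantization: shift one coefficient by a fixed amount.
perturbed = list(coeffs)
perturbed[1] += 0.5

# The error is spread over every sample, because each cosine basis
# function extends across the whole block.
errors = [abs(a - b) for a, b in zip(idct(perturbed), signal)]
print(errors)
```

Splitting the field into blocks confines such errors to one block at a time, which is also why discontinuities can appear at block boundaries.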


In order to circumvent this problem, we propose to adapt the JPEG-2000 (JP2) compression standard for volumetric floating-point arrays. In contrast to the baseline JPEG standard, JPEG-2000 employs a lifting-based one-dimensional discrete wavelet transform (DWT) that can be performed by either the reversible LeGall-(5,3) tap filter for lossless coding or the non-reversible Daubechies-(9,7) tap filter for lossy coding [3]. Due to its time-frequency representation, which identifies the time or location at which various frequencies are present in the original signal, the discrete wavelet transform allows the entire frame to be decorrelated at once. This eliminates the blocking artifacts commonly associated with the JPEG standard at high compression ratios. We therefore demonstrate the viability of a wavelet-based compression scheme for large-scale numerical datasets.
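
As a minimal sketch of the reversible lifting scheme, the code below applies one level of the LeGall (5,3) transform to a short integer sequence and then inverts it exactly. The function names, the restriction to even-length input and the use of integer samples are assumptions for illustration; how floating-point fields are mapped to integers is not covered here.

```python
def legall53_forward(x):
    """One level of the reversible LeGall (5,3) lifting transform.

    Assumes an even-length list of integers; boundaries use symmetric extension.
    Returns (approximation, detail) coefficient lists.
    """
    if len(x) < 2 or len(x) % 2 != 0:
        raise ValueError("this sketch expects an even-length input")
    even, odd = x[0::2], x[1::2]
    half = len(even)
    # Predict step: detail = odd sample minus the floor-average of its even neighbours.
    detail = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[i]
        detail.append(odd[i] - (even[i] + right) // 2)
    # Update step: approximation = even sample plus a rounded quarter of nearby details.
    approx = []
    for i in range(half):
        left = detail[i - 1] if i > 0 else detail[0]
        approx.append(even[i] + (left + detail[i] + 2) // 4)
    return approx, detail


def legall53_inverse(approx, detail):
    """Undo legall53_forward by running the lifting steps in reverse order."""
    half = len(approx)
    even = []
    for i in range(half):
        left = detail[i - 1] if i > 0 else detail[0]
        even.append(approx[i] - (left + detail[i] + 2) // 4)
    x = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[i]
        x.extend([even[i], detail[i] + (even[i] + right) // 2])
    return x


samples = [3, 7, -2, 5, 0, 9, 4, 4]
approx, detail = legall53_forward(samples)
restored = legall53_inverse(approx, detail)
print(approx, detail, restored == samples)
```

Because each lifting step is undone by subtracting exactly what was added, the integer rounding does not prevent perfect reconstruction, which is what makes this filter suitable for lossless coding.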

