16–19 Oct 2016
Copenhagen University
Europe/Copenhagen timezone

Portable Parallelization with the Bohrium Runtime System

17 Oct 2016, 14:55
5m
Marble Hall (Copenhagen University)

Thorvaldsensvej 40
Mini Oral Contributions 2

Speaker

Mads Kristensen (Niels Bohr Institute, UCPH)

Description

The Bohrium runtime system uses an array-programming approach in a high-level language to extract parallelism and accelerate execution on a variety of hardware setups. Because the user is presented with an array-programming abstraction, a high degree of parallelism can be extracted without requiring the programmer to adapt the program to the current execution environment. Bohrium can use the NumPy library as the implementation of the array-programming approach and offload computations to the desired hardware. With such a setup, the programmer can develop applications and algorithms entirely within a familiar environment, and later decide to use the Bohrium runtime system to accelerate execution with GPGPUs or on a cluster installation.

In some ways this is similar to Chapel and other languages. However, Bohrium does not require a specific language; instead, it plugs into existing programming languages as a library. This removes the need for the special toolchains, libraries and other supporting infrastructure that a separate language requires. As each language integration simply calls into a standard C interface, there is no dependence on the execution model of the programming language. This allows the programmer to switch the execution target through a configuration file or environment variable without modifying the source program.

Once a program is running and calling the Bohrium library, all requested operations are encoded as array bytecode operations and collected for lazy evaluation. Once the top-level program requires a result, the collected array bytecodes are rearranged to fit the constraints of the current target architecture before being passed on. The actual execution is performed in a manner that seeks to optimize for the characteristics of the actual execution device.
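The record-then-execute behaviour described in the abstract can be illustrated with a toy model. The sketch below is not Bohrium code: the class names `Runtime` and `LazyArray`, the opcode names, and the instruction layout are all hypothetical, chosen only to show how array operations might be queued as bytecode and executed when a result is first needed.

```python
import operator

# Hypothetical opcode table; the names are illustrative, not Bohrium's bytecode set.
OPS = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul}


class Runtime:
    """Toy runtime that queues array instructions instead of running them."""

    def __init__(self):
        self.buffers = {}      # array id -> materialized list of values
        self.queue = []        # pending (opcode, out_id, lhs, rhs) instructions
        self.flush_count = 0   # how many times the queue was executed
        self._next_id = 0

    def _new_id(self):
        self._next_id += 1
        return self._next_id

    def array(self, values):
        """Create an array whose data is already available."""
        ident = self._new_id()
        self.buffers[ident] = list(values)
        return LazyArray(self, ident)

    def record(self, opcode, lhs, rhs):
        """Append one instruction to the queue and return the output array."""
        out = LazyArray(self, self._new_id())
        self.queue.append((opcode, out.ident, lhs, rhs))
        return out

    def flush(self):
        """Execute all pending instructions in order, then clear the queue."""
        if not self.queue:
            return
        self.flush_count += 1
        for opcode, out_id, lhs, rhs in self.queue:
            a = self.buffers[lhs[1]]              # left operand is always an array
            if rhs[0] == "arr":
                b = self.buffers[rhs[1]]
            else:
                b = [rhs[1]] * len(a)             # broadcast a scalar constant
            fn = OPS[opcode]
            self.buffers[out_id] = [fn(x, y) for x, y in zip(a, b)]
        self.queue = []


class LazyArray:
    """Handle to an array that may not have been computed yet."""

    def __init__(self, runtime, ident):
        self.runtime = runtime
        self.ident = ident

    def _binary(self, opcode, other):
        if isinstance(other, LazyArray):
            rhs = ("arr", other.ident)
        else:
            rhs = ("const", other)
        return self.runtime.record(opcode, ("arr", self.ident), rhs)

    def __add__(self, other):
        return self._binary("ADD", other)

    def __sub__(self, other):
        return self._binary("SUB", other)

    def __mul__(self, other):
        return self._binary("MUL", other)

    def to_list(self):
        """Reading the data is what forces execution of the queued work."""
        self.runtime.flush()
        return list(self.buffers_view())

    def buffers_view(self):
        return self.runtime.buffers[self.ident]


rt = Runtime()
x = rt.array([1.0, 2.0, 3.0])
y = rt.array([10.0, 20.0, 30.0])
z = (x + y) * 2 - x          # three instructions are queued, nothing is computed
pending = len(rt.queue)
result = z.to_list()         # the queue is executed here, in a single flush
```

In this model, the point at which `flush` runs is the natural place for a real runtime to reorder or combine the queued instructions for the target device; the toy version simply executes them in order.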
For the GPGPU backend, one device-specific optimization is to schedule data transfers so that they overlap with computation and thus hide the latency inherent in GPGPU communication. For the CPU backend, optimizing for the device means distributing data in a NUMA-aware fashion to exploit the full memory bandwidth, as well as using JIT compilation and OpenMP to execute the bytecode sequence.

First we present the current state of the Bohrium project, including features, performance measurements and caveats. Then we present the major ongoing projects within the Bohrium system: the Niels language, multi-core optimizations, Xeon Phi execution, FPGA execution, GPGPU targets, linear algebra packages, distributed setups and more.
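The overlap of data transfer and computation mentioned for the GPGPU backend follows a general pipelining pattern, which can be sketched without any GPU at all. The example below is only an analogy in plain Python threads, not Bohrium's scheduler: the function names `transfer`, `compute` and `pipelined_sum_of_squares` are hypothetical, and a bounded queue of depth two stands in for double buffering between the host and the device.

```python
import queue
import threading


def transfer(chunks, staged):
    """Stand-in for host-to-device copies: stage one chunk at a time."""
    for chunk in chunks:
        staged.put(list(chunk))   # the copy simulates moving data to the device
    staged.put(None)              # sentinel: no more data will arrive


def compute(chunk):
    """Stand-in for a device kernel working on one staged chunk."""
    return sum(v * v for v in chunk)


def pipelined_sum_of_squares(chunks, depth=2):
    # A bounded queue of depth 2 acts as a double buffer: while one chunk is
    # being computed on, the transfer thread can already stage the next one.
    staged = queue.Queue(maxsize=depth)
    worker = threading.Thread(target=transfer, args=(chunks, staged))
    worker.start()
    total = 0
    while True:
        chunk = staged.get()
        if chunk is None:
            break
        total += compute(chunk)
    worker.join()
    return total


chunks = [[1, 2], [3, 4], [5, 6]]
total = pipelined_sum_of_squares(chunks)
```

The bounded queue is the design choice that matters here: it lets the transfer run ahead of the computation by a fixed number of chunks, so staging cost is hidden behind compute time without requiring memory for the whole data set at once.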

Primary author

Mads Kristensen (Niels Bohr Institute, UCPH)
