Speaker
Mads Kristensen
(Niels Bohr Institute, UCPH)
Description
The Bohrium runtime system exploits an array programming approach from a high-level language to extract parallelism and accelerate execution on a variety of hardware setups. By presenting the user with an array programming abstraction, it is possible to extract a high level of parallelism without requiring the programmer to adjust the program to fit the current execution environment.
Bohrium can use the NumPy library as its array programming interface and offload computations to the desired hardware. With such a setup, the programmer can develop applications and algorithms entirely within a familiar environment, and later decide to use the Bohrium runtime system to accelerate execution on GPGPUs or on a cluster installation.
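As a minimal sketch of this workflow, the script below is written purely against the NumPy API and tries to import Bohrium as a drop-in replacement, falling back to plain NumPy when Bohrium is not installed. The `import bohrium as np` form is assumed from Bohrium's NumPy integration; the function name `integrate_pi` and the problem size are illustrative only.

```python
# Assumption: Bohrium provides a NumPy-compatible module importable as `bohrium`.
try:
    import bohrium as np  # accelerated runtime, if it is installed
except ImportError:
    import numpy as np    # plain NumPy fallback; the code below is unchanged


def integrate_pi(n):
    """Approximate pi as the integral of 4 / (1 + x^2) over [0, 1].

    Uses the midpoint rule, expressed only with whole-array operations
    so that an array runtime is free to parallelize it.
    """
    x = (np.arange(n) + 0.5) / n          # midpoints of n equal sub-intervals
    return float(np.sum(4.0 / (1.0 + x * x)) / n)


if __name__ == "__main__":
    print(integrate_pi(100_000))
```

The program contains no device-specific code: whichever module ends up bound to `np` decides where the array operations run.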
In some ways this is similar to Chapel and other languages; however, Bohrium does not require a specific language. Instead, it plugs into existing programming languages as a library. This removes the need for the special toolchains, libraries and other supporting infrastructure that a separate language requires. As each language integration simply calls into a standard C interface, there is no dependence on the execution model of the programming language. This allows the programmer to switch the execution target through a configuration file or environment variable without making any modifications to the source program.
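The idea of choosing the execution target outside the source program can be sketched as follows. This illustrates the mechanism only: the variable name `EXAMPLE_TARGET`, the registry and the backend functions are hypothetical and are not Bohrium's actual configuration keys or interfaces.

```python
import os


def run_on_cpu(values):
    # Stand-in for a CPU backend: plain Python execution.
    return [v * 2 for v in values]


def run_on_gpu(values):
    # Stand-in for a GPGPU backend; same result, different (pretend) device.
    return [v * 2 for v in values]


# Hypothetical registry mapping target names to backends.
BACKENDS = {"cpu": run_on_cpu, "gpu": run_on_gpu}


def select_backend(environ=None):
    """Pick the backend named by the (hypothetical) EXAMPLE_TARGET variable."""
    environ = os.environ if environ is None else environ
    name = environ.get("EXAMPLE_TARGET", "cpu")  # default to the CPU backend
    if name not in BACKENDS:
        raise ValueError(f"unknown execution target: {name!r}")
    return BACKENDS[name]


def double_all(values, environ=None):
    # The user program never names a device; the environment decides.
    return select_backend(environ)(values)
```

Running the same script with `EXAMPLE_TARGET=gpu` set in the shell would route `double_all` to the other backend without touching the program text.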
Once a program is running and calling the Bohrium library, all requested operations are encoded as array bytecode operations and collected for lazy evaluation. When the top-level program requires a result, the collected array bytecodes are rearranged to fit the constraints of the current target architecture before being passed on.
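The record-then-flush behaviour described above can be sketched with a toy runtime. This is not Bohrium's bytecode format or API: the classes `Runtime` and `LazyArray`, the opcodes and the tuple layout are assumptions made for illustration, and the flush step simply executes in recorded order where a real runtime would rearrange the bytecodes first.

```python
class LazyArray:
    """Toy array handle: operations on it are recorded, not executed."""

    def __init__(self, runtime, slot):
        self.runtime = runtime
        self.slot = slot  # index of this array's storage in the runtime

    def __add__(self, other):
        return self.runtime.record("ADD", self, other)

    def __mul__(self, other):
        return self.runtime.record("MUL", self, other)

    def to_list(self):
        # Requesting the data is what forces execution.
        self.runtime.flush()
        return self.runtime.storage[self.slot]


class Runtime:
    OPS = {"ADD": lambda a, b: a + b, "MUL": lambda a, b: a * b}

    def __init__(self):
        self.storage = []  # materialized data, one entry per array
        self.queue = []    # pending bytecodes: (opcode, out, in1, in2)

    def array(self, values):
        self.storage.append(list(values))
        return LazyArray(self, len(self.storage) - 1)

    def record(self, opcode, a, b):
        self.storage.append(None)  # output is not computed yet
        out = LazyArray(self, len(self.storage) - 1)
        self.queue.append((opcode, out.slot, a.slot, b.slot))
        return out

    def flush(self):
        # A real runtime would reorder or fuse bytecodes for the target here.
        for opcode, out, in1, in2 in self.queue:
            fn = self.OPS[opcode]
            self.storage[out] = [
                fn(x, y) for x, y in zip(self.storage[in1], self.storage[in2])
            ]
        self.queue.clear()
```

For example, after `rt = Runtime()`, `a = rt.array([1, 2, 3])`, `b = rt.array([10, 20, 30])` and `c = (a + b) * a`, two bytecodes are queued and nothing has been computed; only `c.to_list()` triggers execution.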
The actual execution is performed in a manner that seeks to optimize for the characteristics of the actual execution device. For the GPGPU backend, one such optimization is to schedule data transfers so that they overlap with computation and thus hide the latency inherent in GPGPU communication. For the CPU backend, this means distributing data in a NUMA-aware fashion to exploit the full memory bandwidth, as well as using JIT compilation and OpenMP to execute the bytecode sequence.
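The transfer/compute overlap mentioned for the GPGPU backend follows a familiar double-buffering pattern, sketched here with a background thread. The functions `transfer` and `compute` are hypothetical stand-ins for a host-to-device copy and a device kernel; the sketch shows only the scheduling order, not Bohrium's actual scheduler or any real device API.

```python
from concurrent.futures import ThreadPoolExecutor


def transfer(chunk):
    # Stand-in for a host-to-device copy.
    return list(chunk)


def compute(chunk):
    # Stand-in for a device kernel.
    return sum(x * x for x in chunk)


def pipelined_sum_of_squares(chunks):
    """Start the transfer of chunk i+1 before computing on chunk i."""
    total = 0
    if not chunks:
        return total
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(transfer, chunks[0])
        for next_chunk in chunks[1:]:
            ready = pending.result()                     # wait for the current transfer
            pending = pool.submit(transfer, next_chunk)  # next transfer runs in the background
            total += compute(ready)                      # computation overlaps that transfer
        total += compute(pending.result())               # last chunk has nothing to overlap with
    return total
```

With real devices the same ordering lets the copy of the next chunk proceed while the current kernel runs, so the transfer latency is paid only once, for the first chunk.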
First we present the current state of the Bohrium project, including features, performance measurements and caveats. Then we present the major ongoing projects within the Bohrium system: the Niels language, multi-core optimizations, Xeon Phi execution, FPGA execution, GPGPU targets, linear algebra packages, distributed setups and more.
Primary author
Mads Kristensen
(Niels Bohr Institute, UCPH)