Speaker
Prof.
Brian Vinter
(Niels Bohr Institute, University of Copenhagen)
Description
In business, Big Data analysis is most often managed with the Hadoop system[1]. The popularity of Hadoop has made the system well known and widely supported. Thus as may be expected, the use of Hadoop for scientific data analysis has also been widely investigated[2][3]. Hadoop however, has several design choices that makes is less than optimal for scientific data processing. Most importantly Hadoop, and the underlying file system HDFS, assumes that data is a sequence of bytes, while in science most data is not one dimensional and splitting data at arbitrary points means that structures of higher dimensions, 2D images, 3D volumes or NetCDF data is split across two nodes and processing thus require inter-node communication during data processing. In addition, Hadoop requires analysis scripts to be written in Java, which is perfectly possible for scientific data, but most often not the most convenient programming language. In this talk we introduce the Big Data Analysis Engine, BDAE(/b’dei/). In BDAE data is saved in a way that preserves the semantic structure of the data, i.e. images or volumes are kept at a single node, and NetCDF data is kept at a one node for individual records, but different fields are split into different datasets, i.e. climate data is split between temperature, pressure, … so that the entire dataset need not be traversed to do analysis on a single component. Rather than Java BDAE scripts are written in Python, and since the data structures are kept in BDAE, the programming interface is much simpler; a programmer can have all elements in a dataset presented as simple iterator, i.e.; for temp in temperature:…
Data in BDAE is seamlessly distributed across the storage nodes in a data-analysis cluster, round-robin for datasets of unknown size at the time the dataset is created, i.e. streaming input, and in equal size chunks for datasets where the total size is known at creation time, i.e. a NetCDF file that is imported into BDAE. Datasets may be replicated within a BDAE cluster, this means that data that must be kept highly available may be replicated at two or more nodes, while other datasets are not replicated. Replicated and non-replicated data coexists on the same cluster, this is also a unique feature for BDAE. The talk will introduce BDAE, some internal features and show examples of how parallel data analysis scripts are easily written without the programmer being exposed to parallelism or data distribution. Examples include image analysis, tomographic reconstruction as in Fig 1, and statistics on a NetCDF dataset.
![Fig. 1 BDAE tomographic reconstruction. Middle level is a sketch to indicate a partial volume.
][1]
References
1. Apache Software Foundation. Hadoop.
2. Fadika, Zacharia, et al. "Evaluating hadoop for data-intensive scientific operations." Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on. IEEE, 2012.
3. Buck, Joe B., et al. "SciHadoop: array-based query processing in Hadoop."Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2011.
[1]: https://sid.erda.dk/seafile/f/6a1344be6a/
Primary author
Prof.
Brian Vinter
(Niels Bohr Institute, University of Copenhagen)
Co-author
Mr
Steffen Karlsson
(NBI, UCPH)