System Benchmarks

This section describes the benchmarks used to explore the intrinsic performance of a system in terms of memory system bandwidth (Stream) and the overheads of common HPC programming models such as OpenMP and MPI (EPCC OpenMP and EPCC OpenMP/MPI). The following table lists the repositories providing the source code and workload configurations used to test each MEEP environment:

Benchmark name     MEEP Repository
---------------    ------------------------
Stream             meep-bench-method branch
EPCC-OpenMP        meep-bench-method branch
EPCC-OpenMP/MPI    meep-bench-method branch

HPC Benchmarks

This section lists and describes the set of HPC benchmarks selected to analyze the performance of all available MEEP environments. Specifically, the information shown here is intended to enable reproduction of the reported performance analysis.

Benchmark name       MEEP Repository
------------------   ------------------------
RISC-V Benchmarks    meep-bench-method branch
HPL                  meep-bench-method branch
HPCG                 meep-bench-method branch
FFTXLIB              meep-bench-method branch
CloudMicroPhysics    NA
AdvectionMPDATA      NA

Data Analytics Benchmarks

TensorFlow Lite models

Given that TensorFlow Lite only performs inference, we use pre-trained models. Over each trained model, we run a synthetic benchmark to assess inference timings. The models are a set of neural networks representative of current data analytics architectures. The provided pre-trained models are:

  • MNIST: its input is a set of images of hand-written digits from 0 to 9, and the model identifies the corresponding digit. It is widely used as a "hello world" for deep learning.
  • VGG-19: VGG19 is a variant of the VGG model which, in short, consists of 19 layers (16 convolution layers and 3 fully connected layers, plus 5 MaxPool layers and 1 SoftMax layer). There are other variants of VGG, such as VGG11 and VGG16. VGG19 has 19.6e9 FLOPs.
  • ResNet50: ResNet50 is a variant of the ResNet model, with 48 convolution layers along with 1 MaxPool and 1 Average Pool layer. It has 3.8e9 FLOPs. It is a widely used ResNet model.
  • MobileNet: the MobileNet model is based on depthwise separable convolutions, a form of factorized convolution that splits a standard convolution into a depthwise convolution and a 1x1 convolution called a pointwise convolution (see the sketch after this list).
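
The factorization argument is easiest to see with concrete numbers. The following sketch compares parameter counts for a standard convolution and its depthwise-separable equivalent; the layer sizes are illustrative and not taken from MobileNet itself.

```python
# Parameter counts: standard convolution vs. depthwise-separable convolution.
# Layer sizes below are illustrative, not MobileNet's actual configuration.
k, c_in, c_out = 3, 32, 64          # kernel size, input/output channels

standard = k * k * c_in * c_out     # one dense k x k kernel per output channel
depthwise = k * k * c_in            # one k x k kernel per input channel
pointwise = 1 * 1 * c_in * c_out    # 1x1 convolution mixing the channels
separable = depthwise + pointwise

print(f"standard : {standard} parameters")      # 18432
print(f"separable: {separable} parameters")     # 2336
print(f"reduction: {standard / separable:.1f}x")
```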
The benchmark takes an input graph and an input image. It then runs inference over 50 iterations (parameterizable) and outputs the average inference time and standard deviation, as well as the fastest and slowest inference timings; a sketch of such a timing loop follows. The pre-trained models are offered as an additional RPM package, which can be found at TFLite models.
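
As an illustration of this measurement loop, here is a minimal sketch assuming the tflite_runtime package and a single-input model file named model.tflite; the file name and iteration count are placeholders, not the actual MEEP harness.

```python
# Minimal sketch of a TFLite inference-timing loop (illustrative, not the
# MEEP harness). Assumes a single-input model stored as "model.tflite".
import time
import numpy as np
import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

# Synthetic input matching the model's expected shape and dtype.
data = np.random.random_sample(tuple(inp["shape"])).astype(inp["dtype"])

timings = []
for _ in range(50):                      # iteration count is parameterizable
    interpreter.set_tensor(inp["index"], data)
    start = time.perf_counter()
    interpreter.invoke()                 # time the inference call only
    timings.append(time.perf_counter() - start)

print(f"mean  : {np.mean(timings):.6f} s")
print(f"stddev: {np.std(timings):.6f} s")
print(f"min   : {np.min(timings):.6f} s  max: {np.max(timings):.6f} s")
```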

Apache Spark - Epistasia use case

Epistasis is the interaction between genes that influences a phenotype. Genes can either mask each other, so that one is considered "dominant," or they can combine to produce a new trait; it is this conditional relationship between two genes that can determine a single phenotype for some traits. An HPC application, Epistasia, has been developed to find all these interactions. The application uses Apache Spark to move the data from disk to memory. Since genome data is massive, the genome is split into smaller partitions. Once Spark has moved each partition into memory, the application leverages NumPy to perform the computational part, as sketched below. The Epistasis use-case RPM can be found at Epistasis.
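
The following is a minimal sketch of the Spark-plus-NumPy pattern described above, assuming genotype data stored as CSV rows of integers; the file name, partition count, and the pairwise interaction score are illustrative placeholders, not the actual Epistasia code.

```python
# Sketch of the Epistasia pattern: Spark partitions the genome data,
# NumPy does the per-partition computation. All names are illustrative.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("epistasis-sketch").getOrCreate()

# Spark reads the genome data from disk and splits it into partitions.
rdd = spark.sparkContext.textFile("genotypes.csv", minPartitions=64)

def score_partition(rows):
    # Each partition is materialized in memory and handed to NumPy.
    block = np.array([list(map(int, r.split(","))) for r in rows])
    if block.size == 0:
        return
    # Illustrative pairwise interaction score: correlation between SNP columns.
    corr = np.corrcoef(block, rowvar=False)
    yield float(np.abs(np.triu(corr, k=1)).max())

best = rdd.mapPartitions(score_partition).max()
print("strongest pairwise interaction score:", best)
```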

Workflow Benchmarks

One part of the MEEP Software Stack is devoted to the development and orchestration of parallel and distributed workflows with COMPSs. In this section, we present a set of workflows implemented with PyCOMPSs (the Python binding of COMPSs) that could benefit from the MEEP platform's capabilities. The first part of the section presents a set of dislib algorithms that implement distributed workflows for machine learning. The second part presents another workflow use case, focused on Hyper-Dimensional Computing.

Dislib Workflows

The Distributed Computing Library (dislib) is a machine learning library built on top of PyCOMPSs; it thus provides machine learning algorithms that are distributed and parallel. The library focuses on the execution of data analysis algorithms on distributed platforms such as supercomputers. The workflows evaluated in MEEP are: Matrix Multiplication, QR Decomposition, Cascade Support Vector Machines, K-Means, Random-Forest Classifier, and Gaussian Mixture Model; a short usage sketch follows. The code of these workflows can be found in this repository.
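
As a usage illustration, the following minimal sketch runs one of the listed workflows (K-Means) through dislib; it assumes a working PyCOMPSs runtime, and the array and block sizes are illustrative.

```python
# Minimal dislib K-Means sketch (illustrative sizes; requires a PyCOMPSs runtime).
import numpy as np
import dislib as ds
from dislib.cluster import KMeans

# Load data into a distributed ds-array partitioned into blocks;
# each block becomes a unit of work for the underlying PyCOMPSs tasks.
x = ds.array(np.random.random((1000, 10)), block_size=(100, 10))

kmeans = KMeans(n_clusters=4, random_state=0)
labels = kmeans.fit_predict(x)    # executes as a distributed workflow
print(labels.collect()[:10])      # gather the resulting labels on the master
```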

Hyper-Dimensional Computing Workflow

Hyper-Dimensional Computing (HDC), also known as Vector Symbolic Architecture, is a computing framework that tries to emulate the animal nervous system. It does so by representing the space using the properties of high-dimensional random vectors. The high-level idea is to represent information (x) from an input space (IS) by projecting it into a space of large dimensionality (d), typically d = 10,000. This hyperspace is usually binary, HS = {0,1}^d, or bipolar, HS = {-1,1}^d. One essential part of HDC is encoding the data, which requires a mapping from the data space (IS) to the hyperspace (HS). The encoding has the property that vectors are holographic: the dimensions of the hypervectors are independent and identically distributed, which makes the hypervectors robust and lets each dimension carry the same amount of information. We have parallelized this framework with PyCOMPSs, creating a coarse-grain, task-based parallel and distributed version of the computation; a sketch of the basic HDC operations appears below. The different coarse-grain tasks have been internally parallelized using linear algebra libraries such as BLIS. The code of this workflow can be found in this repository.
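
The following is a minimal, self-contained sketch of the basic HDC operations mentioned above (random bipolar hypervectors, binding, bundling, and similarity); it illustrates the encoding idea only and is not the PyCOMPSs-parallelized MEEP implementation.

```python
# Sketch of core HDC operations on bipolar hypervectors (illustrative only).
import numpy as np

d = 10_000                       # hyperspace dimensionality
rng = np.random.default_rng(0)

def random_hv():
    # i.i.d. bipolar components make the vector "holographic":
    # every dimension carries the same amount of information.
    return rng.choice([-1, 1], size=d)

def bind(a, b):
    # Elementwise multiplication associates two hypervectors.
    return a * b

def bundle(vectors):
    # Majority vote (sign of the sum) superimposes several hypervectors.
    return np.sign(np.sum(vectors, axis=0))

def similarity(a, b):
    # Normalized dot product: ~0 for unrelated vectors, ~1 for identical ones.
    return a @ b / d

role, filler = random_hv(), random_hv()
# Bundle the role-filler pair with two unrelated vectors (odd count avoids ties).
record = bundle([bind(role, filler), random_hv(), random_hv()])
# Unbinding the record with the role recovers something close to the filler.
print(similarity(bind(record, role), filler))   # clearly above the ~0 baseline
```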

Systolic Arrays Benchmarks

One part of the MEEP Software Stack is devoted to the development and evaluation of the Systolic Array co-processors. In this section, we present a set of benchmarks ported to use the systolic array assembly instructions implemented in the LLVM compiler, thereby leveraging these ISA extensions.

First, we present the benchmark used to evaluate the SA-HEVC co-processor; second, the kernels used to evaluate the SA-NN co-processor.

SA-HEVC Benchmarking

Bolt65 is a performance-optimized HEVC hardware/software suite for just-in-time video processing, developed as part of the research activities of the HPC Architecture and Application Research Group at the Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia. Bolt65 is a "clean-room" suite consisting of an encoder, decoder, and transcoder based on the HEVC standard. Its development places special focus on performance efficiency, achieved through low-level optimizations and hardware/software co-design adapted to the efficient exploitation of heterogeneous, accelerator-based architectures.

Within the MEEP project, Bolt65 is being used for the validation and verification of SA-HEVC and for the evaluation of the final MEEP platform with the integrated SA-HEVC.

The main benchmark of Bolt65 in MEEP is a performance comparison between two different implementations of the Bolt65 encoder, decoder, and transcoder:

  • a basic scalar implementation;
  • an implementation that uses a specialized systolic array (SA-HEVC) designed and implemented as part of the MEEP project.

The code of this benchmark can be found in this repository.

SA-NN Benchmarking