The International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART) is a forum to present and discuss new research on computing systems utilizing acceleration technology. The main theme of HEART is achieving high efficiency with accelerators, which is of utmost importance across a wide spectrum of computing systems. In the high-performance computing and data center domains, high efficiency mostly relates to performance, while in the mobile and IoT space the research communities think about accelerators more from a power/energy perspective.
HEART'21 was held as an online symposium. The conference program is available via the timetable link to the left. Video recordings are provided for presentations whose speakers have given consent.
For further information about past and future editions of HEART please refer to the HEART website.
Session Chair: Paul Chow
In this paper, we focus on data sorting, one of the basic arithmetic operations, and present a sorting library that can be used with the OpenCL programming model for FPGAs. Our sorting library is built by combining three hardware sorting algorithms. It consumes more than twice the overall hardware resources of a merge sort restructured for the OpenCL programming model for FPGAs, but its operating frequency is 1.09x higher and its sorting throughput is three orders of magnitude better than the baseline.
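The abstract does not name the three algorithms, but hardware sorters of this kind are commonly built from sorting networks, whose fixed compare-and-swap schedule pipelines well on FPGAs. As an illustration only (a generic stand-in, not the paper's design), here is a minimal software model of a bitonic sorting network:

```python
def bitonic_sort(data):
    """Software model of a bitonic sorting network (ascending order).
    The input length must be a power of two, as in the hardware case."""
    a = list(data)
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # compare-and-swap distance within a stage
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

Each (k, j) pair corresponds to one pipeline stage of compare-and-swap units in hardware, which is why the structure maps naturally onto an FPGA.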
Although multi-FPGA systems are receiving attention as servers for MEC (Multi-access Edge Computing), traditional design tools focus only on a single FPGA. Here, a multi-FPGA programming environment based on NEC's integrated design tool CyberWorkBench (CWB) is introduced for the multi-FPGA system FiC (Flow-in-Cloud). Programmers describe their program in SystemC as small modules connected with FIFO channels, then verify the operation with a behavioral simulation that accounts for parallel execution. After high-level synthesis (HLS) with CWB, the distribution of modules to each board is decided and interface modules are inserted. A cycle-accurate simulation is applied to verify the operation and estimate the performance. Finally, the generated Verilog HDL code for each board is implemented with Xilinx's Vivado, just like in a traditional design flow, and the configuration is obtained. As an example, a simple convolutional neural network, LeNet, is described and implemented on a real system using the tool. Although the cycle-accurate simulation takes 105.34 sec, the estimated cycle count differs by only 2.2% from the execution result on the real boards. Since the example CNN LeNet is small, it can also be implemented on a single board with a traditional design tool.
However, with pipelined execution, a parallel implementation on two boards can distribute the input and output across different FPGAs and relieve the bottleneck.
Multi-access edge computing (MEC) devices that perform processing between the edge and cloud are becoming important in the Internet of Things infrastructure. MEC devices are designed to reduce the load on the edge devices, ensure real-time performance, and reduce the communication traffic between the edge and cloud. In this paper, to enable high-performance and low-power hardware-accelerated processing for different application domains in MEC devices, we propose an automated flow for domain-specific field-programmable gate array intellectual property core (FPGA-IP) generation and testing. First, we perform logic cell exploration using a target user application to find the optimal scalable logic module (SLM) structure, and use the optimal SLM instead of a lookup table to reduce the logic area. Second, we perform routing and FPGA array exploration to determine other FPGA-IP architecture parameters. Finally, the proposed flow uses the explored parameters to automatically generate the entire FPGA-IP and LSI test bitstreams. In a case study, we optimized an FPGA-IP for a differential privacy encryption circuit using the proposed flow. We implemented and evaluated the FPGA-IP with a 55nm TEG chip design. Furthermore, the simulation-based LSI test showed that 100% of the stuck-at faults in the routing paths of the FPGA-IP were detected.
Session Chair: Kentaro Sano
Combinatorial optimization problems are economically valuable but computationally hard to solve. Many practical combinatorial optimizations can be converted to ground-state search problems of Ising spin models. Simulated bifurcation (SB) is a quantum-inspired algorithm to solve these Ising problems. One of the remarkable features of SB is the high-degree parallelism, providing an opportunity for quickly solving those problems by massively parallel processing. In this talk, starting from the principles of SB, we review our recent works on the design and implementation of high-performance FPGA-based accelerators for SB and their applications toward innovative real-time systems that make optimal responses to ever-changing situations. An example of such applications is an ultrafast financial transaction machine that detects the most profitable cross-currency arbitrage opportunities at microsecond speeds. Also, we discuss the parallelism of SB in depth and show a scale-out architecture for SB-based Ising machines with all-to-all spin-spin couplings that allows continued scaling of both machine size and computational throughput by connecting multiple chips, rather than scaling up a single chip.
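The Ising formulation that SB targets can be stated compactly: the problem becomes finding the spin vector minimizing the Ising energy. The sketch below is illustrative only; a brute-force search stands in for the SB dynamics, which remain practical at vastly larger scales:

```python
import itertools
import numpy as np

def ising_energy(J, h, s):
    """Ising energy E(s) = -1/2 s^T J s - h^T s for spins s in {-1, +1}."""
    return -0.5 * s @ J @ s - h @ s

def brute_force_ground_state(J, h):
    """Exhaustive ground-state search; only feasible for tiny spin counts.
    SB-based Ising machines approximate this search for thousands of spins."""
    n = len(h)
    best_s, best_e = None, float("inf")
    for bits in itertools.product([-1, 1], repeat=n):
        s = np.array(bits, dtype=float)
        e = ising_energy(J, h, s)
        if e < best_e:
            best_s, best_e = s, e
    return best_s, best_e

# Tiny example: an antiferromagnetic triangle (all couplings J_ij = -1) is
# frustrated, so no assignment satisfies all three pairs; ground energy is -1.
J = np.array([[0., -1., -1.], [-1., 0., -1.], [-1., -1., 0.]])
h = np.zeros(3)
s, e = brute_force_ground_state(J, h)
```

The all-to-all coupling matrix J is exactly what the scale-out architecture mentioned above must partition across chips.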
Session Chairs: Kentaro Sano and Tomohiro Ueno
As the semiconductor technology advancement driven by Moore's law slows down, reconfigurable computing with FPGAs is gaining importance, with a significant presence in the HPC community as well as in the data center industry. Although FPGA devices and their design tools have advanced greatly in the last decade, we still lack sufficient knowledge and experience regarding the infrastructure and operation of FPGA-based or FPGA-involved HPC systems, even though there are several academic efforts to prototype FPGA-based HPC systems, such as Cygnus at the University of Tsukuba, Noctua at Paderborn University, and ESSPER at RIKEN R-CCS. In this special session, we hold a panel discussion on open challenges for the infrastructure and operation of FPGA-based HPC systems, followed by a keynote speech and lightning talks on experiences of HPC using FPGAs.
Session Chair: Christian Plessl
Python has become the de-facto language for scientific computing. Programming in Python is highly productive, mainly due to its rich science-oriented software ecosystem built around the NumPy module. As a result, the demand for Python support in High Performance Computing (HPC) has skyrocketed. However, the Python language itself does not necessarily offer high performance. In this work, we present a workflow that retains Python's high productivity while achieving portable performance across different architectures. The workflow's key features are HPC-oriented language extensions and a set of automatic optimizations powered by a data-centric intermediate representation. We also define a set of around 50 benchmark kernels written in NumPy to evaluate and compare Python frameworks for their portability and performance. We show performance results and scaling across CPU, GPU, and FPGA, with 2.47x and 3.75x speedups over previous-best solutions and the first-ever Xilinx and Intel FPGA results for annotated Python.
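As a rough illustration of the style of NumPy kernel such a benchmark set might contain (this specific kernel is an assumption, not taken from the paper), consider a 2-D Jacobi stencil written in plain, unannotated NumPy:

```python
import numpy as np

def jacobi_2d(A, B, steps):
    """Plain-NumPy 2-D Jacobi stencil: each step averages a cell with its
    four neighbors, ping-ponging between arrays A and B. Kernels like this
    are what data-centric frameworks analyze and map to CPU/GPU/FPGA."""
    for _ in range(steps):
        B[1:-1, 1:-1] = 0.2 * (A[1:-1, 1:-1] + A[1:-1, :-2] +
                               A[1:-1, 2:] + A[:-2, 1:-1] + A[2:, 1:-1])
        A[1:-1, 1:-1] = 0.2 * (B[1:-1, 1:-1] + B[1:-1, :-2] +
                               B[1:-1, 2:] + B[:-2, 1:-1] + B[2:, 1:-1])
    return A

rng = np.random.default_rng(0)
A = rng.random((16, 16))
B = A.copy()
A0 = A.copy()            # keep the original for comparison
out = jacobi_2d(A, B, 10)
```

The slicing expresses the data movement explicitly, which is precisely what a data-centric intermediate representation can exploit when generating architecture-specific code.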
In recent years, we have witnessed many fundamental changes in the broad context of high-performance computing, including the end of Dennard scaling with a fundamental paradigm shift in performance scaling, accelerators and heterogeneity with implications for programmability, as well as a widened set of applications such as machine learning. As a result, the complexity of HPC systems is increasing significantly while, at the same time, access is broadened to new user communities with less expertise in the details of HPC architectures. These two trends develop in opposite directions, widening the gap between users and hardware. In this session we will hear two opinions on how to tackle this problem: one focuses on improvements in programmability and productivity, while the other considers a wider range of aspects, from programmability to operation.
In the past few years, HPC has also gained importance among user groups that previously did not belong to the classic HPC community, not least because of the increasing popularity of deep learning. However, the use of HPC systems has not changed in a long time: access is often still via the console, jobs are written as Bash scripts, and the most popular programming languages are still C or FORTRAN. These hurdles make it difficult to use HPC systems, especially for new users. In this talk we will show how Python can be used as a programming language for parallel systems, how, for example, GPUs can be programmed using Numba or applications parallelized using Dask, and how the performance compares to the classical C or MPI approach. Even though Python's performance is very good in many areas, interpreter overhead remains a problem in many places, even when the computationally intensive parts are executed directly on the CPU or GPU. This is especially a problem for strong scaling. Access to HPC systems must also be simplified: access via Jupyter notebooks or graphical user interfaces simplifies use and reduces the mistakes that beginners in particular make. There are also good opportunities in teaching to simplify access to the systems. Such interfaces expose only a limited range of functions, but for most users of these systems this is completely sufficient.
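The interpreter-overhead point can be seen even without Numba or Dask: the same reduction written as a pure-Python loop versus a vectorized NumPy call differs by orders of magnitude, because only the latter runs its hot loop in compiled code. A minimal, illustrative measurement:

```python
import time
import numpy as np

def dot_python(a, b):
    """Element-by-element dot product: every iteration pays the
    interpreter's dispatch and boxing overhead."""
    s = 0.0
    for x, y in zip(a, b):
        s += x * y
    return s

n = 500_000
rng = np.random.default_rng(0)
a, b = rng.random(n), rng.random(n)

t0 = time.perf_counter()
s_loop = dot_python(a, b)
t1 = time.perf_counter()
s_vec = float(a @ b)        # hot loop runs in compiled (BLAS) code
t2 = time.perf_counter()
# Both produce the same result, but the pure-Python loop is typically
# orders of magnitude slower than the vectorized call.
```

Tools like Numba close this gap by JIT-compiling the loop itself, which is why they matter for Python on HPC systems.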
High-Performance Computing (HPC) is at an inflection point in its evolution. General-purpose architectures approach limits in terms of speed and power/energy, requiring the development of specialized architectures to deliver accelerated performance. Additionally, the arrival of new user communities and workloads---including machine learning, data analytics, and quantum simulation---increases the breadth of application characteristics we need to support, putting pressure on the complexity of the architectural portfolio. At the same time, data movement has been identified as a main culprit of energy waste, pushing hardware designers towards a tighter integration of the different technologies. The resulting integrated systems offer great opportunities in terms of power/performance tradeoffs, but also lead to challenges on the software side. In this position paper, we highlight the trends leading us to integrated systems and describe their substantial advantages over simpler, singly accelerated designs. Further, we highlight their impact on the corresponding software stack, its challenges, and its impact on the user. This introduces a different way to design, program, and operate HPC systems, and ultimately leads to the need to drop some long-held dogmas or beliefs in HPC systems.
Session Chair: Marco Platzner
With the advancement of HLS technology, FPGAs are finally drawing attention as power-efficient accelerator devices. Unlike GPUs, FPGAs allow the computation pipeline and the FPGA-to-FPGA interconnection to be tightly coupled, because they have high-speed serial transceivers on the device itself. This direct connection between computation and network encourages building FPGA clusters with direct high-speed serial links between FPGAs. For these links, commercial IP cores with proprietary protocols are widely used; however, such IP cores and protocols are not fully customizable. In this paper, Kyokko, an open high-speed serial link controller, is introduced. Kyokko is based on the Xilinx Aurora 64B/66B protocol. It achieves link rates of 10+ Gbps with flow control and channel bonding support, with a smaller resource requirement and lower communication latency than Xilinx's Aurora IP core. The transceiver interface of Kyokko is portable and interoperable between Intel and Xilinx FPGAs. Also, because Kyokko is an open-source project, users can extend the protocol to meet their demands.
The modern deep learning method based on backpropagation has surged in popularity and has been used in multiple domains and application areas. At the same time, there are other -- less-known -- machine learning algorithms with a mature and solid theoretical foundation whose performance remains unexplored. One such example is the brain-like Bayesian Confidence Propagation Neural Network (BCPNN). In this paper, we introduce StreamBrain -- a framework that allows neural networks based on BCPNN to be practically deployed in High-Performance Computing systems. StreamBrain is a domain-specific language (DSL), similar in concept to existing machine learning (ML) frameworks, and supports backends for CPUs, GPUs, and even FPGAs. We empirically demonstrate that StreamBrain can train the well-known ML benchmark dataset MNIST within seconds, and we are the first to demonstrate BCPNN on STL-10 size networks. We also show how StreamBrain can be used to train with custom floating-point formats and illustrate the impact of using different bfloat variations on BCPNN using the FPGA.
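A bfloat16-style format of the kind mentioned can be modeled in software by truncating the low 16 bits of an IEEE float32, keeping the 8-bit exponent and shortening the mantissa to 7 explicit bits. The sketch below is a generic illustration (round-toward-zero truncation; it is not StreamBrain's implementation, which would typically round to nearest):

```python
import numpy as np

def to_bfloat16(x):
    """Truncate float32 values to bfloat16 precision by zeroing the low
    16 mantissa bits. Keeps float32's dynamic range but only ~2-3
    decimal digits of precision."""
    u = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (u & np.uint32(0xFFFF0000)).view(np.float32)
```

Emulating the format this way lets one study the accuracy impact of reduced precision on a learning rule before committing to an FPGA arithmetic implementation.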
For many, Graphics Processing Units (GPUs) provide a source of reliable computing power. Recently, Nvidia introduced its 9th-generation HPC-grade GPU, the Ampere-based A100, claiming significant performance improvements over previous generations, particularly for AI workloads, and introducing new architectural features such as asynchronous data movement. But how well does the A100 perform on non-AI benchmarks, and can we expect the A100 to deliver the application improvements we have grown used to with previous GPU generations?
In this paper, we benchmark the A100 GPU and compare it to four previous generations of GPUs, with a particular focus on empirically quantifying our derived performance expectations and, should those expectations not be met, investigating whether the introduced data-movement features can offset any loss in performance. We find that the A100 delivers a smaller performance increase than previous generations on the well-known Rodinia benchmark suite; we show that some of these performance anomalies can be remedied through clever use of the new data-movement features, which we microbenchmark, demonstrating where (and, more importantly, how) they should be used.
Session Chair: Paul Chow
Massive Multiple-Input Multiple-Output (MIMO) is being used in the fifth generation of wireless communication systems. As the number of antennas increases, the computational complexity grows dramatically, and this involves matrix calculations with complex numbers. We have designed and implemented a general matrix arithmetic processor on an FPGA to accelerate these calculations, including matrix multiplication, Singular Value Decomposition (SVD), and QR decomposition. The design can be implemented in fixed-point or single-precision floating-point arithmetic, depending on the requirements of the application. The system behavior can be controlled by instructions such as elementary multiplication, rotation, and vector projection, which allows the system to work as a coprocessor in a baseband System on a Chip (SoC). The latency for switching from one matrix computation to another is just a few clock cycles, providing the low latency required for edge processing. The design is implemented and verified on the Xilinx RFSoC development board using fixed-point and single-precision floating-point numbers and matrix sizes of 8×8 and 16×16.
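QR decomposition on such a processor is typically built from exactly the rotation primitives the instruction set exposes. As a software illustration only (real-valued and unoptimized, not the paper's complex fixed-point design), here is QR via Givens rotations:

```python
import numpy as np

def givens_qr(A):
    """QR decomposition of an m-by-n matrix using Givens rotations.
    Each 2x2 rotation zeroes one sub-diagonal entry of R; in a hardware
    coprocessor each such rotation is a single instruction."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for j in range(n):
        for i in range(m - 1, j, -1):      # zero column j from the bottom up
            a, b = R[i - 1, j], R[i, j]
            r = np.hypot(a, b)
            if r == 0.0:
                continue
            c, s = a / r, b / r
            G = np.array([[c, s], [-s, c]])  # rotates (a, b) onto (r, 0)
            R[[i - 1, i], :] = G @ R[[i - 1, i], :]
            Q[:, [i - 1, i]] = Q[:, [i - 1, i]] @ G.T  # accumulate Q
    return Q, R

rng = np.random.default_rng(1)
A = rng.random((4, 3))
Q, R = givens_qr(A)
```

Because each rotation touches only two rows, the schedule parallelizes and pipelines well, which is one reason rotation-based decompositions are favored in FPGA baseband processing.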
This paper received the best paper award of the HEART'21 symposium.
Compilation times for large Xilinx devices, such as the Amazon F1 instance, are on the order of several hours. However, today’s data center designs often have many identical processing units (PUs), meaning that conventional design flows waste time placing and routing the same problem many times. Furthermore, the connectivity infrastructure of a design tends to be finalized before the PUs, resulting in unnecessary recompilation of a large fraction of the design.
We present an open source flow where the connectivity infrastructure logic is implemented ahead of time and routed to many interface blocks that border available slots for PUs. As architects iterate on their PU designs, they only need to perform a single set of parallel, independent compile runs to implement and route the PU alongside each distinct interface block. Our RapidWright-based system stitches the implemented PU into the available slots in the connectivity logic, requiring no additional routing to finalize the design. Our system is able to generate working designs for Amazon F1, and reduces compilation time over the standard monolithic compilation flow by an order of magnitude for designs with up to 180 PUs. Our experiments also show that there is future potential for an additional 4X runtime improvement when relying on emerging open source place and route tools.
Session Chair: Marco Platzner
Compute nodes are increasingly being equipped with GPGPU and FPGA accelerators to increase throughput and energy efficiency. However, they face uncertainty during operation, which static operation techniques are not able to tackle. To this end, this paper investigates the use of the learning classifier system XCS for the assignment of independent tasks. The initial results indicate that the operational environment must exhibit patterns that can be learned and exploited to gain an advantage over static heuristics.
Industrial Analytics is a key part of the transformation process towards Industry 4.0 and smart production systems. The demand for high-performance, low-latency edge-computing devices is increasingly satisfied by hardware/software approaches. This extended abstract presents ReconOS64, an approach for enabling thread-based co-working on high-performance reconfigurable Systems-on-Chip.
This work proposes a novel dataflow architecture for the Smith-Waterman Matrix-fill and Traceback stages, which are at the heart of short-read alignment on NGS data. The FPGA accelerator is coupled with radical software restructuring of the widely used Bowtie2 aligner to deliver end-to-end speedup while preserving accuracy.
OpenCL-based HLS frameworks are known to reduce the development effort for FPGAs while offering good quality of results. The upcoming trend of equipping heterogeneous compute clusters with hybrid networks for inter-CPU and inter-FPGA communication allows scaling FPGA workloads beyond a single FPGA. However, there is no tool for performance characterization of those FPGA-accelerated systems and their communication networks yet.
To fill this gap, we implemented a parametrizable OpenCL benchmark suite for FPGAs based on the HPC Challenge for CPUs. As a first step, we showed that the parametrization of the benchmarks allows a high quality of results on Intel and Xilinx FPGAs. In the future, we plan to extend the benchmarks to scale over multiple FPGAs to create a performance characterization tool for multi-FPGA HPC systems and to further investigate the opportunities and limitations of different inter-FPGA communication approaches.
Field Programmable Gate Arrays (FPGAs) have enjoyed significant popularity in financial algorithmic trading. Such systems typically involve high-velocity data, for instance arriving from markets, streaming through FPGAs, which then undertake real-time transformations to deliver insights within tight time constraints. Such high-bandwidth, low-latency data processing approaches have proven highly successful in delivering important insights to financial trading floors.
This paper describes the CoopCL framework, which is specifically designed to reduce multi-device programming complexity. CoopCL consists of three core components: a C++ API, a custom compiler, and a runtime that abstracts and unifies cooperative workload execution on multi-device heterogeneous platforms. CoopCL takes an OpenCL-C kernel function and automatically uses the platform's devices to execute it in parallel. We show a runtime system that transparently manages the data transfers and the synchronization necessary to ensure memory coherency without requiring any effort from the programmer. The programmer is only responsible for developing a data-parallel kernel function in OpenCL, while the system automatically partitions the workload across an arbitrary set of devices, executes the partial workloads, and efficiently merges the partial results together. Moreover, the CoopCL runtime, in combination with the custom compiler, includes an auto-tuning approach based on a platform performance model. CoopCL is completely portable across different machines, including platforms with multi-core CPUs and discrete and/or integrated GPUs.
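The partition-execute-merge scheme described above can be sketched in a few lines. The snippet below is a host-side illustration only: the device names and speed shares are invented, and every "device" is simulated on the host rather than dispatched through CoopCL's runtime:

```python
import numpy as np

def run_partitioned(kernel, data, device_shares):
    """Split a data-parallel workload across 'devices' in proportion to an
    assumed relative speed, run each chunk, and merge the partial results.
    A real runtime would also tune the shares from a performance model."""
    shares = np.array([share for _, share in device_shares], dtype=float)
    bounds = np.floor(np.cumsum(shares) / shares.sum() * len(data)).astype(int)
    bounds[-1] = len(data)  # guard against floating-point rounding
    chunks, start = [], 0
    for (device, _), end in zip(device_shares, bounds):
        # With a real runtime this chunk would be dispatched to `device`
        # and the transfers/synchronization handled transparently.
        chunks.append(kernel(data[start:end]))
        start = end
    return np.concatenate(chunks)

# Hypothetical device mix: a CPU and two GPUs with invented speed ratios.
devices = [("cpu", 1.0), ("igpu", 2.0), ("dgpu", 4.0)]
result = run_partitioned(lambda x: x * x, np.arange(14.0), devices)
```

The merge step is trivial here because the kernel is element-wise; reductions or stencils would need the halo and synchronization handling that the CoopCL runtime abstracts away.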
Talk to the speakers of the PhD Forum in breakout rooms.
Session Chairs: Smail Niar and Ihsen Alouani
Real-time object detection, recognition, and tracking are essential for safety-critical applications such as autonomous vehicles, as well as for video analytics, where critical information must be extracted from video streams for applications such as surveillance for security, health and safety monitoring in healthcare and industry, intelligent transportation systems, and smart cities. To reduce latency and security vulnerabilities, processing on edge devices is critical. However, the algorithms usually used in these applications to achieve the required level of accuracy are very computationally intensive. In this talk, two case studies are discussed to present the challenges and techniques of real-time object detection and tracking. First, we present a hardware/software co-design approach for two critical tasks, real-time pedestrian detection and vehicle detection, which are essential in advanced driver assistance systems (ADAS) and autonomous driving systems (ADS). In the second part of the talk, initial work on person re-identification and tracking is presented, focusing on the challenges of processing at the edge.
Gigantic rates of data production in the era of Big Data, the Internet of Things (IoT), and smart Cyber-Physical Systems (CPS) pose incessantly escalating demands for massive data processing, storage, and transmission while continuously interacting with the physical world under unpredictable, harsh, and energy-/power-constrained scenarios. Therefore, such systems need to support not only high-performance capabilities under a tight power/energy envelope, but also need to be intelligent/cognitive and robust. This has given rise to a new age of machine learning (and, in general, artificial intelligence) at different levels of the computing stack, ranging from the Edge and Fog to the Cloud. In particular, Deep Neural Networks (DNNs) have shown tremendous improvement over the past years, achieving significantly high accuracy for a certain set of tasks, like image classification, object detection, natural language processing, and medical data analytics. However, these DNNs require highly complex computations, incurring huge processing, memory, and energy costs. To some extent, Moore's Law helps by packing more transistors into the chip. But, at the same time, every new generation of device technology faces new issues and challenges in terms of energy efficiency, power density, and diverse reliability threats. These technological issues and the escalating challenges posed by the new generation of IoT and CPS systems force us to rethink the computing foundations, architectures, and system software for embedded intelligence. Moreover, in the era of growing cyber-security threats, the intelligent features of smart CPS and IoT systems face new types of attacks, requiring novel design principles for enabling robust machine learning.
In my research group, we have been extensively investigating the foundations for the next generation of energy-efficient and robust AI computing systems, addressing the above-mentioned challenges across the hardware and software stacks. In this talk, I will present different design challenges for building highly energy-efficient and robust machine learning systems for the Edge, covering both efficient software and hardware designs. After a quick overview of the design challenges, I will present the research roadmap and results from our Brain-Inspired Computing (BrISC) project, ranging from neural processing with specialized machine learning hardware to efficient neural architecture search algorithms, covering both fundamental and technological challenges, which enable new opportunities for improving the area, power/energy, and performance efficiency of systems by orders of magnitude. This talk will argue that a cross-layer design flow for embedded machine learning/EdgeAI, one that jointly leverages efficient optimizations at different software and hardware layers, is a crucial step towards enabling the wide-scale deployment of resource-constrained embedded AI systems like UAVs, autonomous vehicles, robotics, IoT-healthcare/wearables, Industrial IoT, etc.
Field Programmable Gate Arrays (FPGAs) are a promising platform for accelerating image processing as well as AI-based applications due to their parallel architecture, reconfigurability, and energy efficiency. However, programming such reconfigurable computing platforms can be quite cumbersome and time-consuming compared to CPUs or GPUs. This presentation shows concepts and realizations for reducing the programming effort for FPGAs. We propose a reconfigurable RISC-V based many-core architecture together with our OpenVX-based design methodology. For generating image processing and neural network accelerators, we provide an open-source High-Level Synthesis (HLS) library called HiFlipVX. The importance of such an approach is shown with several research projects in the automotive and robotics domains.