It is often desirable in computation-heavy applications to take some of the load off the CPU by letting a piece of hardware perform the heavy number crunching. GPUs are commonly used for this purpose, but they impose strict limitations on the accelerated algorithm, which must be met in order to achieve an effective speedup. Programmable logic (FPGAs) poses an attractive alternative in many cases, as the accelerating hardware doesn’t have a hardwired architecture of its own. Rather, the algorithm in question shapes the architecture: logic building blocks are placed in parallel or pipelined, depending on the expected data flow, to achieve a high utilization of the device’s capacity.
When traditional logic design techniques are employed, the human effort necessary to harness the FPGA’s capabilities for a complex algorithm often makes the FPGA an unattractive choice. Engineers with adequate skills for this task are not always available.
High-Level Synthesis (HLS) is a recent technique for utilizing programmable logic without using the traditional hardware description languages (Verilog / VHDL) and with no need for prior knowledge of FPGA/VLSI design practices. The HLS tools compile a C/C++ function into logic elements, aiming to utilize the programmable device efficiently for fast operation and economical resource usage. This opens an opportunity for regular programmers to write custom coprocessing logic without being FPGA experts.
It’s recommended to have a look at Xilinx’ User Guide to HLS for more insight.
Interfacing with the FPGA
While HLS reduces the knowledge and effort needed for translating the C/C++ function into a logic module, there is still a need to interface between the logic fabric and the computer program that uses the coprocessing feature. A fast, DMA-based bidirectional data transport needs to be set up, including logic and host drivers, in order to achieve performance that justifies coprocessing. The task of setting up this interface may require more effort and knowledge than designing the core function’s logic implementation, in particular when HLS is used.
Xillybus’ IP core and host drivers supply a simple end-to-end solution for the data transport. Together with HLS, a full-blown coprocessing system with a simple programming interface can be set up without any FPGA expertise. The combination of Xillybus and HLS dramatically simplifies the process of setting up an HLS-based logic design for PCIe-enabled FPGAs or Xilinx’ Zynq processor.
The key features of this combination are:
- No FPGA-related knowledge is required
- No need to develop any hardware drivers: All communication with the FPGA is done by a plain user-space program
- Supports Linux and Windows for the host application
- Older FPGA families are supported (including Spartan-6, Virtex-5 and Virtex-6)
- Simple printf-like debugging is available on the logic as it runs on the FPGA
Outline of software
For the sake of simplicity, we’ll assume that there is a single C/C++ function which completely performs the part of the algorithm for which hardware offloading is desired. This function may preserve a state (i.e. have static variables). The suggested solution doesn’t impose this limitation, but having more than one offload function requires some rather trivial technical modifications, which are left out to keep things simple.
Three software elements are involved:
- The host program: The C/C++ program running on a CPU (or several CPUs)
- The synthesized function: The C/C++ function or method which implements the part for which hardware acceleration is desired. This function, along with the functions it calls, is compiled ("synthesized") into logic elements and runs on the FPGA.
- The wrapper function: A small piece of code which handles the interface between the host program and the synthesized function. This function is compiled along with the synthesized function, and also runs on the FPGA.
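The split between the last two elements can be illustrated in plain C. The following is only a sketch: the names `accumulate` and `wrapper`, the single-word argument and the pointer-based stream arguments are assumptions made for illustration, and any tool-specific interface directives that an HLS tool would need are left out.

```c
#include <stdint.h>

/* Hypothetical synthesized function: the computation to offload.
 * It preserves a state between invocations in a static variable. */
int32_t accumulate(int32_t sample)
{
    static int32_t total = 0;

    total += sample;
    return total;
}

/* Hypothetical wrapper function: takes one argument word that arrived
 * from the host, runs the synthesized function on it and passes the
 * result on for transmission back to the host. */
void wrapper(const int32_t *in, int32_t *out)
{
    int32_t sample = *in;       /* one data word from the host */

    *out = accumulate(sample);  /* one result word to the host */
}
```

Since this is ordinary C, the pair can also be compiled and exercised on the host before synthesis, which is a convenient way to check the algorithm itself.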
Unlike common C/C++ programming, the host program doesn’t call the synthesized function. Rather, it organizes the data needed for executing the function in a data structure and transmits it to the synthesized function, using a simple API, which is described further on in this guide. At a later stage, it collects the return data as a data structure sent from the synthesized function with a similar API.
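As a rough illustration of this exchange, the sketch below sends a request structure and collects a reply structure through ordinary file descriptors. It is not the API presented later in this guide: `struct request`, `struct reply`, `send_request()` and `recv_reply()` are hypothetical names, and the descriptors are assumed to have been opened by the host program on whatever stream the data transport exposes.

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical layouts for the arguments and the return data. */
struct request { int32_t a; int32_t b; };
struct reply   { int32_t sum; };

/* Write exactly len bytes; write() may accept fewer than requested. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read exactly len bytes; returns -1 on error or premature end of file. */
int read_all(int fd, void *buf, size_t len)
{
    char *p = buf;

    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return -1;          /* end of file before a full structure */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Transmit the data needed for one execution of the function. */
int send_request(int fd, const struct request *req)
{
    return write_all(fd, req, sizeof *req);
}

/* Collect the return data of one execution. */
int recv_reply(int fd, struct reply *rep)
{
    return read_all(fd, rep, sizeof *rep);
}
```

The loops matter because `read()` and `write()` are allowed to transfer fewer bytes than requested, so a single call is not guaranteed to move a whole structure.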
Also unlike common C/C++, it’s usually very inefficient to wait for the results from one call before sending the data for the next one. The translation of the synthesized function into logic elements is likely to result in a large number of pipeline stages. As a result, the logic may be able to process many sets of data at the same time. The best software strategy is hence to send data for processing as soon as it’s available to the host program, and have another thread or process collect the results asynchronously.
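One possible way to arrange this is sketched below, using a second process rather than a thread: a child process pushes all input words while the parent collects the results concurrently. The function name `run_pipelined()` and the one-result-word-per-input-word layout are assumptions for illustration, and the two file descriptors are assumed to be already open on the streams towards and from the logic.

```c
#include <errno.h>
#include <signal.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write exactly len bytes, retrying on partial writes and EINTR. */
static int write_full(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read exactly len bytes; -1 on error or premature end of file. */
static int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;

    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Send count input words on wfd and collect count result words from rfd,
 * without waiting for one result before sending the next input. */
int run_pipelined(int wfd, int rfd, const int32_t *in, int32_t *out,
                  size_t count)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child process: the sender. It blocks only when the transport
         * can't accept more data, never while waiting for results. */
        _exit(write_full(wfd, in, count * sizeof *in) == 0 ? 0 : 1);
    }

    /* Parent process: collects results while the child keeps sending. */
    int rc = read_full(rfd, out, count * sizeof *out);

    if (rc != 0)
        kill(pid, SIGKILL);     /* don't leave a blocked sender behind */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (rc != 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return 0;
}
```

A single process that alternates between one blocking write and one blocking read would keep only one data set in flight at a time. Separating the two roles lets the sender fill the pipeline as far as the transport's buffering allows.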