Possible future project: XillyHPC, Xillybus IP for hardware acceleration

Introduction

Different future adaptations of the Xillybus IP core are being examined in order to better meet emerging needs in the industry. With High Performance Computing in the headlines lately, a specialized IP core for that purpose, XillyHPC, has been considered. However, as interest in such a project appears to be low, XillyHPC is currently not on Xillybus' project roadmap.

This page describes the ideas behind this possible project, allowing potential users to show their interest and submit their comments. The decision whether and how to develop and release XillyHPC depends largely on the community's reaction.

If you believe that you could benefit from this project, by all means please make yourself heard: An email describing your needs, and to what extent they're met by existing solutions, will be appreciated. In the absence of interest, this project is on hold.

Each player in the industry has its own take on the future and challenges of High Performance Computing. The rest of this page presents Xillybus' view, which is open for discussion and refinement.

Acceleration concepts

No matter how you twist and turn it, the clock rate of any acceleration hardware is always lower than that of the processor that uses it, in particular when the acceleration is implemented on an FPGA. The only way to gain performance is by performing a lot of operations in parallel on each clock cycle.

There appear to be two distinct principles for achieving this:

"Serialized" acceleration: Multiple operations take place simultaneously by pipelining the algorithm's steps. This allows a high execution throughput, but for this to be beneficial, the algorithm must be such that its operations can be organized in a long pipeline. For example, some block encryption algorithms consist of a long sequence of manipulations, where each step depends only on the data from the previous one. High Level Synthesis (HLS) tools typically benefit from this type of pipelining.
"Parallelized" acceleration: Multiple operations take place simultaneously by virtue of multiple execution units (e.g. "shaders") working in parallel. The level of dependency between these execution units varies between implementations, but it's common to require that all execution units run the same execution flow, only with different data fed into each unit. This principle stands behind GPGPUs and all implementations of OpenCL. Computer graphics goes hand in hand with this methodology, as it often involves executing the same operation on a large number of pixels.
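The difference between the two principles can be illustrated in plain C. This is only a sketch: the stage functions and the per-element operation are arbitrary placeholders invented for this example, and nothing here is tied to a specific HLS or OpenCL tool.

```c
#include <stddef.h>
#include <stdint.h>

/* "Serialized": each stage depends only on the output of the previous one.
 * In hardware, every stage can hold a different data item at the same
 * time, so one result may emerge per clock cycle once the pipeline is full.
 * The three stages are arbitrary placeholders. */
static uint32_t stage_a(uint32_t x) { return x ^ 0x5a5a5a5au; }
static uint32_t stage_b(uint32_t x) { return (x << 3) | (x >> 29); }
static uint32_t stage_c(uint32_t x) { return x + 0x9e3779b9u; }

uint32_t serialized_item(uint32_t x)
{
    return stage_c(stage_b(stage_a(x)));
}

/* "Parallelized": the same operation is applied to independent elements.
 * No iteration depends on another, so the iterations can be spread over
 * many identical execution units. */
void parallelized_block(const uint32_t *in, uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * 3u + 1u;
}
```

In the first function the parallelism comes from the depth of the chain of stages; in the second, from the number of elements that can be handled at once.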

As far as Xillybus is concerned, "serialized" acceleration is already covered, as the host/FPGA communication for such an acceleration scheme requires exactly what Xillybus offers: Streams of data going back and forth.

However, the Xillybus IP core is less useful for "parallelized" acceleration, which seems to be the direction recently taken by both Altera and Xilinx, each presenting an OpenCL frontend for a non-HDL flow for generating the application logic. This is where XillyHPC comes in.

Feeding the beast

When accelerating an algorithm for a real-life application, quite often the performance bottleneck turns out to be not the processing power and its utilization, but data starvation. It's not only a matter of raw data bandwidth: The access pattern plays a crucial role as well, as both memory (most of which is DDR SDRAM) and transport (internal and peripheral buses alike) are burst-oriented. This infrastructure typically allows for an impressive throughput when chunks of data are required, but may become extremely inefficient when the access pattern is sporadic. The access latency can also contribute its share in slowing down the processing.

The root of this problem is the DDR memories themselves: As their clock frequencies rise, their access latency, measured in clock cycles, rises as well. More data can be accessed in a given time period, but the inherent latencies for requesting and releasing a certain row in the memory array remain rather high. Sophisticated processors employ caching mechanisms that manage to hide the bursty nature of the data access infrastructure, allowing software with a large range of access scenarios to run efficiently. GPGPUs are equipped with other solutions for tackling the same problem.

The caching mechanisms are often overlooked, even though they play a crucial role in the de-facto performance obtained from any processing machine. The difficulty of understanding and quantifying these mechanisms is probably the reason why so little attention is given to this important topic.

When using an FPGA for acceleration, there is no out-of-the-box caching mechanism to rely on. As many FPGA engineers know too well, a large part of the design of any FPGA logic involving data processing is to bring the right piece of data to the right place at the right time.

There is no way around this issue: If any given FPGA logic issues memory (or PCIe) bus requests for each byte or 32-bit word it needs to access at that moment, the bus infrastructure will become saturated with inefficient requests, which will also make bad use of the DDR memories. This results in catastrophic performance.
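A rough feel for the cost of small requests can be had from a very simple model, in which every request pays a fixed overhead on top of its payload. The function below is a sketch of that model only; the lump overhead figure stands for packet headers, bus turnaround, row activation and the like, and any number plugged into it is an assumption rather than a value taken from a specification.

```c
/* Fraction of bus time spent on payload, for a request that moves
 * payload_bytes and costs the equivalent of overhead_bytes in bus time.
 * overhead_bytes is a hypothetical lump figure, not a measured value. */
double bus_efficiency(unsigned payload_bytes, unsigned overhead_bytes)
{
    return (double)payload_bytes /
           (double)(payload_bytes + overhead_bytes);
}
```

With an assumed overhead equivalent to 24 bytes, a 4-byte request yields 4/28, or about 14% useful traffic, while a 256-byte burst yields 256/280, or about 91%. The exact figures depend entirely on the assumed overhead, but the trend does not.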

The suggested solution

Xilinx has already presented a solution for translating a C function into FPGA logic -- Vivado HLS, and Intel (formerly Altera) has also announced its HLS compiler. Accessing data on the FPGA's own small RAMs is of course fast. So wouldn't it be nice if there was a cache controller that copied data for processing from the main memory (i.e. DDR SDRAMs) to the FPGA's RAM in advance? And then, when the processing for this part is done, write back the results into the correct place in the main memory?

This brings us to the main point: The programmer of the HLS functions can easily plan which segments of the main memory to load before processing, and where to write data back to. So the synthesized HLS function can begin with calling routines that copy data from the main memory into a local array, do the processing, and then call routines that copy the results back to the main memory. These routines would be plain C functions that end up as logic on the FPGA. This logic interacts, in turn, with an external framework, which routes and muxes the memory access requests to a bus interface (e.g. a PCIe front end).
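A function written this way might look roughly as follows. This is a minimal sketch and not an existing API: the names xhpc_read() and xhpc_write(), the chunk size and the processing step are all hypothetical. So that the example runs on an ordinary computer, main memory is modeled as a plain array and the copy routines are implemented with memcpy(); on the FPGA they would instead become logic that issues burst requests to the framework.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_WORDS 256  /* words per chunk; arbitrary for this sketch */

/* Stand-in for main memory (DDR SDRAM), so the sketch runs on a host */
static uint32_t main_memory[4096];

/* Hypothetical routine: copy `words` words from main memory offset
 * `offset` into the local array `local`. */
static void xhpc_read(uint32_t *local, size_t offset, size_t words)
{
    memcpy(local, &main_memory[offset], words * sizeof(uint32_t));
}

/* Hypothetical routine: copy `words` words from the local array `local`
 * to main memory offset `offset`. */
static void xhpc_write(size_t offset, const uint32_t *local, size_t words)
{
    memcpy(&main_memory[offset], local, words * sizeof(uint32_t));
}

/* The function intended for synthesis: fetch, process locally, write back */
void process_chunk(size_t src_offset, size_t dst_offset)
{
    uint32_t buf[CHUNK_WORDS];  /* would typically map to FPGA RAM */

    xhpc_read(buf, src_offset, CHUNK_WORDS);    /* one burst in */

    for (size_t i = 0; i < CHUNK_WORDS; i++)    /* placeholder operation */
        buf[i] = buf[i] * 2u + 1u;

    xhpc_write(dst_offset, buf, CHUNK_WORDS);   /* one burst out */
}
```

The point of the structure is that main memory is touched only twice per call, each time with a large contiguous transfer, while all per-word accesses go to the local array.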

In short: Let the synthesized function be its own cache controller. It may seem a waste of time to wait for these memory reads and writes to finish, but there's in fact no need to: Except for the first set of chunks, it's quite simple to allow the processing of one set of data while the memory transaction for the next set takes place.
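The overlap can be arranged with two local buffers used alternately, a scheme often called double buffering or ping-pong buffering. The sketch below shows only the control structure, under stated assumptions: fetch_set() and store_set() are hypothetical routines, main memory is modeled as a plain array, and the set sizes are arbitrary. In sequential C the fetch of the next set simply runs before the processing of the current one; in hardware the same request would be issued and left to complete in the background.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SET_WORDS 64  /* words per data set (arbitrary) */
#define NUM_SETS  8   /* number of data sets (arbitrary) */

/* Stand-in for main memory: input sets first, output sets after them */
static uint32_t ddr[2 * NUM_SETS * SET_WORDS];

/* Hypothetical routine: load input set number `set` into a local buffer */
static void fetch_set(uint32_t *local, size_t set)
{
    memcpy(local, &ddr[set * SET_WORDS], SET_WORDS * sizeof(uint32_t));
}

/* Hypothetical routine: write a local buffer to output set number `set` */
static void store_set(size_t set, const uint32_t *local)
{
    memcpy(&ddr[(NUM_SETS + set) * SET_WORDS], local,
           SET_WORDS * sizeof(uint32_t));
}

void run_double_buffered(void)
{
    uint32_t buf[2][SET_WORDS];  /* the two alternating local buffers */

    fetch_set(buf[0], 0);  /* first set: nothing to overlap with yet */

    for (size_t set = 0; set < NUM_SETS; set++) {
        uint32_t *cur = buf[set & 1];
        uint32_t *next = buf[(set + 1) & 1];

        /* Request the next set before processing the current one */
        if (set + 1 < NUM_SETS)
            fetch_set(next, set + 1);

        for (size_t i = 0; i < SET_WORDS; i++)  /* placeholder operation */
            cur[i] += 1000u;

        store_set(set, cur);
    }
}
```

A real implementation would also have to make sure that the write-back of a buffer has completed before that buffer is refilled, which the sequential model above gets for free.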

All in all, there will be a large number of independent execution units, each fetching its own data from the main memory before processing, and writing the results back afterwards.

This allows an efficient utilization of the total PCIe bandwidth, and works well with the latencies and bursty nature of the bus infrastructure.

What XillyHPC facilitates

Unlike the regular Xillybus IP core, XillyHPC mainly serves I/O requests originating from the logic, and not from the host. The host's software sets up the array of data for processing, controls the execution flow and consumes the results, but the vast majority of XillyHPC's traffic takes place without the host's involvement.

Scalability is the key issue. It must be easy to deploy and manage a large number of independent instances of the synthesized HLS functions (i.e. the execution units), and manage their simultaneous execution.

XillyHPC hence provides:

The instantiation of a large number of instances of the synthesized HLS function: This allows the user of the core to set the number of desired execution units, and let XillyHPC's framework create the wrappers that instantiate and wire the logic module's inputs and outputs.
The management of the cluster of execution units: XillyHPC divides the workload of a large task, assigning each execution unit in the cluster its part. The host's application software doesn't have to be aware of the number of execution units.
Arbitration, routing and management of memory access requests from the execution units in a scalable manner. This is the logic machinery that allows the HLS function to issue calls to main memory read/write routines.
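The management of the cluster can be pictured with a small software model. This is a sketch under assumptions, not a description of actual XillyHPC logic: NUM_UNITS, execution_unit() and run_task() are names invented for the example, and the model hands out work in rounds, whereas hardware could assign the next execution to whichever unit becomes free first.

```c
#include <stdint.h>

#define NUM_UNITS 4  /* number of execution units in the cluster (arbitrary) */

/* Stands for one run of the synthesized function on one execution unit.
 * `index` identifies the execution; `scale` is a parameter common to all
 * executions. The operation itself is a placeholder. */
static void execution_unit(unsigned index, uint32_t scale, uint32_t *results)
{
    results[index] = (uint32_t)index * scale;
}

/* Hands out execution indexes to the units until the task is done.
 * Returns the number of rounds that were needed. */
unsigned run_task(unsigned num_executions, uint32_t scale, uint32_t *results)
{
    unsigned next_index = 0;
    unsigned rounds = 0;

    while (next_index < num_executions) {
        /* One round: each unit gets at most one execution */
        for (unsigned unit = 0;
             unit < NUM_UNITS && next_index < num_executions; unit++)
            execution_unit(next_index++, scale, results);
        rounds++;
    }
    return rounds;
}
```

Note that the caller only states how many executions the task consists of. The number of units affects how long the task takes, not how it is requested, which is what allows the host's software to remain unaware of it.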

Summary of XillyHPC's main principles

There are several execution units, each requesting memory reads into a local buffer and memory writes from it.
Each execution unit fetches memory segments at the beginning of the HLS function, runs a loop executing the desired operation on local memory, and then writes back the results. Then the function "returns" (effectively signals XillyHPC that it's ready for another round).
Each execution unit is given its index number as a distinct parameter, along with a set of parameters that is common to all executions.
A memory operation is “read X words from main memory offset Y into local memory address Z”, or the same in the opposite direction for a write.
The memory operations may depend on the index directly, or may be calculated depending on the outcome of the execution itself.
XillyHPC maintains the pool of execution units, kicks off executions and informs the host when the entire set of executions is finished.
The number of executions required for a task may very well exceed the number of execution units.
Each execution unit may have more than one local memory. In other words, several separate FPGA RAMs can be used by a single execution module. This is a rather straightforward result of the ability to define multiple arrays in HLS C.
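The memory operation described in the list above could be represented as a small descriptor. The following is a sketch only: the type and field names are invented for this example, and the way the offset is derived from the index (a fixed-size slice per execution) is just one simple possibility among those the principles allow.

```c
#include <stddef.h>

typedef enum { MEM_READ, MEM_WRITE } mem_dir;

/* One memory operation, as requested by an execution unit */
typedef struct {
    mem_dir dir;         /* main memory to local memory, or vice versa */
    size_t words;        /* X: number of words to transfer */
    size_t main_offset;  /* Y: offset in main memory */
    size_t local_addr;   /* Z: address in the unit's local memory */
} mem_op;

/* Example of an operation that depends directly on the execution index:
 * each execution reads its own slice of `words_per_exec` words, starting
 * at `base`, into the beginning of its local memory. */
mem_op read_op_for_index(unsigned index, size_t base, size_t words_per_exec)
{
    mem_op op;

    op.dir = MEM_READ;
    op.words = words_per_exec;
    op.main_offset = base + (size_t)index * words_per_exec;
    op.local_addr = 0;
    return op;
}
```

An operation whose offset depends on the outcome of the execution would fill in the same fields, only with values computed during processing rather than from the index alone.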