4 Acceleration / coprocessing best practices

4.1 Throughput vs. latency

There’s a significant difference between traditional hardware acceleration, which is based upon enhanced instruction sets (e.g. the x86 family’s MMX instructions, crypto extensions for AES, and ARM’s NEON extension), and acceleration with external hardware, such as GPGPU and FPGA. Because the enhanced instruction sets are part of the processor’s execution flow, they replace a long sequence of machine code instructions with a shorter one, and reduce the number of cycles required until the result is available.

External hardware acceleration (FPGA acceleration included), on the other hand, does not necessarily reduce the time until the result is available, due to the significant latency of transporting the data to and from the external hardware. In addition, the processing time itself may be significantly longer than the processor’s, due to pipelining and possibly a lower clock frequency.

Hence, the advantage of external hardware acceleration is not latency (how fast the result is obtained) but throughput (the rate at which the data is handled). In order to utilize this advantage, it’s important to maintain a flow of data going to and from the accelerating hardware, rather than waiting for the results of one operation before initiating the next one.

The technique for proper acceleration with an FPGA is elaborated in section 6.6 of either of these two documents:

4.2 Data width and performance

For applications that require relatively high data bandwidths, it’s recommended to use 32-bit wide streams (or wider) for the data-intensive streams. This is because 8 and 16-bit wide streams utilize the host’s bus less efficiently.

The reason is that the words are transported through the Xillybus internal data paths one word per time slot, regardless of their width. As a result, transporting an 8-bit word occupies the same time slot as a 32-bit word, making the narrow stream effectively four times slower.

This also impacts other streams that compete for the underlying transport at the same time, since the data paths are occupied with slower data elements.

This guideline doesn’t apply to revision B/XL/XXL Xillybus IP cores, which transport narrow streams with the same efficiency as wide ones.

4.3 Do’s and don’ts

There are a few issues to note when working with the Block Design Flow:

  • Streams with address ports (“address/data streams”, “seekable streams”) are not supported in the Block Design Flow. If the Xillybus IP Core includes such streams, they do not appear as ports in the GUI, but do appear normally on the host side. Attempts to read from such a stream on the host will yield an immediate end-of-file condition. A write() function call will not return, as there’s no data sink on the other end.

    It’s therefore recommended to avoid seekable streams in custom IP cores that are intended for use with the Block Design Flow, in order to avoid confusion and a slight waste of FPGA logic.

  • Don’t make changes to the block named “stream_clk_gen” (the Clocking Wizard), except for changing its output frequency if necessary, as described in section 3.2.2.

Changing the input clock’s frequency, making other changes to the configuration, or removing the block from the design and replacing it with a fresh Clocking Wizard IP block may lead to a failure to meet the timing constraints (possibly because some timing constraint exceptions refer to the block by name).

Setting an incorrect input frequency may lead to unreliable behavior of the FPGA design.

  • It’s important to pay attention to how the clocks are connected. In particular, don’t mix up bus_clk and ap_clk.

  • Make sure that the Xillybus streams are asynchronous. This is the case in the default IP core, as well as with the automatic setting in custom IP cores when the stream’s intended use is “Data exchange with coprocessor”.

    Among other things, this makes write() function calls on the host return immediately when there is enough room for the data in the DMA buffers, ensuring a smoother data transport and higher bandwidth performance.

    For a better understanding of this topic, please refer to section 2 of either Xillybus host application programming guide for Linux or Xillybus host application programming guide for Windows.