3 Scalability and logic resource consumption

3.1 General

Xillybus was designed with scalability in mind. While it makes perfect sense to configure a custom IP core with as few as a single stream, scaling up to a large number of streams has a relatively small impact on the amount of logic consumed by the Xillybus core.

In order to measure the consumption of logic, successive builds of the Xillybus IP core (baseline, revision A) were made with an increasing number of streams. In all tests, the number of streams from the FPGA to the host was the same as in the opposite direction. The total number of streams ranged from 2 (one in each direction) to 64 (32 in each direction).

This section outlines the consumption of logic by the IP core itself on three families of FPGAs, as reported by their tools. These FPGAs (by AMD) are quite outdated; however, similar results are achieved on more recent FPGAs, by AMD and Altera alike.

For a similar analysis of IP cores with revisions B, XL and XXL, see section 4.

XillyUSB is not covered in any of these analyses.

3.2 Block RAMs

The number of block RAMs used by the Xillybus core varies from zero to a few (3 block RAMs for 64 streams). There are no per-stream buffers inside the Xillybus core. Rather, the core relies on the FIFOs connected to it to collect the data. Internally, the core has a single pool of memory that is shared by all streams.

As the number of streams grows, block RAMs are used for storing the addresses of DMA buffers.

Additional block RAMs are allocated for DMA acceleration for streams from the host to the FPGA, as detailed in the core’s README file.

3.3 Resources of logic fabric

The graphs below show the consumption of LUTs and registers (flip-flops) as the number of streams goes from 2 to 64. Each dot in these graphs is the actual usage as shown in the synthesis report. What is evident from these graphs is the nearly linear growth in logic consumption. Regardless of the FPGA architecture, each stream adds about 110 LUTs and 82 registers on average.
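This nearly linear behavior can be summarized as a simple model. The following sketch estimates the incremental logic cost of adding streams, using the measured averages quoted above (~110 LUTs and ~82 registers per stream); these per-stream figures are averages over the benchmark builds, not guarantees for any particular configuration:

```python
# Rough linear model of the incremental logic consumed by Xillybus streams.
# The constants are the measured averages from the benchmark builds above;
# actual figures vary with the FPGA family and the core's configuration.

LUTS_PER_STREAM = 110
REGS_PER_STREAM = 82

def estimate_logic(num_streams: int) -> dict:
    """Estimate the incremental LUT/register cost of num_streams streams."""
    return {
        "luts": num_streams * LUTS_PER_STREAM,
        "registers": num_streams * REGS_PER_STREAM,
    }

# A heavy, 64-stream configuration adds roughly 7040 LUTs and 5248 registers
# on top of the core's baseline logic.
print(estimate_logic(64))
```

Note that this models only the per-stream increment; the core's fixed baseline logic comes on top of it.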

The number of slices actually consumed on the FPGA depends on how the elements of logic are packed into them. On the Spartan-6 and Virtex-6 families, each slice can contain up to 8 LUTs and 8 registers. Accordingly, a very optimistic approach would be to assume that the logic is packed perfectly, so each stream adds only 110/8 ≈ 14 slices to the consumed resources. On the other hand, packing with half that efficiency is achievable with no considerable effort. So the expected cost of a stream can be estimated in the range of 14-28 slices.
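The estimate above can be written out as a short calculation. This is a back-of-the-envelope sketch: the slice capacity (8 LUTs per slice) matches Spartan-6 / Virtex-6, and the packing efficiency parameter is the uncertain part, spanning the optimistic and pessimistic bounds discussed above:

```python
# Slice cost per stream as a function of packing efficiency, for FPGA
# families whose slices hold up to 8 LUTs (e.g. Spartan-6, Virtex-6).

import math

LUTS_PER_STREAM = 110   # measured average from the benchmark builds
LUTS_PER_SLICE = 8

def slices_per_stream(packing_efficiency: float) -> int:
    """Slices added per stream at a given packing efficiency (0 < eff <= 1)."""
    return math.ceil(LUTS_PER_STREAM / (LUTS_PER_SLICE * packing_efficiency))

optimistic = slices_per_stream(1.0)   # perfect packing: 14 slices
pessimistic = slices_per_stream(0.5)  # half efficiency: 28 slices
```

Registers are not the limiting factor in this estimate, since 82 registers per stream fit into fewer slices than 110 LUTs do.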

It’s important to note that when the FPGA is far from full, the tools that carry out the implementation don’t bother to pack logic into slices efficiently, so the increase in the number of slices can be significantly steeper. In this case, the tools waste resources simply because they are abundant.

The chosen setting for the benchmark test was an even 50/50 split between upstream and downstream streams. Real-life IP cores usually emphasize one of the directions, but the results below give an idea of what to expect.

The graphs follow. The slope may appear steep, but note that the number of streams goes from a minimal IP core (2 streams) to a rather heavy one (64 streams).

The bottom line is that it makes sense to allocate extra streams in the IP core, even for the most trivial tasks, since their contribution to the slice count is fairly low.