Xillyp2p status ports and topics related to the physical link

Introduction

Xillyp2p is designed to maintain robust and reliable communication streams between two FPGAs, even if the physical link experiences occasional instability or signal degradation. Among other things, this means that problems with the physical link may be virtually invisible to the user application, in particular with a bidirectional physical link: Retransmissions of data that fails to arrive correctly merely reduce the bandwidth efficiency, something that may go unnoticed.

It's therefore recommended to continuously monitor the status signals that are available through Xillyp2p's output ports, and to expose these to the human operator in a way that encourages taking action to resolve any issues with the physical link, if any occur. The design example for Ultrascale (or for series-7 FPGAs) shows, among other things, a suggestion for how to connect status signals to LEDs in a way that exposes events as brief as a single clock cycle (bit errors in particular).
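
To illustrate why a single-cycle event needs special treatment before it reaches an LED, here is a minimal cycle-based model of a retriggerable pulse stretcher. It is only a behavioral sketch in Python, not the logic used in the design example: the function name and the hold_cycles parameter are assumptions made for this illustration.

```python
def stretch_pulses(samples, hold_cycles):
    """Cycle-based model of a retriggerable pulse stretcher (hypothetical helper).

    samples:     iterable of 0/1 values, one per clock cycle
                 (for example, the value of status_bit_error).
    hold_cycles: number of clock cycles the output stays high, counted
                 from the cycle of the most recent event.
    Returns a list with the output value for each clock cycle.
    """
    counter = 0
    out = []
    for s in samples:
        if s:
            counter = hold_cycles   # (re)start the hold period on every event
        elif counter > 0:
            counter -= 1            # count down while no new event arrives
        out.append(1 if counter > 0 else 0)
    return out


# A one-cycle event is stretched to three cycles:
print(stretch_pulses([0, 1, 0, 0, 0, 0], hold_cycles=3))
```

In real logic, hold_cycles would be chosen so that the LED stays lit long enough to be seen, e.g. the number of clock cycles in roughly a tenth of a second.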

These status signals can also help considerably during the FPGA development process, as they offer some information about the reasons for possible problems. In particular, the status_link_partner_mismatch output port should be made visible (possibly with an LED), as it indicates if the link partner is rejected because core A and core B aren't a matching pair (i.e. they belong to different IP cores at the IP Core Factory), something that can easily happen while working on an FPGA project.

Also, refer to the section at the end of this guide for suggestions on how to check if an MGT or SERDES works properly.

Regardless of monitoring status signals, it's also recommended to test the application's response to error conditions by deliberately injecting bit errors during the FPGA project development. The error_test_rate input port, described in the guide about Xillyp2p's ports and API, is intended for this purpose.

Outline of main status output ports

status_link_down: When this output port is high, the communication link is not ready to transport data. The reason can be identified from the other status signals. Note that this output is intended as an indication of problems, and not as a real-time indication of whether data can be sent or received: Because this output responds with a delay, payload data may be transmitted and received even while it is high. When this output is low, it serves as the primary indication that the link is operational.
status_link_partner_mismatch: When this output port is high, the Xillyp2p IP core on the other side belongs to another pair of IP cores. When this occurs, it's required to replace the bitstream in the FPGA on the other side, so that the updated bitstream is based upon the correct Xillyp2p IP core.
status_initializing: This output port is present only when the physical link is bidirectional. When this port is high, it indicates that the bidirectional protocol is not initialized yet. Other status signals indicate the reason:
- If this output is high along with status_link_partner_mismatch, it's because the Xillyp2p IP cores on both sides don't match. See above.
- If this output is high along with bit 0 of status_debug being high, the arriving data stream on the physical link is either disconnected or severely malformed (for example, due to a misconfiguration of the MGT or SERDES, or lack of synchronization of the receiver).
- If this output becomes high occasionally, in particular while data is transmitted at a high rate, the reason can be that the frequencies of the clocks at both sides are too different. In other words, the frequency tolerance between the clocks is larger than was specified when the Xillyp2p IP cores were generated.
- Except for these three reasons, the link should initialize quickly and there should be no reason to reinitialize. Hence if the three reasons above are eliminated, it's recommended to examine if there is a reason for the FPGA to fail in general, e.g. incorrect timing constraints, timing constraints that weren't achieved, unstable clocks, problems with voltage supplies etc.
status_bit_error: This output is high during one clock cycle in response to a bit error on the physical data link. Sometimes, a single error on the physical link causes status_bit_error to become high several times. Hence this output should not be used in order to measure the bit error rate (BER). Rather, this output can be used as an indication for whether the link is practically error-free or not: When this output is constantly low, the receiver has synchronized on the arriving data link, and there are no bit errors (all bit errors are detected, even when no application data is transmitted). When the receiver fails to synchronize with the arriving data link (i.e. status_debug[0] is high), this output is held high.
status_rev_polarity: This output is high when the arriving data stream has reversed polarity, i.e. when all bits on the physical link are inverted. This is not an error output, but rather an informative status signal. The Xillyp2p IP cores are indifferent to reversed polarity, so there is no practical advantage to correcting the polarity.
status_debug: This is a 32-bit vector that contains additional information that may be useful when troubleshooting a faulty communication link. These signals are detailed below.
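
The relations between these outputs can be summarized as a small decision function. This is only a sketch of the reasoning outlined above, written in Python for readability: the function name, the argument names and the wording of the hints are assumptions for illustration, and a snapshot of the signals can't tell whether a condition is occasional or persistent.

```python
def diagnose(link_down, partner_mismatch, initializing, debug0):
    """Return a list of hints for a snapshot of the status outputs.

    The arguments are booleans sampled from status_link_down,
    status_link_partner_mismatch, status_initializing and status_debug[0].
    For a unidirectional link, where status_initializing doesn't exist,
    pass initializing=False.
    """
    hints = []
    if partner_mismatch:
        hints.append("IP cores on both sides don't match: "
                     "replace the bitstream on the other side")
    if debug0:
        hints.append("receiver not synchronized: arriving data stream "
                     "is disconnected or severely malformed")
    if initializing and not partner_mismatch and not debug0:
        hints.append("protocol not initialized: if this recurs at high "
                     "data rates, suspect the clock frequency tolerance; "
                     "otherwise check timing constraints, clocks and supplies")
    if link_down and not hints:
        hints.append("link down without a specific cause flagged: "
                     "sample the other status outputs again")
    if not link_down and not hints:
        hints.append("link operational")
    return hints


print(diagnose(link_down=True, partner_mismatch=False,
               initializing=True, debug0=True))
```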

Applications that are based upon anything other than an MGT might need to fine-tune the sampling position of the arriving bit stream. status_link_down and status_bit_error can be used for this purpose, as explained in a separate section below.

Bits in status_debug

Bit 0: When this output is high, the receiver isn't synchronized with the physical link's stream of arriving data. More precisely, when this output is low, it indicates that the receiver's internal scrambler is synchronized with the transmitter's scrambler. It usually takes no more than 2²³ bits from the moment the Xillyp2p IP core is supplied with a valid stream of data until this output becomes low. If the data stream is disconnected, this output usually becomes high about 2²² bits later. If the physical link's receiver adds a bit to the data stream or removes one (an accidental bit slip due to poor clock synchronization, for example), this leads to a brief loss of synchronization that lasts for around 2²³ bits. Note that data can arrive at the user application FIFOs even before this output becomes low.
Bit 2: This bit becomes high during one clock cycle when the Xillyp2p IP core has been requested to retransmit data as a result of a bit error on the physical link.
Bit 3: This bit becomes high during one clock cycle when the Xillyp2p IP core has detected an error in the arriving data, and therefore requests the other side to retransmit data.

Bits 2 and 3 indicate a correct response to a bit error. When these become high occasionally, everything continues to work properly, although the measured data rate may be lower due to retransmissions. When the physical link is unidirectional, they are never high, as retransmissions aren't possible.

Bits [31:4] contain advanced diagnostic information for unexpected scenarios. None of these should ever be high. If that happens nevertheless, it's recommended to check if there is a reason for the FPGA to fail in general (as mentioned above, incorrect timing constraints, timing constraints that weren't achieved, unstable clocks, problems with voltage supplies etc.). The meaning of these bits relates to the details of Xillyp2p's protocol, and is therefore meaningless to the end-user. However, it may be helpful to point out which bit becomes high in requests for support.
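
As an illustration of the bit assignments above, the following Python sketch decodes a captured value of status_debug. The function and the key names are hypothetical. It also assumes that the value was captured by logic that latches the bits, since bits 2 and 3 are high for one clock cycle only. Bit 1 isn't described in this guide, so the sketch ignores it.

```python
def decode_status_debug(value):
    """Split a captured 32-bit status_debug value into named fields.

    The key names are made up for this example. Bit 1 is ignored
    because this guide doesn't describe it.
    """
    value &= 0xFFFFFFFF
    return {
        # Bit 0: receiver's scrambler not synchronized
        "rx_unsynchronized": bool(value & 0x1),
        # Bit 2: this core was requested to retransmit data
        "retransmit_requested_by_partner": bool(value & 0x4),
        # Bit 3: this core detected an error and requested a retransmission
        "retransmit_requested_by_us": bool(value & 0x8),
        # Bits [31:4]: should never be high; list the ones that are
        "unexpected_bits": [b for b in range(4, 32) if (value >> b) & 1],
    }


print(decode_status_debug(0x80000019))
```

The list under "unexpected_bits" is the kind of information that is worth including in a request for support.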

Getting the physical link working

Setting up the physical link is a challenging task for even the most experienced FPGA engineer. It may require some trial and error before this link fulfills its simple task: To transfer a series of parallel words across the physical medium with almost no errors. This is true in particular when an MGT is used.

Xillyp2p's status output ports can supply useful information during the process. It's therefore recommended to connect the physical link to the Xillyp2p IP core for the initial attempt with the FPGA project. If it works right from the start, the advantage is obvious, but even if it doesn't, the IP core's diagnostic signals can offer some guidance.

Even if the link works only partly, Xillyp2p's status signals can be useful. In particular, status_debug[0] shows whether the scrambler has synchronized. This is particularly valuable with a bidirectional physical link, because sometimes there is a problem in only one direction. In this case, status_debug[0] is high on one side, and low on the other. This gives an immediate hint on which direction to work on.

If status_debug[0] is high on both sides, it's possible to gain insight into the situation by transmitting a fixed word. For example, if an MGT with a parallel word of 32 bits is used, assign 32'hff00f055 to the MGT's data input port instead of the Xillyp2p IP core's out_data. Then obtain the value of the data output on the MGT on the other side, and evaluate how well the physical link works.

For obtaining the data, a Xillybus IP core can be used if the board has a PCIe interface, or any other method to capture data from within the FPGA. Alternatively, an integrated logic analyzer (ILA) can be used for this purpose.

Looking at the received parallel word, these are a few typical situations and what they may mean:

If the same parallel word is repeated, check if this is the transmitted word, only with a bit rotation. If this is indeed the case, the link works properly (in the examined direction) and Xillyp2p's status_debug[0] output should be low. This is the case, for example, if the transmitted word is 32'hff00f055 and the received word is 32'h157fc03c, because the latter is the transmitted word rotated right by 10 bits. Recall that the MGT doesn't and can't adjust the word alignment, as its encoding features are turned off.
If the parallel word is repeated consistently, however the word is incorrect:
- Try negating all bits of the received word (XOR with 32'hffffffff, for example). If the result is the transmitted word, rotated by any number of bits, this is OK and Xillyp2p's status_debug[0] output should be low. Recall that Xillyp2p works even if the bit polarity is flipped (for all bits or none, of course).
- Otherwise, search for similarities with the transmitted word. If the received word is partly damaged, there might be some issue with the analog signal that carries the data stream. For example, the termination voltage at the receiver can be incorrect, the termination resistance may be incorrect, or absent where it should be present and vice versa. Also, coupling capacitors may be missing or present where they shouldn't be.
If the received data is repeated a few times, and then another word is repeated, and it goes on like that: This indicates that the receiver's clock recovery mechanism doesn't synchronize with the transmitter's clock. This can be due to problems with the clocks themselves (frequency differences too large or too much clock jitter). Another possibility is that the analog signal isn't received properly, for the same reasons as mentioned above in relation to a repeated word that arrives incorrectly.
If the received data is completely random, odds are that there is no connection at all. Possible causes are incorrect pin placement, an incorrect reference clock, or an incorrect or missing power supply voltage to the MGT circuitry.
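
The first two checks in the list above (a bit rotation, possibly with all bits inverted) are tedious to do by hand. The following Python sketch automates them for a captured word. The function names are assumptions made for this example, and the code runs on a computer with the captured data, not in the FPGA.

```python
def rotate_right(word, shift, width=32):
    """Rotate a width-bit word to the right by shift bits."""
    mask = (1 << width) - 1
    shift %= width
    return ((word >> shift) | (word << (width - shift))) & mask


def match_rotation(sent, received, width=32):
    """Check whether received is a rotated copy of sent.

    Returns (shift, inverted) if received equals sent rotated right by
    shift bits, where inverted tells whether all bits had to be negated
    first. Returns None if no rotation matches.
    """
    mask = (1 << width) - 1
    for inverted in (False, True):
        candidate = received ^ mask if inverted else received
        for shift in range(width):
            if rotate_right(sent, shift, width) == candidate:
                return shift, inverted
    return None


# The example from the text: rotated right by 10 bits, normal polarity
print(match_rotation(0xff00f055, 0x157fc03c))
```

A result of None means that the word is damaged in some other way, so the remaining items in the list apply.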

The constant parallel word 32'hff00f055 was given as an example, but a different word needs to be chosen if the parallel word's width is different (and it might be beneficial to choose a different word regardless). When choosing such a parallel word, it's worth noting these three points:

Ideally, the word is DC-balanced, i.e. has the same number of 0's and 1's. The analog circuitry may not work as well otherwise.
The word should have a few fast changes from 0 to 1 and back. In the example, bits 7:0, i.e. the 0x55 part, serve this purpose. These transitions help the clock-recovery circuitry to lock accurately.
The word should have some kind of randomness to it, in particular if a DFE equalizer is applied. The example with 32'hff00f055 isn't so good in this respect, as it has long sequences of 0's and 1's. These sequences are easier for a human to work with, but may provoke the equalizer to disrupt the analog signal to some extent. It's a tradeoff between convenience and the risk of causing and debugging a problem that exists only because of a poor choice of a testing word.

The three rules above aren't carved in stone, and in particular, it's possible to see the channel working properly even if the last one is ignored.
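
A candidate word can be checked against these points with a few lines of Python. The sketch below is only an aid for comparing candidates: the function name and the reported metrics are assumptions made for this example, and it doesn't define any pass or fail thresholds. Since the word is transmitted repeatedly, the runs and transitions are counted circularly, i.e. the last bit is followed by the first bit.

```python
def word_properties(word, width=32):
    """Report the balance and run-length properties of a test word."""
    bits = [(word >> i) & 1 for i in range(width)]
    ones = sum(bits)
    # Number of 0->1 and 1->0 changes, including the wrap-around
    transitions = sum(bits[i] != bits[(i + 1) % width] for i in range(width))
    if transitions == 0:
        longest = width          # all bits are equal
    else:
        doubled = bits + bits    # a run that wraps around appears intact here
        longest = run = 1
        for prev, cur in zip(doubled, doubled[1:]):
            run = run + 1 if cur == prev else 1
            longest = max(longest, run)
    return {
        "ones": ones,
        "zeros": width - ones,
        "dc_balanced": 2 * ones == width,
        "transitions": transitions,
        "longest_run": longest,
    }


print(word_properties(0xff00f055))
```

For 32'hff00f055 this reports a DC-balanced word (16 ones and 16 zeros) whose longest run is 9 bits, because the eight 1's at the top are adjacent to the 1 at bit 0 when the word repeats.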

Using signals for timing adjustments

When the physical link is implemented with a SERDES, or possibly a single bit that is sampled from an I/O port, it's often necessary for the application logic to find the optimal sampling time point while the system is running. This is usually the case when data is transmitted from one FPGA to another through a PCB trace, and both FPGAs share the same reference clock. In such a scenario, there is no uncertainty regarding the bit clock's frequency, however the optimal phase of this clock, for the purpose of sampling the arriving bit sequence, needs to be found by sweeping through a range of possibilities. The logic performs this by adjusting the sampling clock's phase, and in some cases by adjusting the input pin's delay.

The logic for finding the optimal sampling point is beyond the scope of this guide. However, with the Xillyp2p IP core, there is no need to add logic for checking whether any given sampling point is good enough. Rather, the transmitters on both sides (or on one side for the unidirectional case) can be connected to Xillyp2p throughout the process. The IP core also remains connected on the receiving sides, and it provides an indication of the quality of the sampling point.

This is a suggested procedure for using the IP core's status_debug[0] and status_bit_error for finding the error-free sampling position:

After setting the phase shift or input pin delay, wait for the time that corresponds to the transmission of slightly more than 2²² bits on the physical link. This is the time it takes for status_debug[0] to become high if the physical link is unusable (even though this might happen later if the link's quality is close to the limit). As the point of this step is to ensure that status_debug[0] isn't low because the bad link quality hasn't been detected yet, it's fine to continue to the next step as soon as status_debug[0] is high.
If status_debug[0] is still high at this point, wait for another period of time, corresponding to 2²⁴–2²⁵ bits on the physical link, to give a chance for Xillyp2p's receiver to lock. As soon as status_debug[0] is low, continue to the next step (possibly immediately). However, if this output port remains high, the sampling point is really bad: Xillyp2p isn't able to even synchronize its scrambler.
Monitor status_bit_error for a period of time. If status_bit_error remains continuously low, this indicates that no bit error occurred, which is an indication of a good sampling position. How long to dwell on this step depends on different design considerations. For example, if this step runs for the time corresponding to 2²⁰ bits and no errors were detected, this very roughly corresponds to checking for a bit error rate of 10⁻⁶. But since the previous stage already required 2²⁵ bits, it makes sense to dwell at least the same number of bits on this stage.
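
The three steps above can be sketched as the following control flow. In a real design this would be implemented in logic or in firmware. Here it is modeled in Python, with a hypothetical link object that provides wait_bits(), debug0() and bit_error_seen(). These names, the polling granularity and the exact value chosen for "slightly more than 2²² bits" are all assumptions for illustration. The FakeLink class only exists for trying out the flow and is not a model of the IP core.

```python
def evaluate_position(link, settle_bits=2**22 + 2**16, lock_bits=2**25,
                      test_bits=2**25, poll_bits=2**12):
    """Return True if the current sampling position appears error-free.

    link is a hypothetical object with these methods:
      wait_bits(n):     wait for the time it takes to transfer n bits
      debug0():         current value of status_debug[0]
      bit_error_seen(): True if status_bit_error was high since last call
    """
    # Step 1: let a bad link be detected; continue early if debug0 is high
    waited = 0
    while waited < settle_bits and not link.debug0():
        link.wait_bits(poll_bits)
        waited += poll_bits
    # Step 2: if not synchronized, give the receiver a chance to lock
    waited = 0
    while link.debug0():
        if waited >= lock_bits:
            return False             # the scrambler never synchronized
        link.wait_bits(poll_bits)
        waited += poll_bits
    # Step 3: look for bit errors during the test period
    link.bit_error_seen()            # discard events from before the test
    waited = 0
    while waited < test_bits:
        link.wait_bits(poll_bits)
        waited += poll_bits
        if link.bit_error_seen() or link.debug0():
            return False             # a bit error or loss of synchronization
    return True


class FakeLink:
    """Simulated link, for demonstration purposes only."""
    def __init__(self, locks_after=None, error_at=None):
        self.now = 0                    # number of bits transferred so far
        self.locks_after = locks_after  # debug0 goes low here; None = never
        self.error_at = error_at        # a bit error occurs here; None = never
        self.last_check = 0

    def wait_bits(self, n):
        self.now += n

    def debug0(self):
        return self.locks_after is None or self.now < self.locks_after

    def bit_error_seen(self):
        seen = (self.error_at is not None
                and self.last_check <= self.error_at < self.now)
        self.last_check = self.now
        return seen


print(evaluate_position(FakeLink(locks_after=2**23)))
```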

It's common practice to sweep through a range of phase shifting options, and find a range of positions long enough where no errors are detected. After this sweep, the position in the middle of this range is used for normal operation. Hence it's not so important how long the bit error test lasts.
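
Choosing the middle of the error-free range can be sketched as follows. The function name and the min_len parameter are assumptions made for this example. The sketch treats the sweep as a plain list of positions and doesn't handle a range that wraps around from the last position to the first, which may be relevant when the sweep covers a full clock period.

```python
def best_position(results, min_len=1):
    """Pick the middle of the longest error-free range of a sweep.

    results: sequence of booleans indexed by sweep position, where True
             means that no errors were detected at that position.
    min_len: shortest range that is considered long enough.
    Returns the chosen position index, or None if no range qualifies.
    """
    best_start, best_len = None, 0
    start = None
    # The appended False closes a range that reaches the last position
    for i, ok in enumerate(list(results) + [False]):
        if ok and start is None:
            start = i                           # a new range begins
        elif not ok and start is not None:
            if i - start > best_len:            # longest range so far
                best_start, best_len = start, i - start
            start = None
    if best_start is None or best_len < min_len:
        return None
    return best_start + (best_len - 1) // 2


sweep = [False, True, True, True, True, True, False, True, False]
print(best_position(sweep))
```

In this example, positions 1 to 5 are error-free, so position 3 is chosen.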

In fact, it may be enough to find the middle point of the range where status_debug[0] is low, even though this is probably a less accurate method.

Alternatively, it's also possible to rely on status_bit_error only, since this output port is held high when status_debug[0] is high. In other words, if Xillyp2p fails to synchronize its scrambler, all bits are considered errors. However, relying on status_bit_error alone requires ignoring this signal during 2²⁴–2²⁵ bits after shifting the bit sampling position, in order to ensure that the receiver has had a chance to synchronize. This slows down the procedure, compared with the possibility to begin monitoring status_bit_error as soon as slightly more than 2²² bits after the sampling position shift, given that status_debug[0] is low at that time.
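
To get a feeling for the difference, the waiting periods can be translated into time. The following Python sketch does this for a hypothetical link of 1 Gb/s and a hypothetical sweep of 32 positions. Both figures are assumptions chosen only for the arithmetic, and the sketch counts only the waiting after each shift at positions where the link is good, not the bit error test itself.

```python
def bits_to_seconds(bits, line_rate_bps):
    """Time needed to transfer the given number of bits."""
    return bits / line_rate_bps


def sweep_settle_time(positions, settle_bits, line_rate_bps):
    """Total time spent waiting after the shifts of a whole sweep."""
    return positions * bits_to_seconds(settle_bits, line_rate_bps)


LINE_RATE = 1_000_000_000   # hypothetical: 1 Gb/s
POSITIONS = 32              # hypothetical number of sweep positions

# With status_debug[0]: about 2^22 bits per position (roughly 4.2 ms each)
with_debug0 = sweep_settle_time(POSITIONS, 2**22, LINE_RATE)
# With status_bit_error only: up to 2^25 bits per position (roughly 33.6 ms)
bit_error_only = sweep_settle_time(POSITIONS, 2**25, LINE_RATE)

print(f"using status_debug[0]:   {with_debug0:.3f} s")
print(f"status_bit_error only:   {bit_error_only:.3f} s")
```

Under these assumptions, the waiting is up to eight times longer when relying on status_bit_error alone. Whether this matters depends on how often the sweep is performed.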