Electronic – Avalon-ST PCIe root port in an FPGA

fpga linux pcie

A buddy of mine and I have a project in which we have to implement a PCI Express rootport. I haven't read the entire specification, nor do I have the Mindshare book that everyone recommends, but I think I have a reasonable grasp on the subject.

We're using an Arria 10 FPGA and we're running Quartus 18.1 on a Linux PC.

We'd like to start off quite simple. Something like this example design: https://fpgawiki.intel.com/uploads/e/e9/Basic_rootport_demo.pdf

This is my understanding of how it works. Correct me if I'm wrong:

The configuration space resides inside the PCIe IP. Memory space and optionally IO space reside inside the computer, which is connected via JTAG. First the rootport maps the bus structure. It does this by first sending a cfg0 write, which sets the primary bus number (root port IP), secondary bus number (any device connected to the IP) and subordinate bus number (highest bus number that's directly or indirectly connected to the IP). These numbers are used for TLPs that are routed by ID (like any cfg TLP).

It then sends a cfg1 write to the device ID that's given by the secondary bus number. The device number and function number can be assumed to be zero.

This brings me to my first question: How does one know if a device has multiple functions?

This cfg1 write TLP writes the PCIe base address of BAR0 into the configuration space of the endpoint. Any address-routed TLP that has the same base address, but might have a different offset, gets routed to BAR0.

Wouldn't the rootport first write all 1's to this entry of the configuration space and read it back to determine the size it needs to allocate for this BAR?

The rootport then writes to the command register in the configuration space of the endpoint, to configure the endpoint to enable the capabilities we want. In this case it enables the memory space enable bit and the bus master bit.

Setting the base address of the BAR and the command register could have been done in a single transaction right?

The rootport then writes the base address of its BAR0 into its own configuration space, after which it also writes to the command register of its own configuration space also enabling the memory space and the bus master bit.

Again, this could have been done in one transaction, right?

After this, the rootport configures the endpoint's BARs. Again, it would normally do this by writing all 1s and reading back the value to determine the size it needs to allocate for each address range, right?

All in all it seems like a great example design. However, since it's been created in an older version of Quartus, I can't build it. I was wondering if you knew what Avalon-ST IPs I could use that are in the current version of Quartus to rebuild this design?

How exactly is rx_st_bar used? It seems the PCIe IP analyzes the TLPs and sets a bit based on where it needs to go. For an endpoint it seems pretty straightforward since the TLPs can only be addressed to either one of the BARs, an expansion ROM or the rest of a device's memory space.

For a rootport it seems a little more complicated however, since there are also bits for the primary, secondary and subordinate bus number window as well as an IO, non-prefetchable and prefetchable window. It is my understanding that these last 3 windows are used to map all PCIe addresses inside the rootport. I wonder how exactly this is done, since it seems received TLPs are involved. Those bus number windows are also vague to me. Or are these used for PCIe switches to determine if a TLP is meant for the rootport or a device on another PCIe link?

I think these are all the questions I have for now. Your help is greatly appreciated.

Best Answer

The configuration space resides inside the PCIe IP. Memory space and optionally IO space reside inside the computer, which is connected via JTAG.

If you're building a root port, then there usually isn't anything in the IP. It's mainly just going to pass requests through from one side to the other. How configuration space is accessed will be vendor-dependent: it looks like Altera uses config type 0 to access the core config space directly, and then the core will convert type 1 to type 0 to access the directly connected device when the target bus matches its secondary bus number, while Xilinx uses a totally separate interface to access configuration space and passes config type 0 requests directly through to the downstream device. Many of the config space registers don't really affect the operation of the downward-facing ports, anyway. You'll have to handle all IO and memory requests somehow (presumably against off-FPGA DRAM or similar). You'll also have to provide some method for higher-level software to access the configuration space of downstream devices and configure them appropriately, as well as provide a bridge from the CPU address space to PCI Express operations so that system software can perform reads and writes on PCIe devices. JTAG will not be involved at any point.

Edit: looking at the presentation you linked, it looks like they're simply driving TLPs into the cores via JTAG. Fair enough, looks like they're basically just going to emulate the whole host side in software.

First the rootport maps the bus structure. It does this by first sending a cfg0 write, which sets the primary bus number (root port IP), secondary bus number (any device connected to the IP) and subordinate bus number (highest bus number that's directly or indirectly connected to the IP). These numbers are used for TLPs that are routed by ID (like any cfg TLP).

That's more or less right, except the root port will send either type 0 or type 1 configuration requests depending on where the target device is located within the PCIe topology. Switches convert type 1 to type 0 for directly connected devices, and devices that are not switches ignore type 1 configuration requests. So the control software will first send configuration type 0 reads through the root port to figure out what's connected. If it's a switch, then more config type 0 packets will be sent to set up the switch registers. Primary bus will be set to the next available bus number (probably 1), secondary to that plus 1, and subordinate to 255 (you don't know how many buses are below the switch yet). Then the control software will send configuration type 1 packets to the switch downstream ports one at a time and recursively enumerate and configure the switches and devices it finds.
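To make the depth-first numbering concrete, here is a minimal Python simulation. The `Bridge` class, the `enumerate_bridge` function, and the example topology are all hypothetical, and nothing here talks to hardware. It only shows how primary, secondary, and subordinate numbers could fall out of the recursion, including the temporary subordinate value of 255.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bridge:
    """Hypothetical bridge (root port or switch port) with bridges below it."""
    name: str
    children: List["Bridge"] = field(default_factory=list)
    primary: int = 0
    secondary: int = 0
    subordinate: int = 0


def enumerate_bridge(bridge: Bridge, primary: int, next_bus: int) -> int:
    """Assign bus numbers depth-first and return the next unused bus number."""
    bridge.primary = primary
    bridge.secondary = next_bus
    # Open the window fully while scanning, since the depth is not known yet.
    bridge.subordinate = 0xFF
    next_free = next_bus + 1
    for child in bridge.children:
        next_free = enumerate_bridge(child, bridge.secondary, next_free)
    # Shrink the window to the highest bus number actually assigned below.
    bridge.subordinate = next_free - 1
    return next_free


# Assumed topology: root port -> switch upstream port -> two downstream ports.
root = Bridge("root", [Bridge("up", [Bridge("dn0"), Bridge("dn1")])])
enumerate_bridge(root, primary=0, next_bus=1)
for b in (root, root.children[0], *root.children[0].children):
    print(b.name, b.primary, b.secondary, b.subordinate)
```

In real enumeration software each assignment would be a config write to the bridge's bus number registers, and the list of children would come from probing rather than from a prebuilt tree.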

It then sends a cfg1 write to the device ID that's given by the secondary bus number. The device number and function number can be assumed to be zero.

It will probably start with a read of the vendor ID of that device, but yes.
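As a rough illustration of that first probe, the sketch below packs a bus/device/function triple into the 16-bit ID used for ID routing and checks whether the first config DWORD indicates a device. The function names are made up for this example. The only facts relied on are the 8/5/3-bit ID layout and the convention that a vendor ID of 0xFFFF means nothing responded.

```python
def make_bdf(bus: int, device: int, function: int) -> int:
    """Pack bus (8 bits), device (5 bits), function (3 bits) into a 16-bit ID."""
    if not (0 <= bus <= 0xFF and 0 <= device <= 0x1F and 0 <= function <= 0x7):
        raise ValueError("bus/device/function out of range")
    return (bus << 8) | (device << 3) | function


def device_present(dword0: int) -> bool:
    """dword0 is config offset 0x00: Vendor ID in bits 15:0, Device ID in 31:16."""
    return (dword0 & 0xFFFF) != 0xFFFF


# Device 0, function 0 on the secondary bus (assumed here to be bus 1).
print(hex(make_bdf(1, 0, 0)))
# A read that comes back as all 1s means no device answered.
print(device_present(0xFFFFFFFF))
```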

This brings me to my first question: How does one know if a device has multiple functions?

Configuration software accesses the function 0 configuration space and reads the header type field. If bit 7 is clear, there is only one function. If it is set, then configuration software will attempt to read from all possible functions. These reads will fail if the functions are not present.
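A small sketch of that check, assuming the software has already read the DWORD at config offset 0x0C of function 0 (the Header Type byte sits at offset 0x0E, i.e. bits 23:16 of that DWORD). The function names are illustrative only.

```python
def functions_to_probe(dword_0x0c: int) -> range:
    """Return the function numbers worth probing, based on Header Type bit 7."""
    header_type = (dword_0x0c >> 16) & 0xFF
    multi_function = bool(header_type & 0x80)
    return range(8) if multi_function else range(1)


def header_layout(dword_0x0c: int) -> int:
    """Return Header Type bits 6:0 (0 = endpoint layout, 1 = bridge layout)."""
    return (dword_0x0c >> 16) & 0x7F


print(list(functions_to_probe(0x00800000)))  # bit 7 set: probe functions 0..7
print(list(functions_to_probe(0x00000000)))  # bit 7 clear: function 0 only
```

Each probed function would then get its own vendor ID read, and the ones that come back as all 1s are treated as absent.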

This cfg1 write TLP writes the PCIe base address of BAR0 into the configuration space of the endpoint. Any address-routed TLP that has the same base address, but might have a different offset, gets routed to BAR0.

The configuration of the switches determines the routing. The BAR settings simply allow the device to figure out which device, function, and BAR that a read or write is targeting. Configuration software must make sure that the assigned BARs do not overlap.
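One simple way to guarantee non-overlapping BARs is to hand out naturally aligned regions from a single running pointer, largest first. This is only a sketch of one possible allocator, not how any particular OS or the example design does it. `allocate_bars` is a hypothetical name, and sizes are assumed to be powers of two, as BAR sizes are.

```python
from typing import List


def allocate_bars(sizes: List[int], start: int) -> List[int]:
    """Assign each BAR a naturally aligned base; bases are returned in input order."""
    bases = [0] * len(sizes)
    addr = start
    # Placing the largest BARs first keeps alignment padding to a minimum.
    for i in sorted(range(len(sizes)), key=lambda i: -sizes[i]):
        size = sizes[i]
        addr = (addr + size - 1) & ~(size - 1)  # round up to a multiple of size
        bases[i] = addr
        addr += size
    return bases


# Assumed example: a 4 KiB BAR and a 1 MiB BAR, allocated from 0x80000000.
print([hex(b) for b in allocate_bars([0x1000, 0x100000], 0x80000000)])
```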

Wouldn't the rootport first write all 1's to this entry of the configuration space and read it back to determine the size it needs to allocate for this BAR?

Exactly.
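For reference, the arithmetic behind that sizing step looks roughly like this for a 32-bit memory BAR. `mem_bar_size` is an illustrative name, and 64-bit and I/O BARs are deliberately left out of this sketch.

```python
def mem_bar_size(readback: int) -> int:
    """Size in bytes of a 32-bit memory BAR, given the value read back after
    writing 0xFFFFFFFF to it. Returns 0 if the BAR is not implemented."""
    if readback & 0x1:
        raise ValueError("I/O BAR: not handled in this sketch")
    # Bits 3:0 are type/prefetchable flags, not address bits.
    base_bits = readback & 0xFFFFFFF0
    if base_bits == 0:
        return 0
    # The hardwired-zero low address bits give the size: invert and add one.
    return (~base_bits + 1) & 0xFFFFFFFF


print(hex(mem_bar_size(0xFFF00000)))  # 1 MiB, non-prefetchable
print(hex(mem_bar_size(0xFFFFC008)))  # 16 KiB, prefetchable
```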

The rootport then writes to the command register in the configuration space of the endpoint, to configure the endpoint to enable the capabilities we want. In this case it enables the memory space enable bit and the bus master bit.

That's right.

Setting the base address of the BAR and the command register could have been done in a single transaction right?

No, because configuration reads and writes can only operate on one DWORD at a time, so you would need at bare minimum one config write TLP for each BAR and one for the command register.
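To illustrate, here is a sketch that lists the config writes needed for a type 0 header: one per BAR (starting at offset 0x10) plus one for the command register at offset 0x04. The tuple format and the name `config_writes` are assumptions for the example. The register offsets and the command bit positions are the standard ones.

```python
from typing import List, Tuple

BAR0_OFFSET = 0x10          # BAR0..BAR5 live at 0x10..0x24 in a type 0 header
COMMAND_OFFSET = 0x04       # Command is the low 16 bits of this DWORD
CMD_MEMORY_SPACE = 1 << 1   # Memory Space Enable
CMD_BUS_MASTER = 1 << 2     # Bus Master Enable


def config_writes(bar_bases: List[int]) -> List[Tuple[int, int, int]]:
    """Return (register offset, data, first-DW byte enables), one per config write TLP."""
    writes = [(BAR0_OFFSET + 4 * i, base & 0xFFFFFFFF, 0xF)
              for i, base in enumerate(bar_bases)]
    # Byte enables 0b0011 touch only the Command half, leaving Status alone.
    writes.append((COMMAND_OFFSET, CMD_MEMORY_SPACE | CMD_BUS_MASTER, 0x3))
    return writes


for offset, data, be in config_writes([0x80000000]):
    print(hex(offset), hex(data), bin(be))
```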

The rootport then writes the base address of its BAR0 into its own configuration space, after which it also writes to the command register of its own configuration space also enabling the memory space and the bus master bit.

Root ports and switch ports don't usually implement BARs. However, you will have to set the base and limit registers to cover all allocated downstream address space so that address-routed TLPs will be routed correctly. Note that this means you basically have to allocate addresses in one shot in depth-first order as you can't just insert a big block somewhere after the fact without reallocating all of the subsequent addresses. It is possible to reserve addresses, bus numbers, etc. if you know beforehand what you might need for, say, hotplugging a device.
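As a sketch of what setting the base and limit registers involves for the non-prefetchable memory window: in a type 1 header, Memory Base (offset 0x20) and Memory Limit (offset 0x22) each hold address bits 31:20 in their bits 15:4, so the window has 1 MiB granularity. The function name below is hypothetical. It just computes the DWORD that would be written at offset 0x20.

```python
def memory_window_dword(base: int, limit: int) -> int:
    """Encode a 32-bit memory window as the DWORD at type 1 config offset 0x20.

    base  -- first address of the window, 1 MiB aligned
    limit -- last address of the window (inclusive), ending in 0xFFFFF
    """
    if base & 0xFFFFF or (limit & 0xFFFFF) != 0xFFFFF:
        raise ValueError("window must be 1 MiB aligned")
    if not (0 <= base <= limit <= 0xFFFFFFFF):
        raise ValueError("window must be a non-empty 32-bit range")
    base_reg = (base >> 16) & 0xFFF0    # address bits 31:20 in bits 15:4
    limit_reg = (limit >> 16) & 0xFFF0
    return (limit_reg << 16) | base_reg


# Assumed example: a 2 MiB window covering everything allocated downstream.
print(hex(memory_window_dword(0x80000000, 0x801FFFFF)))
```

The prefetchable and I/O windows use separate register pairs with their own encodings, which this sketch does not cover.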

Again, this could have been done in one transaction, right?

Again, no, because config reads and writes are DWORD sized.

After this, the rootport configures the endpoint's BARs. Again, it would normally do this by writing all 1s and reading back the value to determine the size it needs to allocate for each address range, right?

Right.

All in all it seems like a great example design. However, since it's been created in an older version of Quartus, I can't build it. I was wondering if you knew what Avalon-ST IPs I could use that are in the current version of Quartus to rebuild this design?

Well, you could always download the correct version of Quartus. Unfortunately, I'm not super familiar with Altera IP cores. There is probably only one PCIe IP core for your device, so check the documentation.

How exactly is rx_st_bar used? It seems the PCIe IP analyzes the TLPs and sets a bit based on where it needs to go. For an endpoint it seems pretty straightforward since the TLPs can only be addressed to either one of the BARs, an expansion ROM or the rest of a device's memory space.

That's correct.

For a rootport it seems a little more complicated however, since there are also bits for the primary, secondary and subordinate bus number window as well as an IO, non-prefetchable and prefetchable window. It is my understanding that these last 3 windows are used to map all PCIe addresses inside the rootport. I wonder how exactly this is done, since it seems received TLPs are involved. Those bus number windows are also vague to me. Or are these used for PCIe switches to determine if a TLP is meant for the rootport or a device on another PCIe link?

The root port generally doesn't have any BARs, so rx_st_bar generally won't be used. It doesn't apply to TLPs that are transiting the root port, anyway. The base and limit registers and bus number registers are used to determine how to route TLPs. A PCIe-PCIe bridge (for instance, a switch port) that receives a TLP will look at those settings to determine whether to route the packet from upstream to downstream or from downstream to upstream. Inside of a PCIe switch, there will be a single upstream port and multiple downstream ports, connected together with a "bus" that gets assigned a bus number. TLPs coming in to a downstream port can be headed either to the root port or back out of a different downstream port, and the base/limit and bus number settings are used to make this determination.
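A toy model of that last decision, for an address-routed TLP arriving at one downstream port of a switch. The port names, the `windows` dictionary, and `pick_egress` are all made up for the illustration, and features such as access control are ignored. Each downstream port is represented only by its memory base/limit window.

```python
from typing import Dict, Tuple


def pick_egress(addr: int, ingress: str,
                windows: Dict[str, Tuple[int, int]]) -> str:
    """Choose where a TLP entering at downstream port `ingress` goes next.

    windows maps each downstream port name to its (base, limit) window,
    with limit inclusive.
    """
    for port, (base, limit) in windows.items():
        if port != ingress and base <= addr <= limit:
            return port          # peer-to-peer: out of a sibling downstream port
    return "upstream"            # no sibling claims it: head toward the root port


windows = {
    "dn0": (0x80000000, 0x800FFFFF),
    "dn1": (0x80100000, 0x801FFFFF),
}
print(pick_egress(0x80100010, "dn0", windows))  # lands in dn1's window
print(pick_egress(0x10000000, "dn0", windows))  # outside every window
```

ID-routed TLPs follow the same pattern, with the secondary and subordinate bus numbers taking the place of base and limit.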