Electrical – Finding if a value is among an array using an FPGA

algorithm, fpga

I've been doing a lot of software engineering, and I find myself intrigued by hardware implementations to the point that I'd like to port one of my recent algorithms to an FPGA.
I'm trying to teach myself FPGA design, which seems easier than it used to be, but I'm still in the dark about the usual patterns and solutions. So here's my problem:

Say I put many (~100) "cores" on an FPGA that each compute possible matches. Each possible match is a 64-bit value, and let's assume that computing a possible match is expensive (~35k clock cycles).

Now, once the possible matches have been computed, I want to check whether any of them appears in a precomputed list of roughly 15k 64-bit values. The probability that a given possible match is actually in the list is very small (~1e-8).

What would/should the design be for checking whether the possible matches appear in my original list? I can assume that the list is fixed (and can be pre-sorted).

In software, I've been using a binary search, but I'm wondering about RAM access in that case. If I have 100 cores each producing one possible match every ~35k clock cycles, then (assuming some kind of buffer is available) a single binary-search component would keep up as long as one lookup takes less than 350 clock cycles; with 14 comparisons for a binary search over 15k elements, that looks just about doable.
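For reference, a minimal software sketch (Python) of that lookup and the cycle-budget arithmetic; the name is_match and the constants are only illustrative, not part of any real design:

    import bisect

    def is_match(sorted_list, candidate):
        # Binary search: ~14 comparisons suffice for ~15k entries (2**14 = 16384).
        i = bisect.bisect_left(sorted_list, candidate)
        return i < len(sorted_list) and sorted_list[i] == candidate

    # Rough throughput budget: 100 cores, one candidate each per ~35k cycles,
    # so a single shared search unit must finish one lookup in ~350 cycles on average.
    CORES, CYCLES_PER_CANDIDATE = 100, 35_000
    BUDGET_PER_LOOKUP = CYCLES_PER_CANDIDATE // CORES   # 350 cycles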

So, any ideas? Is a binary search the way to go, and should it be modified (a heap-style layout to help with caching of RAM lines, or a k-ary search)? Should I be looking at hash maps or Bloom filters for faster discarding? Is there any "standard" way to do this kind of thing?

Best Answer

Well, to store 15k 64-bit values you need a reasonable number (about 64) of BlockRams, which you can access in parallel. They typically have dual ports, so you can potentially perform 128 comparisons per cycle. Each BlockRam (16 kbits) stores 256 64-bit values and can expose 2 values per cycle, so even a linear search would take only 128 cycles.
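A quick back-of-the-envelope check of those numbers in Python, assuming 16 kbit BlockRams (exact BlockRam sizes vary by device family, so treat these as rough figures):

    LIST_SIZE       = 15_000                           # 64-bit entries
    BRAM_BITS       = 16 * 1024                        # one BlockRam, 16 kbit assumed
    VALUES_PER_RAM  = BRAM_BITS // 64                  # 256 values per BlockRam
    RAMS_NEEDED     = -(-LIST_SIZE // VALUES_PER_RAM)  # ceiling division -> 59, round up to ~64
    COMPARES_PER_CYCLE   = 2 * 64                      # dual ported -> ~128 comparisons per cycle
    LINEAR_SEARCH_CYCLES = VALUES_PER_RAM // 2         # 128 cycles to scan one BlockRam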

You might dedicate two entries in each BlockRam to every 127th value from the list and perform a binning comparison first: if your candidate falls between boundary values 17 and 18, address location 17 within every BlockRam, and one further operation will test that entire bin. (The spacing is every 127th rather than every 128th value because one location out of the 128 in each BlockRam is given up to the bin search.)

Consider a list index in the range 0 to 16255 (128 × 127 slots, enough to hold the ~15k-entry list with some padding) written as N * 127 + M (a quick sketch of this split follows below), where:
N ranges from 0 to 127 (128 bins), and
M ranges from 0 to 126 (127 values within each bin).
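In software terms the split is just integer division and remainder; a tiny sketch (the helper names split_index/join_index are made up for illustration):

    def split_index(i):
        # i in 0..16255: N selects one of 128 bins, M is the offset within that bin.
        n, m = divmod(i, 127)
        return n, m                     # N in 0..127, M in 0..126

    def join_index(n, m):
        return n * 127 + m              # inverse mapping back to the list index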

The first comparison (the bin comparison) accesses 128 values, stored at location 127 (and 255 on the second port) within each BlockRam. These are list entries N*127 + 0 for all N in 0 to 127.
For some N, boundary(N) <= match < boundary(N+1); thus bin N is the bin to search further. Note that this step requires a relational (less-than-or-equal) comparison, not just equality.

Now we address location N in every BlockRam and perform an equality comparison.
The Mth BlockRam contains, at its Nth address, list entry N*127 + M, for M from 0 to 126.
NB: this means the last BlockRam's data entries go unused and we only perform 127 equality comparisons. If some M matches, compute the index N*127 + M; otherwise signal failure.
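Below is a rough behavioral model of the two-phase lookup in Python (not HDL, just a way to check the addressing). It assumes the interleaved memory layout shown in the table further down, with rams[b][a] holding address a of the b-th (logical) BlockRam and unused list slots padded with a sentinel larger than any real entry:

    def lookup(rams, candidate):
        # rams[b][a]   = List(a*127 + b) for a in 0..126   (data)
        # rams[b][127] = List(b*127)                       (bin comparison value)
        # Phase 1: bin comparison - read address 127 of every RAM "in parallel"
        # and keep the last bin whose boundary value is <= candidate.
        n = 0
        for b in range(128):
            if rams[b][127] <= candidate:
                n = b
        # Phase 2: equality comparison - read address n of every RAM "in parallel".
        for m in range(127):                  # last RAM unused: 127 equality compares
            if rams[m][n] == candidate:
                return n * 127 + m            # index of the match in the original list
        return None                           # candidate is not in the list

In hardware each of the two loops collapses into one wide parallel compare per phase; the sequential loops here only model that behavior.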

This takes a little more than 2 cycles, thanks to the indirection and to BlockRams being synchronous memories, but it is probably fast enough for your purpose. I doubt a GPU can match roughly 3 cycles...

I'm assuming it's cheap enough to pre-sort and interleave the list, probably in software on the host, before loading it into the BlockRams.

So the data is interleaved such that the Nth BlockRam contains

Address    Data
0          List(0*127 + N)
1          List(1*127 + N)
2          List(2*127 + N)
...        ...
126        List(126*127 + N)
127        List(N*127 + 0)    (the bin comparison value)
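A sketch of that host-side step in Python: build the per-BlockRam memory images from the sorted list, padding unused slots with an all-ones sentinel (this assumes 2**64 - 1 can never be a legitimate list entry):

    SENTINEL = 2**64 - 1                      # assumed never to occur in the real list

    def interleave(sorted_list):
        lst = sorted_list + [SENTINEL] * (128 * 127 - len(sorted_list))
        rams = [[SENTINEL] * 128 for _ in range(128)]
        for b in range(128):                  # logical BlockRam number N in the table
            for a in range(127):              # data addresses 0..126
                rams[b][a] = lst[a * 127 + b]
            rams[b][127] = lst[b * 127]       # bin comparison value
        return rams

One way to sanity-check the whole scheme against the lookup model above:

    import random
    values = sorted(random.sample(range(2**64 - 1), 15_000))
    rams = interleave(values)
    assert all(lookup(rams, v) == i for i, v in enumerate(values))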