aka programmable hardware, (re)configurable logic, etc

Hardware

Hardware description languages (HDL)

Subject: Repost: Performance Benchmarks: Emulating FPGAs Using
         General Purpose Processors
Date: 09 Feb 1996 00:00:00 GMT
newsgroups: comp.arch.fpga

Reposted after Netcom corrupted the first copy.

In <1996Feb9.082016.20892@super.org> sc-@vcc.com (Steve Casselman)
writes: 
>>Another thread that would be good is just for some benchmark
>>data on what people are doing now. For example if someone
>>is doing neural nets you might post:
>>
>>I did a 12 neuron neural net that did 12 Billion connections/sec
>>in 3000 gates and a ALPHA 330MegHz can do them @ 300 Million  
>>connections/sec.  (no I did not do this it is just an example 
>>of the detail of the proposed post)

A while back I was wondering about this in general: "how much better
are FPGAs at executing, in hardware, an arbitrary computation, than are
modern general purpose processors at executing, in software, the same
arbitrary computation?".  And so I wrote the following loose analysis
which might be interesting to discuss:


Emulating FPGAs using general purpose processors
Jan Gray, 1995

It is well known that FPGAs can implement general purpose computers,
and general purpose computers can emulate FPGAs.  In this message I
consider how efficiently a general purpose processor can emulate an
arbitrary FPGA circuit.  It arises from my curiosity regarding what
performance advantage (if any) the custom reconfigurable computing
crowd has over general purpose processors, to quantify, best case, how
much faster an FPGA is at arbitrary FPGA problems.


Setting aside such modern features as embedded SRAM, 3-state buses,
asynchronous flip-flop set/reset, and even flip-flop clock enables, we
model a typical lookup-table-based FPGA logic element as an arbitrary
4-input logic function driving either another stage of logic or a
flip-flop.  Without fudging too much we can then consider that all
multilevel logic functions are pipelined with a register between each
logic function.  This simplifies my equivalence construction below and
allows us to compare against an FPGA clock rate based upon the sum of
flip-flop clock-to-output delay, delay through some interconnect, one
logic function delay, and the flip-flop setup time.

Thus, we model an FPGA as a vector of n 4-input function generators
F[i] driving n flip-flops R[i] which are in turn fed back into the
function generator inputs. In practice most FPGAs are incapable of
modeling arbitrary interconnect schemes between logic elements, instead
greatly favouring local interconnect between nearby logic elements. 
However, we’ll be overly generous and permit any function generator
F[i] to compute any function of any 4 flip-flop outputs:
	R[i]’ = F[i](R[a[i]], R[b[i]], R[c[i]], R[d[i]]);
for i, a[i], b[i], c[i], d[i] all in [0..n).

Note that a, b, c, d, are simple mappings which describe which
flip-flop outputs are inputs to each given function generator.  For
instance, if R[0]’ = R[2] xor R[3] xor R[5] xor R[7], we would have
a[0]=2, b[0]=3, c[0]=5, d[0]=7.

On a general purpose processor with word width w, we can implement the
flip-flop vector by partitioning it into w bit registers.  To simplify
the presentation, assume that n == w.  Then a simulation of one
clocking of the FPGA is to form A, B, C, D, each a mapping of R such
that
	A[i] == R[a[i]]
	B[i] == R[b[i]]
	C[i] == R[c[i]]
	D[i] == R[d[i]]
and finally compute
	R’[i] = F[i](A[i], B[i], C[i], D[i])
word-wise, over all the bit positions i simultaneously.

The first sub-problem is to compute 4 mappings of bits of R into A, B,
C, D efficiently.  One approach is to perform wordwise bit shifting,
xor, and mask operations upon R.  For instance, R can be bit-reversed
in 5 lg w simple RISC instructions, arbitrarily permuted in 10 lg w
instructions, and arbitrary mapped (any bit input to any bit output,
including arbitrary bit replication) in roughly 20 lg w instructions
employing 4 lg w constants.  (Knuth developed this by analogy from
exchange networks).  Another approach is to build w/8 256-entry lookup
tables, for each byte of bits in R, and "or" the contributions
together:
    word R, LUA[w/8][256], A = 0;
    for (i = 0; i < w/8; i++)
	A |= LUA[i][(R>>(i*8)) % 256];
Written in-line this would require about w/8 shifts, masks, loads and
"or"s.  Let us assume the tables fit into d-cache and charge overall
4*w/8 == w/2 instructions to arbitrarily map R into A.

For example, for w==64, the wordwise bit mapping approach would require
approximately 120 instructions, whereas the table lookup approach would
require approximately 32 instructions (including 8 loads).  Thus the
four mappings A, B, C, D require about 4*32 == 128 instructions,
including 32 loads and about 8 KB of lookup tables.

The next sub-problem is to compute F(A,B,C,D).  To perform this in
parallel across all n bits, it suffices to generate the 16 minterms
M[0]=~A&~B&~C&~D, M[1]=~A&~B&~C&D, ..., M[15]=A&B&C&D and then combine
them under 16 masks N[0..15]:
	R = M[0]&N[0] | M[1]&N[1] | ... | M[15]&N[15];
By reusing common subexpressions this can be done in about 60
instructions.

All totaled, on a w-bit processor, we need about 4*max(w/2, 20 lg w) +
60 instructions to simulate one clocking of an arbitrary w-bit wide
FPGA, as modeled.  Also of interest, the state required to describe the
arbitrary w-bit FPGA of 4-input logic elements, is only about 16 + 4 lg
w words.


Let’s try out this model against some real hardware!  A lowly XC4002A-5
has 8x8 CLBs each with 2 function generators and two flip-flops, and
with a clock to flip-flop output delay of 3 ns, an "interconnect delay"
which I'll empirically and arbitrarily state is 17 ns, and a combined
function generator plus flip-flop setup time of 5 ns, all totaled,
let’s say 25 ns.  In summary, we won’t be far wrong (?) to say an ‘02A
is a "128-bit FPGA" which you can clock at 40 MHz.  In comparison a
32x32 CLB XC4025-5 would be a "2048-bit FPGA".

In the general purpose processor corner, let’s employ an Alpha 21164A. 
I recall this machine can issue 4 instructions per clock at 300 MHz. 
Since w==64, it would take an Alpha about 200 instructions or about 50
issue slots (167 ns) to emulate one clocking of an arbitrary 64-bit
FPGA.  Thus our Alpha might emulate half an ‘02A at a rate of about 6
MHz.

For another example, the BiCMOS MicroUnity media processor
implementation can apparently perform 128-bit operations at 1000 MHz. 
Here with w==128, it could emulate an arbitrary 128-bit FPGA at about 4
MHz, perhaps faster if we can take advantage of some of its special bit
shuffle instructions.


Well!  These results surprised me.   Even executing a billion
operations per second on 64- or 128-bit registers, these general
purpose processors couldn’t achieve 10% of the speed*gates figure of a
small FPGA.  Even if you consider my "12 ns" FPGA interconnect delay
assumption far too generous given the arbitrary FPGA’s arbitrary
interconnection topology, and derate it to 50 ns or more, the little
FPGA is still several times faster at FPGA type work than a general
purpose microprocessor.


(I also wonder if this presentation doesn’t suggest a slow, yet
compact, physical FPGA implementation.  Attach a vector of flip-flops
to the input and output sides of a "4-context" n-stage
shuffle/copy/exchange interconnect, set up to route its input values
according to 4 different routing configurations.  The interconnect can
be pipelined, and run at high speed, to map the flip-flop outputs into
4 registered function generator inputs and every 8 or 9 clocks the
flip-flop itself could be clocked.  Overall the pipeline could run at
200+ MHz and the overall FPGA at 25 MHz.)

Jan Gray
  • programmable_hardware.txt
  • Last modified: 2007-12-30 12:57
  • by nik