Exploring the Arrow SoCKit Part V - Computing MD5 Checksums on the FPGA
In parts I-IV of this tutorial series, we looked at how to create a simple flashing LED module on the FPGA and control it via software. Now, we will see how to use the FPGA to perform computation. To do this, we will take a non-trivial algorithm and implement it in Verilog. I have chosen to use the MD5 checksum algorithm as the example, since it is relatively easy to implement in hardware. Once we have a hardware unit that can compute MD5 sums, we can instantiate multiple instances of it and hook them up to the CPU to perform brute-force MD5 hash reversal.
I’ll be breaking up the development of the MD5 cracker across several blog posts, just as I did with the flashing LED example. This post will focus on implementing the algorithm. Later posts will discuss verification through simulation and control via software.
The Algorithm
Here is the pseudocode for the MD5 algorithm, taken from Wikipedia.
The first decision to be made is what parts of the algorithm will be implemented on the FPGA and what will be done in the control code. In general, computations with simple control flow that are repeated over and over again are a good fit for the FPGA. It’s also difficult to get FPGAs to handle data that can be variable in size. Therefore, we will implement the main loop, which processes a 512-bit chunk of the input, in Verilog. The CPU control code can then take care of appropriately padding the input and feeding it in chunk-by-chunk to the FPGA.
Now that we have decided what to implement, we must figure out how to break the computation up into blocks, which can then be implemented as Verilog separate modules.
First, it’s clear that we will need ROMs for K
and s
, as well as a RAM
for the input M
. We will also need a module to compute F
, a unit to compute
g
, a left rotater, a 32-bit adder, and finally a state machine to make sure
the computatation proceeds in the correct order.
Implementing Memories
The Cyclone V can efficiently implement small memories using on-chip block RAM. In order to generate ROMs and RAMs that use the block RAM, we will use the Altera MegaWizard tool.
To open MegaWizard, go to “Tools” -> “MegaWizard Plug-In Manager”. This will allow you to pick from a list of “megafunctions”, which are configurable hardware IP blocks that can be added to a design. Right now, we are interested in the megafunctions in the “Memory Compiler” section.
FPGA block RAM can have two separate sets of address and data signals.
Since we will be instantiating several many copies of our MD5 unit, it makes
sense to have them share memory as much as possible, so we will use the 2-port
ROM for K
and s
. In the first options page, select “64” for the number of
words in the memory. Then choose the correct word size. This should be 32 for
K
and 5 for s
. For “What should the memory block type be”, choose “M10K”.
You can skip the section on clocks, since the defaults there are fine.
On the third page, it will ask you which ports you want to put registers on.
The registers on the inputs are required, but the registers on the output ports
are optional. Each set of registers adds a cycle of delay, so deselect placing
registers on the outputs so that the ROM has only a single cycle delay.
On the fourth page, it will ask you to provide a file which specifies what
values will be in the ROM. You can choose to provide either a
Memory Initialization File
(.mif) or an Intel hex file
(.hex). The former is more human-readable, while the latter can be more easily
computer-generated. You can find the .mif files for the K ROM
and s ROM
in the Github repository. After you’ve finished memory intialization, you can
click “Finish” to generate the Verilog modules for the memory.
You can follow the same procedure to generate the RAM for M
. Choose the
2-port RAM in the “Memory Compiler” section. Create a RAM with one read port
and one write port. The wizard for RAMs has an extra section asking you what
should happen if you try to read from an address that is being written.
Choose “I do not care” for this. Also, for RAMs, you do not have to specify
a memory initialization file.
Combinational Units
Now that we have our memories, we can look at the combinational logic portions
of our circuit. First, let’s consider the computation for the F
variable.
The F
signal is generated from bitwise combinations of B
, C
, and D
.
However, the particular operation depends on the value of the 6-bit loop
counter i
. You will notice, however, that the only information we need is
whether it is in one of four equally sized ranges. We can determine this by
examining only the top two bits of i
.
The &
, |
, ^
, and ~
operators are bitwise AND, OR, XOR, and NOT
operations, respectively.
To compute g
, we have to mutltiply i
by certain constants, add it to
certain constants, and then take the result modulo 16. We can simplify this
by noting that taking the modulus of a binary integer by 16 is the same as
taking the lowest four bits. We can also simplify our logic based on the
fact that
a * i + b == a * (i % 16) + b (mod 16)
That is, multiplying and adding based on the last four bits of i
and then
taking the last four bits of the result is the same as multiplying and adding
by all six bits of i
.
The final trick we can use is to get rid of multiplication, which is an expensive logic operation. What you will notice is that the constants we are multiplying by (3, 5, and 7) are all one away from a power of two (2, 4, and 8). This is very convenient, since we can easily multiply by powers of two by left shifting bits. We can then add or subtract the original number from the shifted number to get the final result of the “multiplication”. Therefore, we can express our multiplications using the following equivalents.
3 * i = 2 * i + i = i << 1 + i
5 * i = 4 * i + i = i << 2 + i
7 * i = 8 * i - i = i << 3 - i
The amount by which we shift i
and the number we add to it will be selected
based on the two highest bits of i
, just as we did with the F
calculator.
Then there is the left rotater, which is fairly straightforward. To express a left rotation in verilog, you can double the input, left shift it, and then take the top half of the bits.
For the adder, we will simply use the verilog +
operator. Now that we have
our combinational units, we will need to tie them together to perform the
inner loop of the computation.
Sequential Computation
Stateless combinational logic is not enough to perform the computation we want. Therefore, we will need a state machine to store intermediate results in registers and select the correct inputs for the combinational units. In the body of the inner loop of the MD5 algorithm, there are a lot of 32-bit additions. However, we would like to reduce the number of 32-bit adders we use so as not to use too many logic elements. Fortunately, through clever sequencing, there is a way to do the computation in just four cycles using only one 32-bit adder.
Step 0
calculate F
calculate g
t0 = A + K[i]
Step 1
t1 = F + M[g]
Step 2
t0 = t0 + t1
Step 3
A = D
B = B + leftrotate(t0, s[i])
C = B
D = C
i = i + 1
These four steps will have to be repeated 64 times, with i incrementing each time. Then, at the end, we will need four more cycles to add A, B, C, and D onto a0, b0, c0, and d0.
This chunk_cruncher
module performs the computation in the inner loop of
the MD5 algorithm. It has address and data ports for the ROMs and RAMs,
as well as reset and start signals. The reset signal will tell the unit to
begin a new MD5 calculation, and the start signal will tell the unit to process
a new 512-bit chunk.
The first always
block is combinational and simply multiplexes the inputs
of the 32-bit adder based on the current step and stage. The second always
block is our state machine.
Note that, oddly enough, we do i = i + 1
in step 2 instead of step 3.
This is because there are registers on the address inputs of our ROMs.
We change the value of i
in step 2. In step 3, the value of s
will be based
on the old value of i
. Then, when we wrap around back to step 0, the value
of k
will be based on the new value of i
.
Putting Memory and Computation Together
Now that we have our memory and computational blocks, we’ll have to put them together in order to make them do something useful. Since our ROMs are dual ported, we’ll put together two “chunk_cruncher” modules, an sdata ROM, a kdata ROM, and two mdata RAMs.
Conclusion
So now we’ve built are MD5 unit. But how do we make sure it works? In my next post, I’ll discuss verification using the ModelSim circuit simulator.