Exploring the Arrow SoCKit Part VII - Software Control for the FPGA MD5 Cracker
So now, finally, it’s time to connect our MD5 units to the HPS so that we can
get them to do something useful.
Grouping Several MD5 Units
First, we’ll want to instantiate multiple MD5 units and multiplex their
inputs and outputs. We’ll instantiate 16 md5 units here for a total of 32
compute nodes. The choice of number of nodes is pretty arbitrary. In our case
it’s convenient since the lightweight HPS-to-FPGA bridge is 32-bit, so we can
use a single word each for the start, reset, and done signals.
Using a for-generate statement saves us a lot of typing here.
Avalon Interfaces
Now we’ll need some Avalon slave interfaces. First, we’ll make one interface
for the control and status signals: reset, start, and done.
Notice that, unless we are writing to start or reset, we set the signals
back to 0 after every cycle. This ensures that the signals are only high for
a single cycle as expected.
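To make the pulse behaviour concrete, here is a small software model of it in C. This is only a behavioural sketch, not the HDL itself; the word addresses (0 for reset, 1 for start, 2 for done) and all of the names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed word addresses; the real register map may differ. */
enum { ADDR_RESET = 0, ADDR_START = 1, ADDR_DONE = 2 };

/* The two pulse registers driven by the control slave. */
struct md5_control_model {
    uint32_t reset;
    uint32_t start;
};

/*
 * Model one clock edge. A register takes the written value only on the
 * cycle in which it is written; on every other cycle it falls back to 0,
 * so each bit is high for exactly one cycle per write.
 */
void control_clock(struct md5_control_model *m, int write,
                   unsigned address, uint32_t writedata)
{
    assert(m != 0);
    m->reset = (write && address == ADDR_RESET) ? writedata : 0;
    m->start = (write && address == ADDR_START) ? writedata : 0;
}
```

Writing 0x5 to the start address raises the start bits for units 0 and 2 for one call of control_clock; the next call without a write clears them again.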
We also need Avalon slaves for the input and output signals in md5group.
These slaves aren’t very interesting, since they pretty much just pass the
signals through, so I won’t show them here. You can look at them on GitHub
if you’re really interested.
Qsys
If you’re using the code on GitHub, you can open the soc_system.qsys file in
Qsys and click generate. I basically just created the peripherals with the
avs_* signals connected to the Avalon interface and the md5_* signals
exported to conduits. As before, Qsys will generate a .qip file at
“soc_system/synthesis/soc_system.qip”.
At this point, I ran into a really weird bug with Qsys. No matter what I did,
it refused to export the md5_write signal from the md5input peripheral.
I ended up having to edit the generated file manually to make it work. Since
this is kind of tedious to do every time, I wrote a script that fixes the
generated “soc_system.v” file automatically. If you’re using the code from
GitHub, just run the “fix_generated_system.sh” script.
At this point, you can run Analysis and Synthesis, followed by the memory pin
placement TCL script that was generated by Qsys. After that’s done, you can
compile the programming file for the FPGA. You don’t have to run the full
compilation to get the .sof file; you only have to run up to the
“Assembler” stage. You can do this by double-clicking “Assembler” in the
“Tasks” window at the center left. Be prepared to wait a while: Quartus’s
fitter can take a really long time for larger designs.
Once the assembler stage has finished, you can convert the .sof file to a
raw binary file if you so choose.
The Software
The software we run on the board will have to generate a bunch of inputs,
pad them appropriately, and then load everything into the FPGA registers.
We’ll just use the “mmap /dev/mem” technique again to access the bridge memory.
We’ll use a struct to keep track of all the pointers and files we’ll need.
Then we can create functions for initializing and cleaning up the struct.
You might want to change the offset constants if your offsets are different.
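A minimal sketch of such a struct with its init and cleanup functions might look like the following. The peripheral offsets, the mapped span, and every name here are placeholders I made up for illustration, so check them against your own Qsys address map. The base address 0xff200000 is where the lightweight HPS-to-FPGA bridge normally sits on the Cyclone V, but treat that as an assumption too. The path and base are passed in as parameters so that the same code can be tried against an ordinary file.

```c
#define _FILE_OFFSET_BITS 64 /* keep off_t wide enough for 0xff200000 on 32-bit ARM */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed values; change them to match your own system. */
#define LWHPS2FPGA_BASE    0xff200000UL
#define MD5_CONTROL_OFFSET 0x0000
#define MD5_INPUT_OFFSET   0x1000
#define MD5_OUTPUT_OFFSET  0x2000
#define MD5_MAP_SPAN       0x3000

struct md5_cracker {
    int fd;                      /* descriptor for /dev/mem */
    void *map;                   /* start of the mapped bridge region */
    volatile uint32_t *control;  /* reset, start, and done words */
    volatile uint32_t *input;    /* padded input blocks */
    volatile uint32_t *output;   /* digests */
};

/* Map the bridge region and set up the register pointers. Returns 0 on success. */
int md5_cracker_init(struct md5_cracker *c, const char *path, off_t base)
{
    uint8_t *bytes;

    assert(c != NULL && path != NULL);
    c->fd = open(path, O_RDWR | O_SYNC);
    if (c->fd < 0)
        return -1;
    c->map = mmap(NULL, MD5_MAP_SPAN, PROT_READ | PROT_WRITE,
                  MAP_SHARED, c->fd, base);
    if (c->map == MAP_FAILED) {
        close(c->fd);
        return -1;
    }
    bytes = c->map;
    c->control = (volatile uint32_t *)(bytes + MD5_CONTROL_OFFSET);
    c->input   = (volatile uint32_t *)(bytes + MD5_INPUT_OFFSET);
    c->output  = (volatile uint32_t *)(bytes + MD5_OUTPUT_OFFSET);
    return 0;
}

/* Undo md5_cracker_init. */
void md5_cracker_release(struct md5_cracker *c)
{
    munmap(c->map, MD5_MAP_SPAN);
    close(c->fd);
}
```

On the board you would call md5_cracker_init(&c, "/dev/mem", LWHPS2FPGA_BASE) as root.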
We’ll want convenience functions for doing five key operations: resetting an
MD5 unit, starting a unit, checking if a unit is done, copying the input in,
and copying the output back. These definitions are pretty simple.
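Here is a rough sketch of what the five helpers could look like. It assumes one bit per unit in each of the reset, start, and done words, with those words at indices 0, 1, and 2 of the control slave, and 16 input words and 4 output words per unit laid out back to back. All of that layout is an assumption for illustration, as are the function names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed register layout. */
enum { MD5_RESET_REG = 0, MD5_START_REG = 1, MD5_DONE_REG = 2 };
enum { MD5_UNITS = 32, MD5_IN_WORDS = 16, MD5_OUT_WORDS = 4 };

/* Pulse the reset bit of one unit. */
void md5_reset_unit(volatile uint32_t *control, unsigned unit)
{
    assert(unit < MD5_UNITS);
    control[MD5_RESET_REG] = UINT32_C(1) << unit;
}

/* Pulse the start bit of one unit. */
void md5_start_unit(volatile uint32_t *control, unsigned unit)
{
    assert(unit < MD5_UNITS);
    control[MD5_START_REG] = UINT32_C(1) << unit;
}

/* Return 1 if the unit's done bit is set, 0 otherwise. */
int md5_unit_done(volatile uint32_t *control, unsigned unit)
{
    assert(unit < MD5_UNITS);
    return (int)((control[MD5_DONE_REG] >> unit) & 1u);
}

/* Copy one padded 512-bit block into the unit's input registers. */
void md5_copy_input(volatile uint32_t *input, unsigned unit,
                    const uint32_t block[MD5_IN_WORDS])
{
    unsigned i;

    assert(unit < MD5_UNITS);
    for (i = 0; i < MD5_IN_WORDS; i++)
        input[unit * MD5_IN_WORDS + i] = block[i];
}

/* Copy the unit's 128-bit digest out of the output registers. */
void md5_copy_output(volatile uint32_t *output, unsigned unit,
                     uint32_t digest[MD5_OUT_WORDS])
{
    unsigned i;

    assert(unit < MD5_UNITS);
    for (i = 0; i < MD5_OUT_WORDS; i++)
        digest[i] = output[unit * MD5_OUT_WORDS + i];
}
```

Because start and reset clear themselves after one cycle, writing a word with a single bit set only touches that one unit.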
Now we’ll need code to generate the byte sequences we’ll use as our inputs.
For our purposes, let’s generate all the alphanumeric strings that can fit
in a single 512-bit MD5 block. Since MD5 padding takes up at least 9 bytes of
the block, that means strings of up to 55 bytes.
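One way to enumerate these strings is to treat the string as a base-62 counter that grows by one character every time it overflows. The sketch below does that. The seq_state struct, the seq_init helper, the alphabet order, and the exact next_sequence signature are my own assumptions for illustration.

```c
#include <assert.h>
#include <string.h>

#define SEQ_MAX_LEN 55 /* longest message that still fits in one MD5 block */

static const char ALPHABET[] =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
#define ALPHABET_LEN (sizeof(ALPHABET) - 1)

struct seq_state {
    unsigned char idx[SEQ_MAX_LEN]; /* alphabet index of each character */
    int len;                        /* current length; 0 before the first call */
    int max_len;                    /* stop after strings of this length */
};

void seq_init(struct seq_state *s, int max_len)
{
    assert(max_len >= 1 && max_len <= SEQ_MAX_LEN);
    memset(s, 0, sizeof(*s));
    s->max_len = max_len;
}

/*
 * Write the next string to out (not NUL-terminated) and return its length,
 * or return 0 once every string up to max_len has been produced.
 */
int next_sequence(struct seq_state *s, char *out)
{
    int i;

    if (s->len == 0) {
        s->len = 1; /* first call: start with the one-character string "0" */
    } else if (s->len > s->max_len) {
        return 0;   /* already exhausted */
    } else {
        /* Increment like an odometer; position 0 changes fastest. */
        for (i = 0; i < s->len; i++) {
            if (++s->idx[i] < ALPHABET_LEN)
                break;
            s->idx[i] = 0;
        }
        if (i == s->len) {
            /* Every position wrapped around, so move to the next length. */
            s->len++;
            if (s->len > s->max_len)
                return 0;
        }
    }
    for (i = 0; i < s->len; i++)
        out[i] = ALPHABET[s->idx[i]];
    return s->len;
}
```

With max_len set to 2, this produces the 62 one-character strings followed by the 3844 two-character ones. In practice you would never get anywhere near 55 characters, of course.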
The next_sequence function copies the next string in the sequence to the
pointer given and returns the length of the string or 0 if all of the strings
have been generated. Now that we have a way to generate inputs and control the
FPGA, we can write the main loop of our program.
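One detail the loop has to take care of before loading a string is the padding mentioned earlier. For a message that fits in one block, MD5 padding means appending a 0x80 byte, filling with zeros up to byte 56, and storing the message length in bits as a 64-bit little-endian value in the last 8 bytes. Here is a small sketch; the function name is made up, and the byte order in which the hardware expects the resulting words is an assumption you should check against your MD5 unit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MD5_BLOCK_BYTES   64
#define MD5_MAX_MSG_BYTES 55 /* 55 + one 0x80 byte + 8 length bytes = 64 */

/* Pad a short message into a single 512-bit MD5 block. */
void md5_pad_block(const char *msg, size_t len, uint8_t block[MD5_BLOCK_BYTES])
{
    uint64_t bits = (uint64_t)len * 8; /* MD5 stores the length in bits */
    int i;

    assert(len <= MD5_MAX_MSG_BYTES);
    memset(block, 0, MD5_BLOCK_BYTES);
    memcpy(block, msg, len);
    block[len] = 0x80;                 /* a single 1 bit, then zeros */
    for (i = 0; i < 8; i++)            /* length, least significant byte first */
        block[56 + i] = (uint8_t)(bits >> (8 * i));
}
```

For the 3-byte message “abc”, byte 3 of the block becomes 0x80 and byte 56 becomes 24, the length in bits.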
Basically, the main loop will cycle through the 32 units round-robin. At each
iteration, it waits for the current unit to finish, copies the result back
from it if the unit had been given work, resets the unit, copies the next
sequence to it, and then starts the unit again. This way, we can keep all 32
units busy.
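A stripped-down sketch of such a loop is shown below. To keep it short, it takes an array of already padded blocks instead of generating them on the fly, and it tracks in software which units have been given work so that it only waits on those. The register layout (reset, start, and done at word indices 0, 1, and 2, with one bit per unit) and all of the names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { NUM_UNITS = 32, IN_WORDS = 16, OUT_WORDS = 4 };
enum { REG_RESET = 0, REG_START = 1, REG_DONE = 2 }; /* assumed word indices */

struct md5_regs {
    volatile uint32_t *control;
    volatile uint32_t *input;
    volatile uint32_t *output;
};

/*
 * Hash nblocks padded blocks (IN_WORDS words each, stored back to back).
 * The digest for block k is written to digests[k * OUT_WORDS ...].
 * Returns the number of digests collected.
 */
size_t md5_run(const struct md5_regs *r, const uint32_t *blocks,
               size_t nblocks, uint32_t *digests)
{
    size_t job[NUM_UNITS];       /* which block each unit is working on */
    int busy[NUM_UNITS] = {0};   /* 1 if the unit has been given work */
    size_t next = 0, finished = 0;
    unsigned unit = 0, i;

    assert(r != NULL);
    while (finished < nblocks) {
        if (busy[unit]) {
            /* Wait for this unit, then copy its digest back. */
            while (!((r->control[REG_DONE] >> unit) & 1u))
                ;
            for (i = 0; i < OUT_WORDS; i++)
                digests[job[unit] * OUT_WORDS + i] =
                    r->output[unit * OUT_WORDS + i];
            busy[unit] = 0;
            finished++;
        }
        if (next < nblocks) {
            /* Reset the unit, load the next block, and start it again. */
            r->control[REG_RESET] = UINT32_C(1) << unit;
            for (i = 0; i < IN_WORDS; i++)
                r->input[unit * IN_WORDS + i] = blocks[next * IN_WORDS + i];
            r->control[REG_START] = UINT32_C(1) << unit;
            job[unit] = next++;
            busy[unit] = 1;
        }
        unit = (unit + 1) % NUM_UNITS;
    }
    return finished;
}
```

While the loop is waiting on one unit, the other 31 keep computing, which is what keeps the hardware busy. A streaming version would call the sequence generator and the padding step where this one reads from the blocks array.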
Results
Using this 32-unit MD5 cracker, we can compute hashes on the SoCKit at about
80 thousand hashes per second. This is actually not that great. A single 2.7
GHz Intel Xeon core on my MacBook Air can run at tens of millions of hashes
per second. A high-end GPU running oclHashcat can run the algorithm at 23
billion hashes per second. There are probably ways to speed up our computation,
perhaps by instantiating more MD5 units or moving our sequence generation to
the FPGA, but it’s unlikely that the raw performance would catch up to that
of the Xeon.
Conclusion
So now you have seen how to implement moderately complex algorithms on the FPGA.
You have also seen that mid-range FPGAs like the Cyclone V aren’t actually
that great when it comes to raw performance. Where FPGAs really shine is in
strictly timed IO and real-time processing. We’ll explore the use of FPGAs in
these sorts of applications in the next few posts, as we look at implementing
real-time digital audio effects on the FPGA.