Exploring the Arrow SoCKit Part V - Computing MD5 Checksums on the FPGA

In parts I-IV of this tutorial series, we looked at how to create a simple flashing LED module on the FPGA and control it via software. Now, we will see how to use the FPGA to perform computation. To do this, we will take a non-trivial algorithm and implement it in Verilog. I have chosen to use the MD5 checksum algorithm as the example, since it is relatively easy to implement in hardware. Once we have a hardware unit that can compute MD5 sums, we can instantiate multiple instances of it and hook them up to the CPU to perform brute-force MD5 hash reversal.

I’ll be breaking up the development of the MD5 cracker across several blog posts, just as I did with the flashing LED example. This post will focus on implementing the algorithm. Later posts will discuss verification through simulation and control via software.

The Algorithm

Here is the pseudocode for the MD5 algorithm, taken from Wikipedia.

//Note: All variables are unsigned 32 bit and wrap modulo 2^32 when calculating
var int[64] s, K

//s specifies the per-round shift amounts
s[ 0..15] := { 7, 12, 17, 22,  7, 12, 17, 22,  7, 12, 17, 22,  7, 12, 17, 22}
s[16..31] := { 5,  9, 14, 20,  5,  9, 14, 20,  5,  9, 14, 20,  5,  9, 14, 20}
s[32..47] := { 4, 11, 16, 23,  4, 11, 16, 23,  4, 11, 16, 23,  4, 11, 16, 23}
s[48..63] := { 6, 10, 15, 21,  6, 10, 15, 21,  6, 10, 15, 21,  6, 10, 15, 21}

//Use binary integer part of the sines of integers (Radians) as constants:
for i from 0 to 63
    K[i] := floor(abs(sin(i + 1)) × (2 pow 32))
end for
//(Or just use the following table):
K[ 0.. 3] := { 0xd76aa478, 0xe8c7b756, 0x242070db, 0xc1bdceee }
K[ 4.. 7] := { 0xf57c0faf, 0x4787c62a, 0xa8304613, 0xfd469501 }
K[ 8..11] := { 0x698098d8, 0x8b44f7af, 0xffff5bb1, 0x895cd7be }
K[12..15] := { 0x6b901122, 0xfd987193, 0xa679438e, 0x49b40821 }
K[16..19] := { 0xf61e2562, 0xc040b340, 0x265e5a51, 0xe9b6c7aa }
K[20..23] := { 0xd62f105d, 0x02441453, 0xd8a1e681, 0xe7d3fbc8 }
K[24..27] := { 0x21e1cde6, 0xc33707d6, 0xf4d50d87, 0x455a14ed }
K[28..31] := { 0xa9e3e905, 0xfcefa3f8, 0x676f02d9, 0x8d2a4c8a }
K[32..35] := { 0xfffa3942, 0x8771f681, 0x6d9d6122, 0xfde5380c }
K[36..39] := { 0xa4beea44, 0x4bdecfa9, 0xf6bb4b60, 0xbebfbc70 }
K[40..43] := { 0x289b7ec6, 0xeaa127fa, 0xd4ef3085, 0x04881d05 }
K[44..47] := { 0xd9d4d039, 0xe6db99e5, 0x1fa27cf8, 0xc4ac5665 }
K[48..51] := { 0xf4292244, 0x432aff97, 0xab9423a7, 0xfc93a039 }
K[52..55] := { 0x655b59c3, 0x8f0ccc92, 0xffeff47d, 0x85845dd1 }
K[56..59] := { 0x6fa87e4f, 0xfe2ce6e0, 0xa3014314, 0x4e0811a1 }
K[60..63] := { 0xf7537e82, 0xbd3af235, 0x2ad7d2bb, 0xeb86d391 }

//Initialize variables:
var int a0 := 0x67452301   //A
var int b0 := 0xefcdab89   //B
var int c0 := 0x98badcfe   //C
var int d0 := 0x10325476   //D

//Pre-processing: adding a single 1 bit
append 1 bit to message
// Notice: the input bytes are considered as bit strings,
// where the first bit is the most significant bit of the byte.


//Pre-processing: padding with zeros
append 0 bit until message length in bit == 448 (mod 512)
append length mod (2 pow 64) to message


//Process the message in successive 512-bit chunks:
for each 512-bit chunk of message
    break chunk into sixteen 32-bit words M[j], 0 <= j <= 15
//Initialize hash value for this chunk:
    var int A := a0
    var int B := b0
    var int C := c0
    var int D := d0
//Main loop:
    for i from 0 to 63
        if 0 <= i <= 15 then
            F := (B and C) or ((not B) and D)
            g := i
        else if 16 <= i <= 31
            F := (D and B) or ((not D) and C)
            g := (5 * i + 1) mod 16
        else if 32 <= i <= 47
            F := B xor C xor D
            g := (3 * i + 5) mod 16
        else if 48 <= i <= 63
            F := C xor (B or (not D))
            g := (7 * i) mod 16
        dTemp := D
        D := C
        C := B
        B := B + leftrotate((A + F + K[i] + M[g]), s[i])
        A := dTemp
    end for
//Add this chunk's hash to result so far:
    a0 := a0 + A
    b0 := b0 + B
    c0 := c0 + C
    d0 := d0 + D
end for

var char digest[16] := a0 append b0 append c0 append d0 //(Output is in little-endian)

//leftrotate function definition
leftrotate (x, c)
    return (x << c) binary or (x >> (32-c));

The first decision to be made is what parts of the algorithm will be implemented on the FPGA and what will be done in the control code. In general, computations with simple control flow that are repeated over and over again are a good fit for the FPGA. It’s also difficult to get FPGAs to handle data that can be variable in size. Therefore, we will implement the main loop, which processes a 512-bit chunk of the input, in Verilog. The CPU control code can then take care of appropriately padding the input and feeding it in chunk-by-chunk to the FPGA.

Now that we have decided what to implement, we must figure out how to break the computation up into blocks, which can then be implemented as Verilog separate modules.

First, it’s clear that we will need ROMs for K and s, as well as a RAM for the input M. We will also need a module to compute F, a unit to compute g, a left rotater, a 32-bit adder, and finally a state machine to make sure the computatation proceeds in the correct order.

Implementing Memories

The Cyclone V can efficiently implement small memories using on-chip block RAM. In order to generate ROMs and RAMs that use the block RAM, we will use the Altera MegaWizard tool.

To open MegaWizard, go to “Tools” -> “MegaWizard Plug-In Manager”. This will allow you to pick from a list of “megafunctions”, which are configurable hardware IP blocks that can be added to a design. Right now, we are interested in the megafunctions in the “Memory Compiler” section.

FPGA block RAM can have two separate sets of address and data signals. Since we will be instantiating several many copies of our MD5 unit, it makes sense to have them share memory as much as possible, so we will use the 2-port ROM for K and s. In the first options page, select “64” for the number of words in the memory. Then choose the correct word size. This should be 32 for K and 5 for s. For “What should the memory block type be”, choose “M10K”. You can skip the section on clocks, since the defaults there are fine. On the third page, it will ask you which ports you want to put registers on. The registers on the inputs are required, but the registers on the output ports are optional. Each set of registers adds a cycle of delay, so deselect placing registers on the outputs so that the ROM has only a single cycle delay. On the fourth page, it will ask you to provide a file which specifies what values will be in the ROM. You can choose to provide either a Memory Initialization File (.mif) or an Intel hex file (.hex). The former is more human-readable, while the latter can be more easily computer-generated. You can find the .mif files for the K ROM and s ROM in the Github repository. After you’ve finished memory intialization, you can click “Finish” to generate the Verilog modules for the memory.

You can follow the same procedure to generate the RAM for M. Choose the 2-port RAM in the “Memory Compiler” section. Create a RAM with one read port and one write port. The wizard for RAMs has an extra section asking you what should happen if you try to read from an address that is being written. Choose “I do not care” for this. Also, for RAMs, you do not have to specify a memory initialization file.

Combinational Units

Now that we have our memories, we can look at the combinational logic portions of our circuit. First, let’s consider the computation for the F variable. The F signal is generated from bitwise combinations of B, C, and D. However, the particular operation depends on the value of the 6-bit loop counter i. You will notice, however, that the only information we need is whether it is in one of four equally sized ranges. We can determine this by examining only the top two bits of i.

module fcalc (
    input [1:0] sel,
    input [31:0] b,
    input [31:0] c,
    input [31:0] d,
    output reg [31:0] f
);

always @(*) begin
    case (sel)
        // 0 <= i <= 15
        2'b00: f <= (b & c) | (~b & d);
        // 16 <= i <= 31
        2'b01: f <= (d & b) | (~d & c);
        // 32 <= i <= 47
        2'b10: f <= b ^ c ^ d;
        // 48 <= i <= 63
        2'b11: f <= c ^ (b | ~d);
    endcase
end

endmodule

The &, |, ^, and ~ operators are bitwise AND, OR, XOR, and NOT operations, respectively.

To compute g, we have to mutltiply i by certain constants, add it to certain constants, and then take the result modulo 16. We can simplify this by noting that taking the modulus of a binary integer by 16 is the same as taking the lowest four bits. We can also simplify our logic based on the fact that

a * i + b == a * (i % 16) + b (mod 16)

That is, multiplying and adding based on the last four bits of i and then taking the last four bits of the result is the same as multiplying and adding by all six bits of i.

The final trick we can use is to get rid of multiplication, which is an expensive logic operation. What you will notice is that the constants we are multiplying by (3, 5, and 7) are all one away from a power of two (2, 4, and 8). This is very convenient, since we can easily multiply by powers of two by left shifting bits. We can then add or subtract the original number from the shifted number to get the final result of the “multiplication”. Therefore, we can express our multiplications using the following equivalents.

3 * i = 2 * i + i = i << 1 + i
5 * i = 4 * i + i = i << 2 + i
7 * i = 8 * i - i = i << 3 - i

The amount by which we shift i and the number we add to it will be selected based on the two highest bits of i, just as we did with the F calculator.

module gcalc (
    input  [5:0] i,
    output [3:0] g
);

reg doshift;
reg sub;
reg  [1:0] shiftby;
reg  [2:0] addon;

wire [3:0] shift_res = (doshift) ? i[3:0] << shiftby : 4'b0;
wire [3:0] mult_res = (sub) ? shift_res - i[3:0] : shift_res + i[3:0];
assign g = mult_res + addon;

always @(*) begin
    case (i[5:4])
        // 0 <= i <= 15
        2'b00: begin
            // g = i
            doshift <= 1'b0;
            sub <= 1'b0;
            shiftby <= 2'b0;
            addon <= 3'd0;
        end
        // 16 <= i <= 31
        2'b01: begin
            // g = 5 * i + 1
            doshift <= 1'b1;
            sub <= 1'b0;
            shiftby <= 2'b10;
            addon <= 3'd1;
        end
        // 32 <= i <= 47
        2'b10: begin
            // g = 3 * i + 5
            doshift <= 1'b1;
            sub <= 1'b0;
            shiftby <= 2'b01;
            addon <= 3'd5;
        end
        // 48 <= i <= 63
        2'b11: begin
            // g = 7 * i
            doshift <= 1'b1;
            sub <= 1'b1;
            shiftby <= 2'b11;
            addon <= 3'd0;
        end
    endcase
end

endmodule

Then there is the left rotater, which is fairly straightforward. To express a left rotation in verilog, you can double the input, left shift it, and then take the top half of the bits.

module leftrotate (
    input  [31:0] rotin,
    input  [4:0]  rotby,
    output [31:0] rotout
);

wire [63:0] rotmid = ({rotin, rotin} << rotby);
assign rotout = rotmid[63:32];

endmodule

For the adder, we will simply use the verilog + operator. Now that we have our combinational units, we will need to tie them together to perform the inner loop of the computation.

Sequential Computation

Stateless combinational logic is not enough to perform the computation we want. Therefore, we will need a state machine to store intermediate results in registers and select the correct inputs for the combinational units. In the body of the inner loop of the MD5 algorithm, there are a lot of 32-bit additions. However, we would like to reduce the number of 32-bit adders we use so as not to use too many logic elements. Fortunately, through clever sequencing, there is a way to do the computation in just four cycles using only one 32-bit adder.

Step 0
    calculate F
    calculate g
    t0 = A + K[i]
Step 1
    t1 = F + M[g]
Step 2
    t0 = t0 + t1
Step 3
    A = D
    B = B + leftrotate(t0, s[i])
    C = B
    D = C
    i = i + 1

These four steps will have to be repeated 64 times, with i incrementing each time. Then, at the end, we will need four more cycles to add A, B, C, and D onto a0, b0, c0, and d0.

module chunk_cruncher (
    input clk,
    input reset,
    input start,
    output done,

    output [127:0] digest,

    output [5:0]  iaddr,
    input  [31:0] kdata,
    input  [4:0]  sdata,

    output [3:0]  gaddr,
    input  [31:0] mdata
);

parameter INITA = 32'h67452301;
parameter INITB = 32'hefcdab89;
parameter INITC = 32'h98badcfe;
parameter INITD = 32'h10325476;

reg [31:0] a0;
reg [31:0] b0;
reg [31:0] c0;
reg [31:0] d0;

reg [31:0] areg;
reg [31:0] breg;
reg [31:0] creg;
reg [31:0] dreg;

assign digest = {d0, c0, b0, a0};

reg [5:0] ireg;

wire [31:0] f;
reg  [31:0] freg;

reg  [31:0] adda;
reg  [31:0] addb;
wire [31:0] adds = adda + addb;

reg [31:0] t0;
reg [31:0] t1;

wire [31:0] rotated;

wire [31:0] inext = ireg + 1'b1;
assign iaddr = ireg;

fcalc fc (
    .sel (ireg[5:4]),
    .b (breg),
    .c (creg),
    .d (dreg),
    .f (f)
);

gcalc gc (
    .i (ireg),
    .g (gaddr)
);

leftrotate lr (
    .rotin (t0),
    .rotby (sdata),
    .rotout (rotated)
);

parameter CRUNCH   = 2'b00;
parameter FINALIZE = 2'b01;
parameter FINISHED = 2'b10;

reg [1:0] stage;

assign done = (stage == FINISHED);

reg [1:0] step;

always @(*) begin
    if (stage == CRUNCH) begin
        case (step)
            2'b00: begin
                adda <= areg;
                addb <= kdata;
            end
            2'b01: begin
                adda <= freg;
                addb <= mdata;
            end
            2'b10: begin
                adda <= t0;
                addb <= t1;
            end
            2'b11: begin
                adda <= breg;
                addb <= rotated;
            end
        endcase
    end else if (stage == FINALIZE) begin
        case (step)
            2'b00: begin
                adda <= a0;
                addb <= areg;
            end
            2'b01: begin
                adda <= b0;
                addb <= breg;
            end
            2'b10: begin
                adda <= c0;
                addb <= creg;
            end
            2'b11: begin
                adda <= d0;
                addb <= dreg;
            end
        endcase
    end else begin
        adda <= 32'd0;
        addb <= 32'd0;
    end
end

always @(posedge clk) begin
    if (reset) begin
        a0 <= INITA;
        b0 <= INITB;
        c0 <= INITC;
        d0 <= INITD;
        ireg <= 6'd00;
        stage <= FINISHED;
    end else if (start) begin
        areg <= a0;
        breg <= b0;
        creg <= c0;
        dreg <= d0;
        step <= 2'b00;
        stage <= CRUNCH;
    end else if (stage == CRUNCH) begin
        case (step)
            2'b00: begin
                freg <= f;
                // t0 = areg + kdata
                t0 <= adds;
                step <= 2'b01;
            end
            2'b01: begin
                // t1 = freg + mdata
                t1 <= adds;
                step <= 2'b10;
            end
            2'b10: begin
                // t0 = t0 + t1
                t0 <= adds;
                step <= 2'b11;
                ireg <= inext;
            end
            2'b11: begin
                areg <= dreg;
                // breg = breg + rotate(t0, sdata)
                breg <= adds;
                creg <= breg;
                dreg <= creg;
                if (ireg != 6'd0) begin
                    step <= 2'b00;
                end else begin
                    step <= 2'b00;
                    stage <= FINALIZE;
                end
            end
        endcase
    end else if (stage == FINALIZE) begin
        case (step)
            2'b00: begin
                // a0 += areg
                a0 <= adds;
                step <= 2'b01;
            end
            2'b01: begin
                // b0 += breg
                b0 <= adds;
                step <= 2'b10;
            end
            2'b10: begin
                // c0 += creg
                c0 <= adds;
                step <= 2'b11;
            end
            2'b11: begin
                // d0 += dreg
                d0 <= adds;
                stage <= FINISHED;
            end
        endcase
    end
end

endmodule

This chunk_cruncher module performs the computation in the inner loop of the MD5 algorithm. It has address and data ports for the ROMs and RAMs, as well as reset and start signals. The reset signal will tell the unit to begin a new MD5 calculation, and the start signal will tell the unit to process a new 512-bit chunk.

The first always block is combinational and simply multiplexes the inputs of the 32-bit adder based on the current step and stage. The second always block is our state machine.

Note that, oddly enough, we do i = i + 1 in step 2 instead of step 3. This is because there are registers on the address inputs of our ROMs. We change the value of i in step 2. In step 3, the value of s will be based on the old value of i. Then, when we wrap around back to step 0, the value of k will be based on the new value of i.

Putting Memory and Computation Together

Now that we have our memory and computational blocks, we’ll have to put them together in order to make them do something useful. Since our ROMs are dual ported, we’ll put together two “chunk_cruncher” modules, an sdata ROM, a kdata ROM, and two mdata RAMs.

module md5unit (
    input clk,
    input [1:0] reset,
    input [1:0] start,

    input write,
    input [31:0] writedata,
    input [4:0]  writeaddr,

    output [127:0] digest0,
    output [127:0] digest1,

    output [1:0] done
);

wire [1:0] m_write;
assign m_write[0] = write && !writeaddr[4];
assign m_write[1] = write && writeaddr[4];

wire [5:0] cc_iaddr [1:0];
wire [3:0] cc_gaddr [1:0];

wire [127:0] cc_digest [1:0];
assign digest0 = cc_digest[0];
assign digest1 = cc_digest[1];

wire [31:0] cc_mdata [1:0];
wire [4:0]  cc_sdata [1:0];
wire [31:0] cc_kdata [1:0];

genvar i;
generate
    for (i = 0; i < 2; i = i + 1) begin : mccgen
        chunk_cruncher cc (
            .clk (clk),
            .reset (reset[i]),
            .start (start[i]),
            .done (done[i]),
            .digest (cc_digest[i]),
            .iaddr (cc_iaddr[i]),
            .kdata (cc_kdata[i]),
            .sdata (cc_sdata[i]),
            .gaddr (cc_gaddr[i]),
            .mdata (cc_mdata[i])
        );

        mdataram mram (
            .clock (clk),
            .data (writedata),
            .wraddress (writeaddr[3:0]),
            .wren (m_write[i]),
            .rdaddress (cc_gaddr[i]),
            .q (cc_mdata[i])
        );
    end
endgenerate

sdatarom srom (
    .clock (clk),
    .address_a (cc_iaddr[0]),
    .address_b (cc_iaddr[1]),
    .q_a (cc_sdata[0]),
    .q_b (cc_sdata[1])
);

kdatarom krom (
    .clock (clk),
    .address_a (cc_iaddr[0]),
    .address_b (cc_iaddr[1]),
    .q_a (cc_kdata[0]),
    .q_b (cc_kdata[1])
);

endmodule

Conclusion

So now we’ve built are MD5 unit. But how do we make sure it works? In my next post, I’ll discuss verification using the ModelSim circuit simulator.

<- Part 4 Part 6 ->