Exploring the Arrow SoCKit Part IX - Real-time Audio Filters
In the last post, we looked at how to interface with the audio codec on the Cyclone V. Now, we will use the audio codec to implement a real-time audio filter. This filter will take in samples from the ADC, transform them, and output the transformed samples on the DAC. All of the hardware descriptions and software tools for this part can be found in the part 9 branch of the GitHub repo.
To implement this filter we will use a finite impulse response (FIR) filter. The output of an FIR filter is a weighted sum of the N most recent inputs. By choosing different weights, we can achieve different effects, such as low-pass filtering and high-pass filtering. If you’d like to learn more about FIR filters, I’ve written an IPython notebook explaining the concept.
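To make the weighted sum concrete, here is a minimal Python sketch of a direct-form FIR filter. The function name fir_filter and the moving-average weights are illustrative choices for this sketch, not part of the hardware design, and inputs before the start of the signal are assumed to be zero.

```python
def fir_filter(samples, weights):
    """Return y[n] = sum(weights[i] * samples[n - i]) for each n.

    Samples before the start of the signal are treated as zero
    (an assumption made for this sketch).
    """
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for i, w in enumerate(weights):
            if n - i >= 0:
                acc += w * samples[n - i]
        out.append(acc)
    return out


# A 4-tap moving average is a crude low-pass filter: it smooths a step input.
smoothed = fir_filter([0, 0, 4, 4, 4, 4], [0.25, 0.25, 0.25, 0.25])
print(smoothed)
```

The step from 0 to 4 comes out as a gradual ramp, which is the low-pass behaviour we are after.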
Block Diagram
Before we begin implementing our filter, it’s helpful to draw out a simple block diagram.
Our audio samples will be written to a circular buffer of size N. The weights of the FIR filter will be held in a ROM also of size N (I call it the “Kernel ROM” because the set of weights is often referred to as a “kernel”). To compute the next output, we repeatedly fetch a sample from the audio buffer and a weight from the kernel ROM, multiply the two together, and then accumulate the products (through addition) into a register.
Notice that, in this block diagram, there are registers between each computational block. This kind of architecture is called a pipelined architecture, and is fairly common in hardware design. The main benefit of this architecture is that the pipeline stages are operating concurrently. That is, on each cycle, the memories are fetching data, the multipliers are multiplying the data fetched one cycle ago, and the adder is adding the current accumulator value with the result of multiplying the data fetched two cycles ago. This means that, if our kernel has size N, the total number of cycles it takes to compute the result is N + 3. This is a lot faster than if we only performed one step of computation at a time, in which case it would take 3 * N cycles.
Counting Cycles
In order to implement this filter, we need to be able to compute outputs within a certain amount of time. An input is available when sample_end is asserted, and the corresponding output must be ready before the next time sample_req is asserted. If we can’t make this deadline, the system isn’t going to work. This is the “real-time” in “real-time audio effects”. We are subject to constraints on how long it can take to respond to an event.
We’ve seen that the time it takes to compute an output depends on how large the filter kernel is. Since a larger kernel generally gives a sharper frequency response, it is important for us to figure out how big we can make the kernel without missing the deadline.
With a sampling rate of 44.1 kHz, the sampling period is about 22.7 microseconds. The input sample is available a quarter of the way through the cycle, and the output needs to be ready at the beginning of the next cycle. That means our computation must take less than about 17 microseconds. The main clock is 50 MHz, which gives us 20 nanoseconds per cycle. That gives us about 17000 / 20 = 850 cycles to perform our computation. So you could expand the kernel to around 840 words and still be able to meet the deadline.
I chose to use a kernel with 101 words (you generally want the number to be odd).
I chose this by writing a software FIR filter implementation (running on my
laptop, not the Cyclone V) and testing different kernel sizes. You get pretty
good quality at 101, and increasing past that point doesn’t seem to give
much improvement.
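The budget arithmetic in this section can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the constant and function names are made up for illustration, and it takes the N + 3 and 3 * N cycle counts from the pipeline discussion at face value.

```python
SAMPLE_RATE_HZ = 44_100
CLOCK_HZ = 50_000_000
PIPELINE_OVERHEAD = 3  # extra cycles, from the N + 3 estimate

sample_period_us = 1e6 / SAMPLE_RATE_HZ   # about 22.7 us
budget_us = 0.75 * sample_period_us       # input arrives a quarter of the way in
clock_period_ns = 1e9 / CLOCK_HZ          # 20 ns
budget_cycles = int(budget_us * 1000 / clock_period_ns)


def pipelined_cycles(kernel_size):
    """Cycles for one output when the three stages overlap."""
    return kernel_size + PIPELINE_OVERHEAD


def sequential_cycles(kernel_size):
    """Cycles for one output when fetch, multiply, and add run one at a time."""
    return 3 * kernel_size


print(budget_cycles, pipelined_cycles(101), sequential_cycles(101))
```

With a 101-word kernel the pipeline needs about 104 cycles, a small fraction of the roughly 850 available.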
Circular Buffers
We need to store the last N inputs in the audio buffer as they come in from the ADC. In order to do this, we need a circular buffer. This is a data structure which is basically an array that wraps back around on itself. Data is always written to the “head” of the buffer. The address of the “head” is incremented on each write. When the head is at the last address, it goes back to address 0 on the next write. In hardware, we can implement this using block RAM for the “array” and storing the “head” address in a register.
The audio_ram component here was generated using the 2-port RAM megafunction in MegaWizard. To generate this yourself, give the RAM 128 16-bit words with no registers on the output.
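As a software analogy for the wrap-around behaviour described above, here is a small Python sketch. The class and method names are hypothetical; the list stands in for the block RAM and the head attribute for the address register.

```python
class CircularBuffer:
    def __init__(self, size):
        self.data = [0] * size  # stands in for the block RAM
        self.head = 0           # stands in for the "head" address register

    def write(self, sample):
        """Store a sample at the head, then advance the head with wrap-around."""
        self.data[self.head] = sample
        self.head = (self.head + 1) % len(self.data)

    def oldest_to_newest(self):
        """After a write, the head points at the oldest entry in the buffer."""
        n = len(self.data)
        return [self.data[(self.head + i) % n] for i in range(n)]


buf = CircularBuffer(4)
for s in [1, 2, 3, 4, 5, 6]:
    buf.write(s)
print(buf.data, buf.head, buf.oldest_to_newest())
```

Once the buffer has wrapped, the two oldest samples (1 and 2) have been overwritten, and reading from the head onward walks the remaining samples in arrival order.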
Multiplier
The best way to implement a 16 by 16 multiplier on the FPGA is to use the dedicated multiplier circuitry. To access the dedicated multipliers, we will need to use MegaWizard. The multiplier megafunction is under “Arithmetic” -> “LPM_MULT”. In the “General” tab of the wizard, select 16 as the width of the “dataa” and “datab” inputs. In the “General2” tab, choose signed multiplication for “Multiplication type” and “Use the dedicated multiplier circuitry” for “Implementation”. In the “Pipelining” tab, choose no pipelining and default optimization. After you’ve made these selections, press “Finish”.
Generating ROM Data
The kernel ROM is just a 1-port ROM. This is easily generated from a megafunction. The hard part is figuring out what the ROM values should be. We want to use a low-pass windowed sinc filter. The values for this filter can be generated using the following C code.
There is one slight problem here though. These are floating-point numbers, but we are using integer multipliers in our hardware design. Since all of our weights are between 1 and -1, simply converting the floating-point weights to integers and using those clearly won’t work. We could switch to using a floating-point pipeline, but that would complicate our hardware quite a bit. Fortunately, there is an easy way around this problem. We can simply use fixed-point arithmetic. Basically, we want to scale all of our weights up, convert to integers, and then scale the result of the computation down by the same amount. So, if our original formula is
sum(kernel[i] * input[n - i]) for all i
With fixed point arithmetic that becomes
1/S * sum(to_int(S * kernel[i]) * input[n - i]) for all i
Where S is some constant. The results will be the same, except with a little loss of precision. Since our floating-point numbers are all between 1 and -1 and we want to convert to 16-bit integers, we can simply scale up by the largest signed 16-bit integer value (32767).
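Here is a rough Python sketch of that scaling trick, comparing a floating-point weighted sum against its fixed-point counterpart. The helper names are hypothetical, the samples are assumed to be already lined up with their weights, and rounding the final division is an assumption of this sketch (the hardware may scale the result down differently).

```python
SCALE = 32767  # largest signed 16-bit integer


def to_fixed(weights):
    """Scale floating-point weights in [-1, 1] up to 16-bit integers."""
    return [int(round(w * SCALE)) for w in weights]


def fir_float(weights, samples):
    """Reference result using floating-point weights."""
    return sum(w * x for w, x in zip(weights, samples))


def fir_fixed(int_weights, samples):
    """Integer multiply-accumulate, then scale the sum back down by SCALE."""
    acc = sum(w * x for w, x in zip(int_weights, samples))
    return int(round(acc / SCALE))


weights = [0.25, 0.5, 0.25]
samples = [1000, 2000, 3000]
exact = fir_float(weights, samples)
approx = fir_fixed(to_fixed(weights), samples)
print(exact, approx)
```

The two results agree to within one least-significant bit, which is the “little loss of precision” mentioned above.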
You can find the tool I wrote for generating kernels in the software directory of the git repo. The tool is called lowpass and takes two arguments, the critical frequency and the kernel width. The kernel is written out in big-endian binary format to standard output. I used a critical frequency of 880 Hz and a width of 50 (which gives a kernel size of 101).
Once you have the binary, you will need to convert it into Intel HEX format for the ROM initialization. I used the srecord tool to do this.
srec_cat lowpassfilter.bin -binary -o lowpassfilter.hex -intel
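If you want to experiment with kernels without building the C tool, here is a rough Python sketch of a windowed-sinc low-pass kernel, scaled to 16-bit integers and packed as big-endian words. The Hamming window, the normalization to unit DC gain, and all of the function names are assumptions made for this sketch; the lowpass tool in the repo may differ on each of these.

```python
import math
import struct

SCALE = 32767  # largest signed 16-bit integer


def lowpass_kernel(cutoff_hz, half_width, sample_rate=44100):
    """Windowed-sinc low-pass kernel with 2 * half_width + 1 taps."""
    fc = cutoff_hz / sample_rate  # cutoff as a fraction of the sample rate
    taps = []
    for i in range(-half_width, half_width + 1):
        if i == 0:
            sinc = 2 * fc
        else:
            sinc = math.sin(2 * math.pi * fc * i) / (math.pi * i)
        # Hamming window centred on i = 0 (an assumed window choice).
        window = 0.54 + 0.46 * math.cos(math.pi * i / half_width)
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalize to unit gain at DC


def pack_kernel(taps):
    """Scale to 16-bit integers and pack as big-endian signed words."""
    ints = [int(round(t * SCALE)) for t in taps]
    return struct.pack(">%dh" % len(ints), *ints)


kernel = lowpass_kernel(880, 50)
blob = pack_kernel(kernel)
print(len(kernel), len(blob))
```

Writing blob to a file in binary mode would give something in the same shape as the input to the srec_cat command above, though the exact values will not necessarily match the original tool's output.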
Filter Pipeline
Here is the implementation of the computational part of the FIR pipeline. It has read ports to the audio buffer and the kernel ROM. When reset, the audio address is started at the buffer “head” and the kernel address is started at 0. It then increments both addresses on each cycle, wrapping the audio address around once it reaches the end. The computation stops once the kernel address reaches the end. In this way, the audio samples are accessed from oldest to newest. If you’ve looked at the FIR filter formula given in the Wikipedia article, you will notice this is actually backwards. Fortunately, our low-pass filter kernel is symmetric, so it doesn’t matter which direction we go in. If you are trying to use an asymmetric filter (which is rather uncommon), you can simply reverse the order of the weights in the ROM.
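The point about traversal order can be illustrated with a short Python sketch. The two function names are hypothetical, and the five-sample window is just toy data; it compares the textbook pairing (weight 0 with the newest sample) against the oldest-first pairing described above.

```python
def textbook_fir(window, kernel):
    """window[-1] is the newest sample; kernel[i] pairs with the sample i steps back."""
    return sum(kernel[i] * window[-1 - i] for i in range(len(kernel)))


def oldest_first_fir(window, kernel):
    """kernel[0] pairs with the oldest sample, matching the address order above."""
    return sum(k * x for k, x in zip(kernel, window))


window = [3, -1, 4, 1, 5]      # oldest to newest
symmetric = [1, 2, 3, 2, 1]
asymmetric = [5, 4, 3, 2, 1]

same = oldest_first_fir(window, symmetric) == textbook_fir(window, symmetric)
differs = oldest_first_fir(window, asymmetric) != textbook_fir(window, asymmetric)
fixed = oldest_first_fir(window, asymmetric[::-1]) == textbook_fir(window, asymmetric)
print(same, differs, fixed)
```

For the symmetric kernel both orders agree, while the asymmetric kernel only matches the textbook result once its weights are reversed.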
One thing to note in the pipeline is the set of acc_en_* registers. The enable signal tells the accumulator when to add another multiplier result into the register and when to leave the register value the same. There are three registers here, each feeding into the next on each cycle. This is to keep the signal synchronized along the pipeline stages. On reset, the first multiplier result has not yet been computed, so the accumulator should definitely not be enabled. Similarly, when kernel_index reaches the last address, there is still computation occurring in the later stages of the pipeline, so we don’t want to disable the accumulator yet. Therefore, we need to put a register for the enable signal between each stage of the pipeline up to the accumulator. The acc_en_0 register is in the same stage as the kernel_index and audio_index registers. The acc_en_1 register is in the same stage as the internal register of the memories. Finally, the acc_en_2 register is in the same stage as mult_reg and is the enable input for the accumulator.
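To see why the enable signal has to travel down the pipeline alongside the data, here is a cycle-by-cycle Python model. It is a simplified sketch written for this explanation, not a translation of the actual hardware description: the register names follow the text, but the reset values and the exact stopping condition are assumptions.

```python
def run_pipeline(samples, kernel, head=0):
    """Model the three-stage pipeline one clock edge at a time."""
    n, size = len(kernel), len(samples)
    # Assumed state right after reset.
    kernel_index, audio_index = 0, head
    acc_en_0, acc_en_1, acc_en_2 = True, False, False
    sample_reg = weight_reg = mult_reg = acc = 0

    while acc_en_0 or acc_en_1 or acc_en_2:
        # Every next-state value is computed from the current registers.
        next_acc = acc + mult_reg if acc_en_2 else acc
        next_mult = sample_reg * weight_reg
        next_sample, next_weight = samples[audio_index], kernel[kernel_index]
        if kernel_index == n - 1:
            # Last address issued: stop enabling new work, hold the addresses.
            next_en_0 = False
            next_kernel_index, next_audio_index = kernel_index, audio_index
        else:
            next_en_0 = acc_en_0
            next_kernel_index = kernel_index + 1
            next_audio_index = (audio_index + 1) % size
        # Clock edge: all registers update together.
        acc, mult_reg = next_acc, next_mult
        sample_reg, weight_reg = next_sample, next_weight
        acc_en_2, acc_en_1, acc_en_0 = acc_en_1, acc_en_0, next_en_0
        kernel_index, audio_index = next_kernel_index, next_audio_index
    return acc


result = run_pipeline([1, 2, 3, 4], [1, 0, -1], head=2)
print(result)
```

Starting at head 2, the model reads samples 3, 4, and 1 (wrapping around) and accumulates 3 * 1 + 4 * 0 + 1 * -1. If acc_en_2 were driven directly from acc_en_0, the accumulator would start adding before the first product existed and stop before the last two products arrived.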
Hooking it Up
Now we need to connect the computational pipeline to the memory and add ports which can be connected to the audio codec.
Notice that there are two clocks here. That is because the audio codec is synchronized to an 11.2896 MHz clock, but we want the FIR computation to be performed as fast as possible (i.e. 50 MHz). The problem is that the audio codec and FIR filter will need to send signals to each other to tell when the audio data is valid. We are thus left with the problem of crossing clock domains. The standard way of solving this is to have two flip-flops back-to-back. Rising edges can then be detected by ANDing the first flip-flop with the complement of the second flip-flop (i.e. the level was high on the last cycle but low on the cycle before that). The cur_end and last_end registers perform this function for the sample_end input and the cur_done and last_done registers perform this function for the done output from the fir_filter module.
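The edge detection trick can be modelled in a few lines of Python. This is a software sketch only: the class name is hypothetical, each call to clock() stands for one cycle of the sampling clock, and real clock-domain effects such as metastability are not modelled.

```python
class EdgeDetector:
    def __init__(self):
        self.cur = 0   # first flip-flop: the level sampled on the last cycle
        self.last = 0  # second flip-flop: the level from the cycle before that

    def clock(self, level):
        """Advance one cycle and return 1 if a rising edge was just captured."""
        self.last, self.cur = self.cur, level
        return self.cur & ~self.last & 1


detector = EdgeDetector()
# A slow signal stays high for several cycles of the faster sampling clock.
levels = [0, 0, 1, 1, 1, 0, 1]
pulses = [detector.clock(level) for level in levels]
print(pulses)
```

Even though the input stays high for three cycles, the detector produces a pulse that lasts exactly one cycle, which is what we want for triggering the computation once per sample.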
Adding Delay
Simply playing a low-pass filtered version of the input on the output won’t sound very interesting. Since you’ll still hear the input, and the output is simply the input with some frequencies attenuated, the input will drown out the output. What would be more interesting is if we added some delay to the output. This will create a sort of echo.
To create a delay, you will need to save the output samples in a buffer and only start pulling them out after a certain number of samples have been written. Sound familiar? That’s right, we want a circular buffer again. This time it’s a bit simpler, since we only ever read the “oldest” value in the buffer. Therefore, we can make life easier for ourselves by using the FIFO megafunction. It’s under “Memory Compiler” -> “FIFO” in MegaWizard.
In the first tab, choose 16 bits for the width and some large number for the depth. There are 44100 samples in a second, so a FIFO that is N deep will give you an N / 44100 second delay. I used 4096, which corresponds to about 93 milliseconds. In the same tab, make sure reading and writing are synchronized to the same clock. In the SCFIFO tab, make sure only “empty” and “full” are selected. In the “Rdreq Option, Blk Type” tab, choose “Normal synchronous FIFO mode” and “M10K” for the block type. In the “Optimization, Circuitry Protection” tab, choose “No” for registering the output. You should also disable the overflow and underflow checking. We will add our own logic to make sure we don’t write to a full buffer or read from an empty one.
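If you want to pick a different delay, the depth arithmetic is easy to script. The function names below are made up for this sketch, and restricting the depth to powers of two is an assumption about what the wizard offers rather than something stated above.

```python
SAMPLE_RATE_HZ = 44_100


def delay_ms(depth):
    """Delay in milliseconds produced by a FIFO holding `depth` samples."""
    return 1000.0 * depth / SAMPLE_RATE_HZ


def depth_for_delay(ms):
    """Smallest power-of-two depth giving at least the requested delay."""
    needed = ms * SAMPLE_RATE_HZ / 1000.0
    depth = 1
    while depth < needed:
        depth *= 2
    return depth


print(round(delay_ms(4096)), depth_for_delay(250))
```

For example, a quarter-second echo would need 11025 samples, which rounds up to a depth of 16384.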
Putting it all Together
Now that we have the filter and FIFO created, we can modify our audio_effects module from last time.
The key changes are that we added another setting to our multiplexer and some logic to read from the FIFO on sample_req and write to the FIFO when the FIR filter finishes. We only read when the FIFO is full, otherwise we wouldn’t have any delay.
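The read-only-when-full rule is what turns the FIFO into a fixed delay, and a small Python model makes that visible. The class name is hypothetical, and outputting 0 while the FIFO is still filling is an assumption of this sketch rather than a description of what the hardware does.

```python
from collections import deque


class DelayLine:
    def __init__(self, depth):
        self.depth = depth
        self.fifo = deque()

    def step(self, sample):
        """One sample period: read if the FIFO is full, then write the new sample."""
        if len(self.fifo) == self.depth:
            out = self.fifo.popleft()  # read only when full
        else:
            out = 0                    # assumed output while still filling
        if len(self.fifo) < self.depth:
            self.fifo.append(sample)   # never write to a full FIFO
        return out


line = DelayLine(3)
outputs = [line.step(s) for s in [1, 2, 3, 4, 5, 6]]
print(outputs)
```

Each sample comes back out exactly three steps after it went in, so the delay in samples equals the FIFO depth.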
Conclusion
If you program this description onto the board, hook up a mic and speakers, and flip switch 2, you should get an interesting echo effect. Try playing around with the filter weights, filter size, and delay time to see how they affect your perception of the sound.