JFFT A tool for generating synthesizeable verilog for streaming FFTs and polyphase filter banks (PFBs). http://www.mock.com/setistuff/jfft Jeff Mock 2030 Gough St. SF, CA 94109 jeff@mock.com First Release 0.0.1 6/23/4 Second Release 0.0.2 7/1/4 Added --cm3 3-multipler complex multiply optimization Added --dec decimated FFT artchitecture Third Release 0.0.3 7/13/4 Added PFB FIR frontend to FFT Added nice plots to makefiles Added example directory for 32-point PFB Fourth release 0.0.4 8/3/4 Added feature to generate programmable downshift in FFT/PFB Added example directory to generate decimating lowpass filter Added example directory to generate complex mixer Added overflow flag. This is not a rigorous overflow check. It checks the butterfly operation for overflow, but does not check rounding or the complex twiddle multiply for overflow. The overflow output does not distinguish between the two channels in the FFT. This would require a lot of delay logic, so instead this is a general fuzzy "something is wrong someplace" flag. In decimated FFTs the flag is valid every clock cycle and should be checked every clock. (Simulation does not currently do this). JFFT is a collection of perl scripts for generating verilog streaming-FFTs that perform an N-point complex FFT every N-cycles. JFFT will also optionally add an FIR filter to the frontend of the FFT to create a polyphase filter bank (PFB). JFFT implements the FFT using the R2MDC technique. An artifact of this method is that the hardware is utilized during only 50% of the clock cycles. JFFT fully utilizes the hardware to simultaneously perform two independent N-point complex FFTs every N-cycles. JFFT has been tested for data widths from 2..32 bits and FFT lengths from 2-points to 32678-points. When targeting for virtex2, the FFT width can currently be no wider than 18-bits and the FFT length no longer than 32768-points. JFFT's strong points are: Verified against FFTW library, "if it matches FFTW it must be right..." Generates verilog for a wide range of data widths and FFT lengths. Tools and implementation are open source. Verilog is well optimized for virtex2. Generates a structural pure verilog implementation or Xilinx virtex-2 specific instances. Generated verilog is intended to be easy to read. Modular method for using a specific target technology. It currently only targets Xilinx, but it is intended to be easy to add support for another target technology. Implementation trade-off are transparent, easy to modify parameters for different design needs. Tools needed The verilog is generated using perl. Version 5.8.0 was used for development, but earlier versions will likely work fine. Development was done on SUSE Linux, but other platforms will likely work fine. Only standard perl modules are used. Verilog simulation is done using cver, which is now availble under GPL from http://www.pragmatic-c.com. cver works much like Verilog-XL, other verilog simulators should work fine. I am currently using cver version 1.10i. JFFT will generate a verilog behavioral model that is pure verilog and relies on no external libraries, this is a good way to get started. JFFT will also target Xilinx virtex2 FPGAs and generates instances of virtex2 primitive from the Xilinx libraries. You need the Xilinx libraries if you want to target virtex2 parts. The verification suite uses FFTW-3.x.x to verify the verilog operation. FFTW is free software avaiable at fftw.org. The test suite builds a C program using GCC and the FFTW library to make the test suite. Any reasonably modern version of GCC should work fine. I am using FFTW version 3.0.1. The makefiles include some facilities for making frequency response plots using gnuplot. I'm using version 3.7 of gnuplot. Directory organization The perl scripts and verilog test benches are stored in the src directory. Most of the scripts are not intended to be run directly. A user will typically run "mkjftt" with proper options and it will run other scripts to generate verilog. The src directory also contains Makefile.def, this defines some common terms and default values for Makefiles in the other directories. You might want to edit this or use it as a template for a project using jfft. The sim directory contains a Makefile for building verilog for an FFT and running a simulation suite against it. The build directory contains a Makefile for synthesizing the FFT for virtex2 using Xilinx's XST. The output is a Xilinx NGC netlist, a dummy verilog black blox declaration for the FFT, and a verilog gate level description of the FFT for simulation in larger system. The output of the build directory can be used to provide a separately synthesized FFT module for a larger project. The sim_r directory is nearly the same as the sim directory, but it runs the same test suite against the post-synthesis gate level netlist generated in the build directory. The examples directory contains a couple of sample implementations. galfa_n is a decimated 8291-pt FFT, galfa_w is a non-decimated 256-pt FFT. Copying one of these is a good starting point for a project. Running JFFT The sim directory is a canonical example for running jfft to generate (and test) verilog for an FFT. Use this as a starting point. The ${XFFT} target is the generates verilog for the FFT. mkjfft is run something like: perl ../blah/mkjfft --imp=virtex2 --n=32 --width=18 --prefix=test --odir=gen --rnd --reorder The full list of options can be seen by running with -? or just looking at the perl script. You options will cary depending on your application. These are the main options. --imp Sets the target implementation, the only targets are virtex2 or behave. The virtex2 implementation instatiates primitives for memory and multipliers that is pretty well optimized, but you need the Xilinx verilog library to simulate. The 'behave' implementation generates pure verilog with no dependence on a particular library. --n Number of points in FFT --width Width of FFT data items --prefix Name of generated module, sub-modules are prefixed with this name, this is useful if a chip has multiple FFTs, it keeps the name space separate --odir Directory to place generated verilog files --rnd When downshifting between stages round the result of the previous stage. --dshift Downshift result every N stages. The useful values are "0" (never), "1" (downshift before every stage), "2" (downshift before every other stage). --reroder By default the output of the FFT is in a bit reversed order alternating between blocks of the two FFTs being performed. This options reorders the output into to streams sorted in frequency bin order with the two FFTs phase aligned. This will generate a module that can be synthesized, but it uses quite a lot of memory. It's use isn't reccomended for synthesis for large FFTs, it is primarily intended for synthesis. --dec Generate a decimated FFT. The top level module adds a clock enable pin "ce" for an signal rate of less than or equal to log2(n). See decimated section below. --cm3 Generate complex multiplies with a technique that uses 3 multipliers and 5 adders rather than the canonical complex multipler with 4 multipliers and 2 adders. This is sometimes a good idea on large FFTs on virtex2. --iwidth Set the width of the input to the FFT. This must be less than or equal to the --width parameter. In the case of an FFT the narrower input to expanded to the width of the FFT and this is just a notational convenience. For a PFB, this is the width of the delay datapath of the PFB (which can be quite a lot of memory). By default this is the --width parameter. These options are for generating an FIR filter on the frontend of the FFT for implementing a polyphase filter (PFB). --pfb Add an FIR filter to the frontend of the FFT to generate a PFB. --cwidth Set the width of the coefficient memories for the FIR filter. By default this is the input width, --iwidth. --ovl Set the overlap of the FIR filter. There is 4*(ovl-1)*N*(iwidth*2) bits of memory for the delay path of the FIR filter. --window Set the windowing function for the filter. This can be rect, hamming, hanning, blackman, or bartlett. --narrow Scale the sinc() function of the filter (default is 1.0). Making this value greater than one makes the bin narrower increasing the cutoff on the boundary of bins. Running mkjfft will generate a collection of verilog files. The top level is called ${prefix} and is stored in the file ${prefix}.v. Naming conventions Here is a short dictionary of abbreviations used in JFFT *_re The real portion of a signal *_im The imaginary portion of a signal *_bf The FFT butterfly operation *_cm A complex multiply (a+bj)*(c+dj) *_delay_n A module that delays the input by n cycles *_mult A parallel 2's comp multiplier *_stage_n Stage n of the FFT operation. Stage 0 is the last stage with a simple butterfly. *_sync_n Delay sync pulse be n clock cycles *_twiddle_n Module to generate complex roots of 1, the FFT twiddle factors. JFFT Interface The typical interface to an FFT generated by JFFT looks like: module xfft ( ck, reset, sync_i, sync_o, a_re, a_im, b_re, b_im, ovf, x_re, x_im, y_re, y_im ); parameter width=18; parameter iwidth=8; input ck; input reset; input sync_i; output sync_o; input [iwidth-1:0] a_re; input [iwidth-1:0] a_im; input [iwidth-1:0] b_re; input [iwidth-1:0] b_im; output ovf; output [width-1:0] x_re; output [width-1:0] x_im; output [width-1:0] y_re; output [width-1:0] y_im; endmodule The sync_i pulse is active the same clock cycle as the first sample of a block. The two input channels are (a_re,a_im) and (b_re,b_im). An FFT is performed on both the A input and B input simultaneously. The sync_o pulse is active when the first data from a computed block is on (x_re,x_im) and (y_re, y_im). If --reorder is specified when the verilog is generated, the FFT of channel A is on the X outputs, the FFT of channel B is on the Y outputs. If --reorder is not specified, there are two blocks of N/2 samples. The first N/2 clocks beginning with a sync_o pulse contain the bit-reversed FFT of the A input across both the X&Y outputs. The following N/2 clocks contain the FFT of the B input in bit-reverse order on both the X&Y output. It is probably worth generating and FFT with the --reorder option and looking at the verilog for reorder.v to understand the natural bit reverse order of the output. FFTs are performed every N cycles and a sync_i pulse should be asserted every N cycles. In return, the sync_o pulse will be asserted every N cycles. --cm3 optimization The complex multiply in the FFT butterfly is normally done in the canonical fashion with 4 multiplers and 2 adders: (A + Bj) * (C + Dj) = (AC - BD) + (BC + AD)j It is also possible to rearrange these terms to get the same result with 3-multipliers and 5 adders: (AC - BD) == (A+B)(C-D) - BC + AD (BC + AD) == BC + AD This has a slightly different numerical behavior than the canonical method, but should be good for most applications and requires 25% fewer multipliers. --dec decimated FFTs If the input signal is decimated it might be possible to save a lot of hardware with the --dec option. This adds a clock enable pin "ce", the input signal arrives at a decimated rate with the clock enable asserted on the cycle when ce is asserted. sync_i is only sampled on cycles when ce is asserted. For a decimated FFT, the ce signal must be asserted no faster than once every log2(n) clock cycles. For example, with a 1024 point FFT an input sample can arrive no faster than once every 10 clocks. There is also a lower bound on the clock rate for the computation pipeline. Regardless of the FFT length, the decimation rate can be no less than 7 clocks for regular complex multiplies or 8 clocks for a --cm3 complex multiplier. For example, with a 64-point FFT, a new sample can arrive no faster than every 7 (or 8) clocks. When an FFT is generated with the --dec option, the top level contains a notice about maximum decimation rates. If these limitations are acceptable, the number of multipliers is greatly reduced. The delay memory requirements are identical (~2N), but the number of multipliers goes from 4*log2(N-2) (or 3*log2(N-2) for --cm3) to a constant 4 (or 3 for --cm3). The multipliers are multiplexed across the log2(N) ranks at the system clock rate. There is also a savings in memory for the twiddle tables. Rather than a table for each rank of the FFT there is a single twiddle table multiplexed across the ranks along with the multiplers. The results of a --dec FFT are very slightly different than an non-decimated FFT. In a non-decimated FFT, there are some multiplier optimizations in the last two ranks. In the last rank the twiddle factors are always 1.0 so no multiply is done. In the next to last rank the twiddle factors are 1 and -j. These are accomplished with multiplexors and an adder an no multiplier is used. In a --dec FFT there is a single twiddle table multiplexed across all ranks of the FFT. In this table 1.0 is represented in binary as 011111... which is just slightly less than 1. As a result, the computations in the last two ranks aren't numerically identical to the non-decimated FFT and differences appear in the LSBs. PFBs Aaron Parons and Dan Werthimer implemented a PFB generator similar to this package: http://setiathome.ssl.berkeley.edu/~aparsons/pfb/pfb_32.html I didn't like being tied to a piece of proprietary software, the VHDL code generation, and being tied to Xilinx FPGAs, so I implemented this package. When used has a multi-channel filter, a PFB has a lot of advantages over an FFT. An FFT has quite a lot of channel-to-channel leakage. A frequency sweep of an FFT is shown in fftex.png. Not that that adjacent channel leakage can be as great as -15dB. The same channel for a 32-point PFB (8x overlap) is shown in pfbex.png. Note the greatly improved leakage to other channels. Most of the frequency bins are shown in pfbex2.png. Targets for making these images are in the sim directory. Number formats All numbers are assumed to be 2's complement fixed point numbers with a sign bit followed by a decimal point and N-1 bits of fraction. The number range is [-1 .. 0.9999] Sync details << The latency from sync_i to sync_o is difficult to calculate, it depends on rounding, the amount of downshifting, and the length of the FFT, and maybe the latency of the complex multiplier if I add an optional optimization here >> After reset the FFT is unsynced. The sync_i signal should be asserted on the first cycle of a block. The output of the first block after sync_i is asserted is undetermined. The second block will properly calculate the FFT of the input signal. If the sync_i is stopped, the FFT returns to the unsynced state and the first block after a proper sync_i is undetermined. Optimizations JFFT is intended to generate efficient verilog, particularly for virtex2. Optimizations include: Last rank of FFT (stage 0) has no multiply (twiddle values are always 1). Next to last rank of FFT (stage 1) has no multiply (twiddle values are 1 and -j) Three strategies for twiddle tables depending on table size: Two separate tables from 0-pi for small tables One dual-port table from 0-pi with extra logic for sin/cos One dual-port tale from 0-pi/2 for extra logic for sin/cos Twiddle tables with 64 values or less are stored in CLBs, larger tables are stored in BLOCKRAM in virtex2. Datapath delays of 128 or less are stored in CLBs, longer delays are stored in BLOCKRAM. Parity bits of blockram are used where possible for FFT widths like 9-bits or 18-bits. For decimated FFT, there are only 3 (or 4) multipliers that are multiplex across all of the ranks of the FFT. There is only one twiddle table stored for decimated FFTs. The PFB frontend is not as well optimized as the FFT. The filter is a direct implemenation of an FIR filter. There are four copies of the filter (re/im, 2-polarizations). There are 'overlap-1' delay memories of length N (FFT length). Each memory consists of storage for re/im and 2 polarizations. The total number of bits of delay memory in the PFB FIR filter is (overlap-1)*4*width*N. There are 4*overlap multipliers. The coefficient ROMs for the filter consist of 'overlap' memories containing N real coefficients. These coefficients are shared across all four filters. The coefficients are about 20% of the memory, the delay memories are about 80% of the memory. This can vary a bit depending on the width of the coefficient memory and the number of overlaps. This is pretty efficient for a non-decimated clock. No attempt is made to optimize for a decimated clock rate. The memory cannot be reduce, but the number of multipliers can be greatly reduced. There is also symmetry in the coefficient ROMs that can be exploited, but in a large filter most of the chip resource is commited to delay memory that cannot be reduced.