Timing and Resource Results

Synthesis Results

The core loop in the 2D convolution accelerator iterates over the pixels in each row. Each iteration performs one step of the systolic array, outputting one kernel-window product and shifting in the data for the next column. The C synthesis report confirms that the loop was pipelined with II=1.

metric	value
PipelineII	1
PipelineDepth	6
TripCountMin	1
TripCountMax	512
LatencyMin	5
LatencyMax	516
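
The latency bounds in the table are consistent with a common model for a pipelined loop, in which the first result needs the full pipeline depth and each further iteration adds II cycles. The sketch below checks that model against the reported values. The formula, including its one-cycle offset, is an assumption chosen because it matches this report, not something the report states, and `pipelined_loop_latency` is a hypothetical helper name.

```python
def pipelined_loop_latency(trip_count: int, ii: int, depth: int) -> int:
    """Cycles for a pipelined loop under the assumed model
    latency = (trip_count - 1) * II + depth - 1."""
    return (trip_count - 1) * ii + depth - 1


# Values taken from the synthesis report table above.
II, DEPTH = 1, 6
print(pipelined_loop_latency(1, II, DEPTH))    # 5, matches LatencyMin
print(pipelined_loop_latency(512, II, DEPTH))  # 516, matches LatencyMax
```

With II=1 the depth only adds a fixed cost of a few cycles, so for long rows the loop latency is dominated by the trip count.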

Timing Analysis

The key timing analysis metrics are below:

metric	value
shape	16 x 128
clock period	10 ns
latency	47260 ns
cycles	4726
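
As a quick consistency check, the cycle count follows from the latency and the clock period, and dividing by the image size gives an average cost per pixel. The snippet below is a minimal sketch that assumes the shape is rows x columns and that the reported latency covers the whole 16 x 128 image; the variable names are illustrative only.

```python
rows, cols = 16, 128      # shape from the table (assumed rows x columns)
clock_period_ns = 10
latency_ns = 47260

cycles = latency_ns // clock_period_ns
cycles_per_pixel = cycles / (rows * cols)

print(cycles)                      # 4726, matching the reported cycle count
print(round(cycles_per_pixel, 2))  # 2.31 cycles per pixel overall
```

The overall figure is above one cycle per pixel, which suggests that the total latency includes more than the II=1 compute loop alone.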

stage	nrows	median_time	mean_time
load	17	460	510.588
compute	17	1340	1261.18
store	17	610	574.118

The following are key points:

Compute time is consistent with a pipeline with II=1, which produces approximately one column per cycle.
Load and store times are consistent with reading and writing 4 pixels per cycle (pixels are 8 bits and the AXI4 memory interface is 32 bits).
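
The two points above can be checked with simple arithmetic on the median stage times. The sketch below assumes the times are in ns (they are all multiples of the 10 ns clock period), that each row has 128 pixels, and that the compute loop has the II of 1 and pipeline depth of 6 reported by synthesis. The per-row cycle models are simplifications for illustration, not measured values.

```python
CLOCK_NS = 10
COLS = 128                  # pixels per row, from the 16 x 128 shape
PIXELS_PER_WORD = 32 // 8   # 32-bit AXI4 word / 8-bit pixel
II, DEPTH = 1, 6            # from the synthesis report

median_ns = {"load": 460, "compute": 1340, "store": 610}
median_cycles = {stage: t // CLOCK_NS for stage, t in median_ns.items()}

# Simplified per-row models: a pipelined loop over the columns for compute,
# and four pixels per cycle for the memory transfers.
model_cycles = {
    "load": COLS // PIXELS_PER_WORD,
    "compute": (COLS - 1) * II + DEPTH - 1,
    "store": COLS // PIXELS_PER_WORD,
}

for stage in median_ns:
    extra = median_cycles[stage] - model_cycles[stage]
    print(f"{stage}: {median_cycles[stage]} cycles, "
          f"model {model_cycles[stage]}, extra {extra}")
# load: 46 cycles, model 32, extra 14
# compute: 134 cycles, model 132, extra 2
# store: 61 cycles, model 32, extra 29
```

Under these assumptions the compute stage sits within a couple of cycles of the model, while load and store carry some extra cycles that this model does not account for.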

Resource Usage

The kernel was synthesized for an xc7z020clg484-1 Xilinx FPGA, the FPGA on a student Pynq-Z2 board. The per-module and total usage is as follows:

Module	BRAM_18K	DSP	FF	LUT
read_axi4_stream_impl	0	0	647	159
write_axi4_stream_32_s	0	0	2	25
conv2d_Pipeline_VITIS_LOOP_617_1	0	0	77	188
conv2d_Pipeline_VITIS_LOOP_178_3	0	0	12	52
conv2d_Pipeline_clear_bottom_padding_row	0	0	18	73
conv2d_Pipeline_VITIS_LOOP_617_11	0	0	105	231
conv2d_Pipeline_convolve_row	0	0	805	2290
conv2d_Pipeline_VITIS_LOOP_658_1	0	0	121	363
conv2d	12	0	4556	8551
Total	12	0	4556	8551
Available	280	220	106400	53200

Some key points:

The conv2d module itself is small, implying that, for such a simple kernel, the interconnect to the kernel is the bottleneck.
The design also uses only a small fraction of the available resources: about 16% of the LUTs and under 5% of the FFs and BRAM. The low utilization implies that greater unrolling or larger kernels (and hence greater throughput) would be easily possible.
No DSP48 slices are used, likely because the multiplications are only 8 bits wide and were mapped to LUTs.
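
The utilization behind these points can be computed directly from the table. The sketch below also makes a rough headroom estimate by asking how many additional copies of the `conv2d_Pipeline_convolve_row` pipeline would fit in the unused FFs and LUTs. That estimate is an illustration under strong assumptions: it ignores routing, any extra buffering, memory bandwidth, and timing closure, so it is only an upper bound on logic capacity.

```python
# Figures copied from the resource table above.
available = {"BRAM_18K": 280, "DSP": 220, "FF": 106400, "LUT": 53200}
total_used = {"BRAM_18K": 12, "DSP": 0, "FF": 4556, "LUT": 8551}
convolve_row = {"BRAM_18K": 0, "DSP": 0, "FF": 805, "LUT": 2290}

utilization_pct = {
    res: round(100 * total_used[res] / available[res], 1) for res in available
}
print(utilization_pct)
# {'BRAM_18K': 4.3, 'DSP': 0.0, 'FF': 4.3, 'LUT': 16.1}

# Rough headroom: extra copies of the row pipeline that the unused resources
# could hold, limited by the scarcest resource the pipeline actually uses.
extra_copies = min(
    (available[res] - total_used[res]) // convolve_row[res]
    for res in available
    if convolve_row[res] > 0
)
print(extra_copies)  # 19, limited by LUTs
```

Since the first point identifies the interconnect as the bottleneck, any such scaling would only raise throughput if the data movement were widened as well.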