Loop Optimization and Pipelining Questions

Question 1. Two Cycle Design

Consider the following SystemVerilog code that computes an output y based on inputs x1 and x2 in a single cycle:

always_ff @(posedge clk) begin
    xreg1 <= x1;
    xreg2 <= x2;
end
always_comb begin
   z1 = f1(xreg1, xreg2);  // delay = 3.8 ns
   z2 = f2(xreg1, xreg2);  // delay = 2.1 ns
   z3 = f3(xreg1, z2);     // delay = 1.5 ns
   y = f4(z1, z3);         // delay = 1.2 ns
end

What is the critical path delay of this single cycle design?
Rewrite the code inserting pipeline registers on z1, z2, xreg1 to achieve a two-cycle design.
What is the critical path delay of the two-cycle design?

Question 2. Pipeline cycle values

Consider the following SystemVerilog pipeline:

always_ff @(posedge clk) begin
    x_s0 <= x;
    z1_s1 <= 10*x_s0;
    if (z1_s1 > 100) begin
        z2_s2 <= 150 - (z1_s1>>>1);
    end else begin
        z2_s2 <= z1_s1;;
    end
end
always_comb begin
    y = (3*z2_s2>>>2);
end

which has input x and output y. Suppose the inputs x on cycles 0, 1, and 2 are \[ 3, 6, 12 \] and are unknown (?) on later cycles. Write the values of the signals x_s0, z1_s1, z2_s2, and y for cycles 0 through 5. Assume that all registers are initialized to ? at cycle 0 (unknown).

Question 3. Pipelined maximum

Given an input x_0, we wish to create a hardware module that computes the maximum

\[ max2 = \text{max}(x_0, f(x_0), f(f(x_0))). \]

We are given a SystemVerilog combinational function that implements f

z = f(x);

Write a pipelined SystemVerilog code for this module with II=1. On the first stage, only register the input -- do not do any further computation. You will need to then insert registers between stages, but each stage can have at most one instance of the function f. The final output should be combinational.

Question 4. Loop carry dependency

Consider the following two loops. Each loop contains a loop-carried dependency that affects the minimum possible initiation interval (II) when pipelined in Vitis HLS.

Loop A

sum = 0;
ind = 0;
for (int i = 0; i < 100; i++) {
    sum += x[ind];
    ind = ind_next[ind];
}

Loop B

sum = 0;
ind = 0;
for (int i = 0; i < 100; i++) {
    sum += x[ind];
    if (sum > 0)
        ind = ind_next_pos[ind];
    else
        ind = ind_next_neg[ind];
}

Assume the following operation latencies:

A memory read takes 1 cycle.
An addition takes 1 cycle.
A comparison and mux (the if selection) take 1 cycle.

For each loop, one way to compute the minimum initiation interval (II) is to follow these steps:

List the operations performed in one iteration of the loop. For example, a step could be xi = x[ind] corresponding to loading the values from the array or sum += xi corresponding to the addition operation.
Assign the operations to pipeline stages, assuming each operation takes one cycle. Write your answer in the form:
Stage 0: ... Stage 1: ... Stage 2: .... For example, in Loop A, one possible assignment is:
Stage 0: xi = x[ind];, Stage 1: sum += xi;, ...
Identify any loop-carried dependencies.. For example, Stage 0 of iteration \(i\) depends on the value of ind from iteration \(i-1\).
Compute the minimum II as the number of cycles required before the next iteration can begin. For example, if Stage 0 of iteration \(i\) depends on Stage 2 of iteration \(i-1\), the minimum II is 3 cycles.

Using this procedure, determine the minimum II for:

Loop A
Loop B

Question 5. Wide band memory

Consider the following loop, where x is stored in a single-port memory (only one read per cycle is allowed):

int x[3*n], d[n];
for (int i = 0; i < n; i++) {
    d[i] = x[3*i]   * x[3*i] +
        x[3*i+1] * x[3*i+1] +
        x[3*i+2] * x[3*i+2];
}

Each loop iteration requires three reads from x: indices 3*i, 3*i+1, and 3*i+2. Assume:

Each memory read takes 1 cycle.
The arithmetic operations are fully pipelined and do not limit II.
Only one read from x may occur per cycle.

(a) What is the minimum possible initiation interval (II) for this loop?

(b) Now suppose the data is stored in a packed format using a 96‑bit word:

ap_uint<96> xpacked[n];

Each 96‑bit element contains the three 32‑bit values previously stored at x[3*i], x[3*i+1], and x[3*i+2]. Rewrite the loop using xpacked and determine the new minimum II, assuming the same single-port memory. You can use the syntax p.range(31, 0) to extract the lower 32 bits of a 96‑bit variable p, and similar code to extract the other 32-bit variables.

Question 6. Moving average

Consider the following 4-tap moving average computation:

for (i = 0; i < ny; i++) {
    yi = 0;
    for (j = 0; j < 4; j++) {
        yi += x[i+j];
    }
    y[i] = (yi >> 2);
}

Assume that x is stored in a single-port memory, so only one read from x may occur per cycle.

(a) What is the minimum possible initiation interval (II) of the outer loop?

(b) Suppose the inner loop is fully unrolled using #pragma HLS UNROLL. Does this change the minimum II? Explain why or why not.

(c) Now consider using a delay line to reduce the number of memory reads per iteration:

int xdly[4];
#pragma HLS ARRAY_PARTITION variable=xdly complete

Rewrite the loop using the delay line so that the outer loop can achieve II = 1. Your rewritten loop should read only one new value from x per iteration.

Question 7. Row-column additions

Consider the following declaration:

float X[512][4];
#pragma HLS ARRAY_PARTITION variable=X dim=2 factor=4 cyclic

This pragma partitions the second dimension (dim=2) into 4 separate memory banks. Thus, each column of X is stored in a different bank. This means that values X[i0][0], X[i1][1], X[i2][2], and X[i3][3] may all be read in the same clock cycle, because they come from different banks.

Write a pipelined loop that computes the sum of each row. Your loop should run in approximately 512 cycles + pipeline latency.
Write a pipelined loop that computes the sum of each column. Your loop should also run in approximately 512 cycles + pipeline latency. Hint: you will need four independent accumulators.

Question 8. Memory with unrolling

A Vitis HLS kernel contains a loop that may be unrolled for performance:

float x[1024], y[1024];

for (int i = 0; i < 1024; i++) {
#pragma HLS UNROLL factor=UF
    y[i] = g(x[i]);
}

To support unrolling by a factor UF, the arrays must be partitioned into UF independent memory banks. Assume Vitis maps each bank into separate BRAM blocks with the following constraints:

Each BRAM block is 18 kbits.
Each array element is 32 bits (4 bytes).
Different banks must use different BRAM blocks, and arrays x and y must not share BRAM blocks.

How many BRAM18 blocks are required to store the arrays x and y for each of the following unroll factors:

UF = 1
UF = 2
UF = 4