Fixed- and Floating-Point Number Questions

Question 1. Custom floating-point decoding

Consider a simple floating‑point representation \( x = A \, (-1)^s \, c \, 2^e \), where the scale factor is \(A = 2^{-3}\). The fields are:

sign bit \(s\)
3‑bit unsigned exponent \(e\)
5‑bit mantissa \(c\), interpreted as an unsigned integer (no implicit leading 1)

The bits are packed with the sign bit first, followed by the exponent bits, then the mantissa bits. What decimal values are represented by the following bit patterns?

0 000 10000
1 011 11000
0 101 00101
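To check hand-decoded values, the format above can be modelled in a few lines of Python. This is only a sketch: the function name `decode` and the space-separated string input are assumptions made for illustration, and the example pattern is deliberately not one of the three listed above.

```python
def decode(bits: str) -> float:
    """Decode 's eee ccccc' as x = A * (-1)**s * c * 2**e with A = 2**-3."""
    b = bits.replace(" ", "")
    if len(b) != 9 or set(b) - {"0", "1"}:
        raise ValueError("expected 9 binary digits")
    s = int(b[0], 2)      # sign bit comes first
    e = int(b[1:4], 2)    # 3-bit unsigned exponent
    c = int(b[4:], 2)     # 5-bit unsigned mantissa, no implicit leading 1
    return 2.0**-3 * (-1)**s * c * 2**e


print(decode("0 001 00100"))  # 1.0  (0.125 * 4 * 2)
```

Because there is no implicit leading 1, a mantissa of zero decodes to zero for every exponent.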

Question 2. Custom floating-point encoding

Consider a custom floating-point format \[ \hat{x} = (-1)^s \left(1 + \frac{c}{2^p}\right) 2^{e - 4}, \] where:

\(s\) is a 1-bit sign (\(s \in \{0,1\}\)),
\(c\) is a \(p = 5\)-bit unsigned mantissa field (\(c \in \{0,\dots,31\}\)),
\(e\) is a 3-bit unsigned exponent field (\(e \in \{0,\dots,7\}\)).

For each real number \(x\) below, choose \(s\), \(c\), and \(e\) so that \(\hat{x}\) is as close as possible to \(x\). Give your final answers as \((s, c, e)\) and the corresponding \(\hat{x}\) in decimal.

\(x = 3.0\)
\(x = 0.5\)
\(x = -5.0\)
\(x = 1.3\)
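Since the format has only 512 codes, a hand-derived encoding can be checked by exhaustive search. The sketch below assumes the formula given above; the names `fp_value` and `encode_nearest` are hypothetical, and the sample input is not one of the values listed in the question. If two codes are equally close, `min` returns the first one found, which may not match a particular tie-breaking rule.

```python
def fp_value(s: int, c: int, e: int, p: int = 5, bias: int = 4) -> float:
    """Value of the code (s, c, e): (-1)**s * (1 + c / 2**p) * 2**(e - bias)."""
    return (-1)**s * (1 + c / 2**p) * 2.0**(e - bias)


def encode_nearest(x: float) -> tuple:
    """Return the (s, c, e) whose value is closest to x, by brute force."""
    candidates = ((s, c, e) for s in (0, 1) for c in range(32) for e in range(8))
    return min(candidates, key=lambda sce: abs(fp_value(*sce) - x))


s, c, e = encode_nearest(2.5)
print((s, c, e), fp_value(s, c, e))  # (0, 8, 5) 2.5
```

Here 2.5 = 1.25 * 2^1, so the mantissa field is 0.25 * 32 = 8 and the exponent field is 1 + 4 = 5.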

Question 3. Fixed-point linear approximation

You wish to approximate the floating‑point equation \[ y = a x + b \] using Qm.n fixed‑point arithmetic.

Let \(a = 0.3125\), \(b = -1.75\), and choose the format Q3.4 (3 integer bits including sign, 4 fractional bits).

Convert \(a\) and \(b\) into their Q3.4 integer representations.
Write SystemVerilog code that computes an approximation of \(y\) using only Q3.4 arithmetic.
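Before writing the hardware description, it can help to have a bit-accurate software model to compare against. The Python sketch below makes several assumptions not stated in the question: \(x\) is also in Q3.4, the product is brought back to 4 fractional bits with an arithmetic right shift (which rounds toward negative infinity), and overflow of the final sum is not handled. The helper names and the sample coefficients are illustrative and differ from the values in the question.

```python
FRAC_BITS = 4    # Q3.4: 4 fractional bits
TOTAL_BITS = 7   # 3 integer bits (including sign) + 4 fractional bits


def to_q(value: float) -> int:
    """Convert a real number to its Q3.4 integer representation."""
    q = round(value * (1 << FRAC_BITS))
    lo, hi = -(1 << (TOTAL_BITS - 1)), (1 << (TOTAL_BITS - 1)) - 1
    if not lo <= q <= hi:
        raise OverflowError(f"{value} does not fit in Q3.4")
    return q


def from_q(q: int) -> float:
    """Convert a Q3.4 integer back to a real number."""
    return q / (1 << FRAC_BITS)


def linear_q(a_q: int, x_q: int, b_q: int) -> int:
    """Model y = a*x + b in Q3.4 (assumed: truncating shift, no overflow check)."""
    prod = a_q * x_q                    # full-precision product has 8 fractional bits
    return (prod >> FRAC_BITS) + b_q    # drop 4 fractional bits, then add b


a_q, b_q, x_q = to_q(0.75), to_q(-2.5), to_q(1.5)
print(a_q, b_q, x_q)                        # 12 -40 24
print(from_q(linear_q(a_q, x_q, b_q)))      # -1.375
```

In this sample the result is exact because 0.75 * 1.5 = 1.125 needs only 3 fractional bits; in general the shift discards the low 4 bits of the product.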

Question 4. Fixed-point multiplication with saturation

Write SystemVerilog code to implement the computation \[ y = a \cdot b \] where all variables (a, b, and y) are represented in Q5.4 format. Instead of simply truncating the product back to Q5.4, which lets an out-of-range result wrap around, saturate it to the largest or smallest representable value. Proceed as follows:

Declare the variables with the appropriate bit widths.
Compute the product of a and b in an intermediate variable with full precision.
Either shift and then saturate, or saturate and then shift, to bring the product back to Q5.4 format for the final variable y.
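A software reference model is useful for testing the SystemVerilog against. The Python sketch below follows the shift-then-saturate order and assumes the same convention as Question 3, namely that the 5 integer bits of Q5.4 include the sign, giving 9-bit operands and an 18-bit full-precision product. The constant and function names are hypothetical.

```python
FRAC_BITS = 4
TOTAL_BITS = 9                              # assumed: 5 integer bits (incl. sign) + 4 fractional
Q_MIN = -(1 << (TOTAL_BITS - 1))            # -256, i.e. -16.0
Q_MAX = (1 << (TOTAL_BITS - 1)) - 1         #  255, i.e. 15.9375


def sat_mul_q54(a_q: int, b_q: int) -> int:
    """Multiply two Q5.4 integers and return a saturated Q5.4 integer."""
    prod = a_q * b_q                        # full precision, 8 fractional bits
    shifted = prod >> FRAC_BITS             # back to 4 fractional bits (arithmetic shift)
    return max(Q_MIN, min(Q_MAX, shifted))  # clamp instead of wrapping around


print(sat_mul_q54(56, 32) / 16)     # 3.5 * 2.0    -> 7.0
print(sat_mul_q54(240, 240) / 16)   # 15.0 * 15.0  -> 15.9375 (saturated high)
print(sat_mul_q54(-256, 240) / 16)  # -16.0 * 15.0 -> -16.0 (saturated low)
```

Saturating first and shifting second would instead compare the full-precision product against the Q5.4 limits scaled up by \(2^4\); both orders should give the same result when the limits are scaled consistently.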