After the user inputs the two multiplicands A and B, the hardware iteratively adds and shifts persuant to the algorithm. The inputs to the 12-bit adder are thus P and a multiple of B. Now, since we are required to provide any given multiple of B from -4 to +4, we must either have these values ready-stored or else generate them on the fly. Since having all 9 multiples ready would require a great deal of precomputation, storage, and multplexing, we see it as undesirable. On the other hand, generating any multiple on the fly would require extra hardware to add. So instead, we took a hybrid approach: we compute and store the quantity 3 * B ahead of time, by adding B and 2B. (This uses the same 12-bit adder as the computation P + XB does, so both inputs to the adder have to be multiplexed.) Then any desired multiple can be generated on the fly by shifting or flipping either B or 3B (to flip a 2's complement number, you flip all the bits and then add one, which can be done by passing a 1 to the carry-in to the adder), then optionally zeroing the result if 0B is needed.

To accomplish this task of generating multiples of B, we designed a unit called FESZ: flip, extend, shift, zero. Whether the FESZ flips, zeroes (or as we named the signal, "keep"s by not zeroing), selects B or 3B as its operand, or shifts (and by how much) is a direct function of which multiple of B is needed, which in turn is a function of the lowest four bits of A and the last bit shifted out, L. The "extend" part of the FESZ is needed since only 9 bits of B and 11 bits of 3B are stored in latches, and the input to the adder must be 12 bits, so the sign extension is done in the FESZ.

The FESZ is implemented in 12 bitslices, one for each output bit of XB ("X times B"). To avoid circuitous routing, each slice only takes as input one bit of B, even though it needs three bits, since B may get shifted by up to two. In each slice, this bit of B is propagated vertically to adjacent bitslices which also need it. (This feature unfortunately prohibits alternate mirroring.) We have provided a gate-level diagram of a FESZ slice. Notice that the control signals, common to all bitslices, which run vertically. These are more low-level than L and the lowest three bits of A, although they are a direct function of those signals. In order to compute these controls from A2, A1, A0, and L, we introduced a FESZ controller, which takes A2, A1, A0, and L as input and outputs the control signals to the FESZ. Note that the FESZ controller skips over the potential intermediate representation of which multiple of B is needed, expressed as an integer. We decided that our implementation of the FESZ controller (writing up truth tables, then running espresso and mpla) was sufficiently simple that we could combine the two steps of determining the multiple of B needed and determining the controls to the FESZ based on that multiple.

After 3B is stored, the core computations of the algorithm are performed in three successive clock cycles, one for each add and shift. In the first part of the clock cycle, the shifted values of P and A (which are latched) are shifted and stored in slave latches. On the second part of the clock cycle, these values in slave latches are sign-extended and added to the desired multiple of B provided by the FESZ using the 12-bit adder. The result gets stored back in P, while at the same time the shifted value from A gets stored back into A.

As all latches need to be loaded during a particular part of the clock cycle (clka or clkb high), care is taken to provide a load control that is qualified at that part of the clock cycle. Qualified signals are generated by AND'ing a stable signal with a clock. If the desired control signal is not stable during the clock phase that we would like it be stable in, we add a temporary latch along the way so that the signal is delayed half a clock cycle. These temporary latches are not load controlled: the load signal is connected directly to a clock.