Yakk wrote:How are you at hardware optimization? Do you have access to an optimizing HDL compiler?
I don't know - I might have access to something like that... No idea really. Certainly large-scale integration and optimization of the whole logic circuit would be one way to deal with this.
I could possibly think about doing some medium-scale integration of the circuit by hand: take, say, 2 columns of 2 rows of processing stages, work out the truth table for their 8 bits of input and 9 bits of output, and optimize that...
I don't see where the (x>7)?(x-3):x shows up in the double dabble algorithm. I see them adding 3, not subtracting 3. So I'm confused -- is that due to your optimization?
The normal double dabble algorithm operates on integers. You start with high-order bits and incorporate more bits by shifting and iterating. I wasn't sure exactly how double-dabble could be applied to floating-point numbers, so I experimented until I found something that worked. Basically, it starts with low-order bits, and uses the inverse of the regular double dabble function. (In integer double dabble, the processing stage would be ((x>4)?(x+3):x)... ((x>7)? (x-3):x) is the inverse of that function.)
Basically, the farther you get from 20
, the more decimal digits are affected. In the integer case, the high-order bits of the input will affect all digits of your output. (Since no power of 2 is also a power of 10...) In the floating-point case, it's the low-order bits that have wider impact (2-1
= 0.5, affecting just one digit, while 2-16
has 12 decimal digits, and any one of those, via addition with higher-order digit values, may impact digits all the way up to 10-1
) - so when dealing with bits corresponding to negative powers of two, you need to start with the least-significant and work your way up to the most-significant ones. Hence, inverting the double dabble stage-processing function. I didn't do the math to know why that works (not exactly), I just experimented and discovered that it did...
(x>7)?(x-3):x can be implemented in ... well, not that many gates. 10 to 20? Doing it 64 times (the number of bits in a 16 digit BCD number) should take less than 2000 gates.
I implemented it in 9 gates, IIRC. (One of them was a three-input AND gate, so I don't know if that counts as one gate or two.)
The thing is, unrolling the "double dabble" loop and implementing it in logic gates means taking that single stage and forming a giant triangle - each successive column has one more processing stage than the one before it (until you reach the number of output digits) The first column has one stage, the last column has a number of stages corresponding to your number of output digits. Since each column incorporates one more bit of input than the one before it, the number of columns is equal to the number of input bits. The total number of processing stages is therefore (digits * (bits-digits)) + (bits * (bits + 1))/2
You aren't implementing a full adder (or subractor) for that are you? (x>7)?(x-3):x is a 4 bit to 4 bit function. Each output bit is a function of 2-4 input bits, costing 1-4 gates to calculate, probably less if you reuse partial results.
Yeah, I already optimized the logic of the 4 bit to 4 bit function... I knew from the start that implementing this thing would be very resource-intensive for the game so I did the optimization pretty much from the beginning, and refined it as much as I could later on. For instance, (x > 7) is equivalent to (x &
, and bit zero flips when the high-bit of the input is set, so out0
I was wasting a bit of logic for a while handling a few of the cases (for instance a lot of "a AND not in3
OR b AND in3
") - eventually I figured out that it was more productive to think about cases where out1
didn't match in1
, and use an XOR gate to flip the state of the corresponding input in those cases.
Another optimization I did was a special case for the stages that only take in one bit of input. The very first column has one processing stage, taking the lowest-order bit as in3
- so instead of implementing ((x>7)? (x-3):x) I just did ((x&8)? 5:0) - no gates at all for those 16 low-order processing stages...