Monday, December 10, 2007

AEMB Multi-Threaded Mania!

As I mentioned a few posts ago, I have been toying with ideas to improve my AEMB. I thought of increasing the number of pipeline stages and introducing fine-grained multi-threading (FGMT) into the processor. This is my first foray into building a multi-threaded (MT) processor, so I wasn't quite sure what to expect.

Anyway, I decided to try both things out, and the results have been extremely promising. In summary, I have managed to double the clock speed of the processor, while only consuming an extra 50% of resources. At the same time, the processor is now capable of executing two concurrent hardware threads. So, what this means in simple terms is:

  1. Higher clock speeds translate into higher performance. The processor is capable of issuing one instruction each clock cycle. So, doubling the clock speed essentially doubles the instruction issue rate. The AEMB now distributes the clock cycles evenly between the two hardware threads. Assuming that it clocks at 100 MHz, each thread will have 50 MHz available.
  2. Multi-threading increases efficiency. With two threads interleaved, the processor has virtually no wasted clock cycles, so the doubled instruction issue rate is actually put to use. In addition, when dealing with slower devices, the processor doesn't need to sit idle: the second thread can continue to run while the first thread waits on the slow device.
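The even split of clock cycles, and the slow-device case, can be illustrated with a toy model. This is only a rough sketch and not the AEMB's actual pipeline logic: it assumes strict round-robin issue (clock cycle c belongs to thread c mod 2), and the names `simulate`, `per_thread_mhz` and the `stalls` input are hypothetical, made up for this example.

```python
def simulate(cycles, stalls=None, threads=2):
    """Toy model of strict round-robin issue between hardware threads.

    Assumption: clock cycle c belongs to thread (c % threads).
    `stalls` maps a thread id to the set of cycle numbers during which
    that thread is waiting on a slow device and cannot issue.
    Returns the number of instructions issued by each thread.
    """
    stalls = stalls or {}
    issued = [0] * threads
    for cycle in range(cycles):
        tid = cycle % threads
        if cycle in stalls.get(tid, set()):
            # This thread's slot is lost, but the other thread's slots
            # are untouched, so it keeps running at full rate.
            continue
        issued[tid] += 1
    return issued


def per_thread_mhz(clock_mhz, threads=2):
    """Share of the clock available to each thread under an even split."""
    return clock_mhz / threads


if __name__ == "__main__":
    # Even split: 100 cycles shared between two threads.
    print(simulate(100))
    # Thread 0 waits on a slow device during cycles 0..19.
    print(simulate(100, stalls={0: set(range(20))}))
    print(per_thread_mhz(100))
```

In this model a stall on thread 0 only costs thread 0 its own slots; thread 1 still issues on every one of its cycles, which is the behaviour described in point 2.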

I've also made the design easily configurable. So, it is entirely possible to strip off some parts of it to further boost performance. For instance, the multipliers and barrel shifters can be removed. If space is an issue, one thread can be permanently disabled in hardware to reduce the area by 33%, while still keeping the high clock speed.

This may seem silly until you think about it a little. Assuming that the original consumed 1000 area units at 50 MHz and the new design consumes 1500 area units at 100 MHz, disabling a single thread brings it back down to 1000 area units, but at 100 MHz. So, it can get the best of both worlds. However, the remaining thread is still bound by the same restriction: it can use only half the clock cycles, so its instruction issue rate stays at 50 MHz. What the faster clock adds is the ability to interface with higher-speed devices.
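The arithmetic can be checked with a few lines of Python. This is a back-of-the-envelope sketch using the illustrative numbers above (1000 and 1500 area units, 50 and 100 MHz). It assumes the original design issues on every clock cycle and that the new design always divides its cycles between two thread slots, even when one thread is disabled. The names `CONFIGS`, `issue_rate_mhz` and `area_saving_percent` are hypothetical.

```python
# Illustrative figures from the text; not measured synthesis results.
CONFIGS = {
    "original": {
        "area": 1000, "clock_mhz": 50, "slots": 1, "active": 1,
    },
    "new, two threads": {
        "area": 1500, "clock_mhz": 100, "slots": 2, "active": 2,
    },
    "new, one thread disabled": {
        "area": 1000, "clock_mhz": 100, "slots": 2, "active": 1,
    },
}


def issue_rate_mhz(cfg):
    """Total instruction issue rate in millions of instructions per second.

    Each active thread gets clock_mhz / slots, and the slots belonging
    to a disabled thread go unused.
    """
    return cfg["clock_mhz"] * cfg["active"] / cfg["slots"]


def area_saving_percent(before, after):
    """Percentage of area saved when going from `before` to `after`."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        print(name, cfg["area"], cfg["clock_mhz"], issue_rate_mhz(cfg))
    print(round(area_saving_percent(1500, 1000)))
```

Under these assumptions the single-thread configuration matches the original in both area and issue rate, but runs its bus interface at twice the clock, while the two-thread configuration doubles the issue rate for 50% more area.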

Seeing that the new core is able to perform better than the old design, even when it's utterly crippled, there isn't any real reason to keep the old design around much longer. I will notify my users of the new design and get them to migrate over to it as soon as possible. However, I will keep both designs in the repository for at least a few months, until I can be sure that all the kinks are ironed out with the new design.

Looks like the old design is going to be ignored from now on. I will freeze development on it and only perform minor bug fixes.
