May 31, 2007 -- Over the past several years, AMD has successfully fought the PR battle with Intel to convince OEMs, PC manufacturers, and consumers that frequency, or megahertz (MHz), is not the right metric for evaluating processor performance. The essence of the argument is that processor designers can use architectural techniques that improve the performance of applications running on the processor without increasing its clock rate. In fact, processors with high clock rates dissipate much more power, since dynamic power dissipation (P) is proportional to operating frequency (f). In the ultimate acknowledgment of this argument, Intel today ships dual-core chips that run at lower MHz than its older-generation processors.
The same argument holds true for embedded processors. In fact, embedded products operate under very tight power/energy budgets. This is well known for portable devices, because extending battery life is an important requirement, but it holds even for non-portable devices: consumers don't want a (noisy) cooling fan running in their set-top boxes or display projectors, and IT managers want routers and switches that minimize energy consumption in the data center.
Consumers want more performance and longer battery life
Embedded products, particularly portable consumer devices, are continually incorporating more features and placing increasing computational demands on application and multimedia chipsets. As a result, it's becoming easier for consumers to listen to music, watch videos, take calls, write e-mails, and even edit documents on their portable media players (PMPs), cell phones, and PDAs.
Conventional RISC processor core IP providers have, over the past decade, responded to this increasing computational demand by designing processors with deeper pipelines that operate at ever higher frequencies. This brute-force method means that, in the past five years, processor core speeds have increased faster than the underlying process technology speed-up over the same period. While this meets the raw MIPS requirements of general-purpose applications, it isn't well suited to DSP applications such as multimedia and baseband processing. Higher MHz also comes with several disadvantages, such as higher power, larger area, and, in many cases, worse performance, as discussed below.
The problem with longer pipelines and higher MHz
To achieve higher MHz, conventional RISC processors have to use deeper pipelines. Deeper pipelines come with several disadvantages, including (a) a very high penalty for branch delays and branch mispredictions, (b) high area overhead for the data-forwarding and control logic that the deeper pipeline requires, and (c) additional area-expensive units, such as branch prediction units, to alleviate branch penalties. These factors reduce architectural efficiency: together, they erode much of the performance gained from the higher frequency.
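To make the branch-penalty argument concrete, here is a minimal back-of-the-envelope model (not taken from the article): effective cycles per instruction (CPI) grow by branch frequency × misprediction rate × flush penalty, and throughput is clock rate divided by CPI. The function name `effective_mips` and every parameter value below (branch fraction, misprediction rate, penalty depths, clock rates) are illustrative assumptions, not measured data for any real processor.

```python
def effective_mips(freq_mhz: float, branch_fraction: float,
                   mispredict_rate: float, penalty_cycles: int,
                   base_cpi: float = 1.0) -> float:
    """Millions of instructions per second for a simple in-order pipeline.

    Each mispredicted branch stalls the pipeline for `penalty_cycles`, so
    CPI = base_cpi + branch_fraction * mispredict_rate * penalty_cycles.
    """
    cpi = base_cpi + branch_fraction * mispredict_rate * penalty_cycles
    return freq_mhz / cpi


# Hypothetical comparison: a short pipeline vs. a deeper one clocked 60% faster.
shallow = effective_mips(250, branch_fraction=0.2, mispredict_rate=0.1, penalty_cycles=2)
deep = effective_mips(400, branch_fraction=0.2, mispredict_rate=0.1, penalty_cycles=7)
print(f"shallow: {shallow:.0f} MIPS, deep: {deep:.0f} MIPS, "
      f"speedup {deep / shallow:.2f}x vs. clock ratio {400 / 250:.2f}x")
```

Even this simplified model, which ignores the area and power cost of the branch predictor and forwarding logic, shows the deeper pipeline recovering only part of its clock-rate advantage.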
But by far the biggest penalty of deeper pipelines and higher frequencies is that the power consumption of the processor core shoots up tremendously. In the best case, power increases proportionally with frequency. In reality, the area overhead for the deeper pipelines increases power consumption even more.
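The standard CMOS switching-power relation, P = α·C·V²·f, captures both effects described above: power rises linearly with frequency f, and the extra logic of a deeper pipeline adds switched capacitance C on top of that. The sketch below assumes fixed supply voltage; in practice higher clock rates often also require a higher V, which makes the increase worse because of the V² term. The function name and all numeric values are hypothetical and chosen only to show the scaling.

```python
def dynamic_power_mw(activity: float, capacitance_nf: float,
                     vdd_volts: float, freq_mhz: float) -> float:
    """Switching power of CMOS logic, P = alpha * C * V^2 * f, in milliwatts.

    Units: capacitance in nanofarads and frequency in megahertz, so
    1e-9 F * V^2 * 1e6 Hz = 1e-3 W, i.e. the product comes out in mW.
    """
    return activity * capacitance_nf * vdd_volts ** 2 * freq_mhz


# All values are hypothetical, chosen only to show the scaling.
base = dynamic_power_mw(0.1, 1.0, 1.2, 250)    # shallow core at 250 MHz
faster = dynamic_power_mw(0.1, 1.0, 1.2, 400)  # same core pushed to 400 MHz
deeper = dynamic_power_mw(0.1, 1.3, 1.2, 400)  # plus 30% more switched capacitance
print(f"{base:.1f} mW -> {faster:.1f} mW (frequency only) "
      f"-> {deeper:.1f} mW (with area overhead)")
```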
So, while using deeper pipelines is a valid way to meet higher computational demands, it is also a sure way to decrease battery life. And battery life is a key decision metric for consumers choosing PMPs, cell phones, and PDAs. Power, therefore, has become a first-order consideration for SOC designers, along with area and performance.
Increasing the frequency of embedded processor cores just doesn't cut it anymore. High-frequency cores consume too much power and require large memories (which are even more power hungry) to support them.
This raises the question: Is it possible to get higher-performance embedded processors without increasing the frequency?
Higher performance without higher MHz
If the applications targeted at the processor are known, then an application-optimized processor can be created. Such a processor has instructions and functional units that accelerate a particular application or a class of applications. This can be done easily using an advanced configurable and extensible processor such as Tensilica's Xtensa processor by creating instruction extensions.
If, however, the embedded processor is expected to execute a general set of applications, then a processor with a very long instruction word (VLIW) architecture can serve as a high-performance, yet low-MHz, solution. For example, Tensilica's Diamond Standard 570T, a 3-issue VLIW processor, has been rated the highest-performance embedded processor (based on EEMBC benchmarks) even when running at 200 to 250 MHz, outperforming competing single-issue cores running at clock rates up to twice as high.
The EEMBC benchmark suites serve as a useful data point for comparing processor cores, because they cover a wide range of applications and because these applications are representative of the domains that matter in embedded SOC design. As shown in Figure 1, the Diamond Standard 570T outperforms processors such as ARM11 and MIPS 24K on each of the EEMBC benchmark suites.
Figure 1. Comparison of Tensilica's Diamond 570T against ARM11 and MIPS 20K on the EEMBC Benchmark Suite. Note that MIPS 20K is a dual-issue processor and is, therefore, higher performance than a MIPS 24K on a per-MHz basis.
In a VLIW architecture, the processor issues more than one operation per instruction (i.e., per cycle). A 4-issue VLIW processor, for example, issues four operations per instruction and attempts to increase application performance by executing more operations per cycle than a classic RISC pipeline. In the ideal case, where every slot of every instruction is filled, a 4-issue VLIW provides four times the performance of a single-issue RISC processor running at the same clock rate.
In fact, DSP processors have also moved to VLIW architectures to increase performance, as evidenced by Texas Instruments' decision to adopt a VLIW architecture for its highest-performance DSP product line, the C6x series.
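The issue-width arithmetic above can be written down directly: throughput is clock rate times operations completed per cycle. The sketch below, with hypothetical function names, computes the ideal 4-issue speedup at equal clock and the average number of operations per bundle a VLIW must fill to match a scalar core clocked twice as fast. The scalar core's 500 MHz clock and its IPC of 0.8 are assumed values for illustration, not figures from the article.

```python
def throughput_mops(freq_mhz: float, ops_per_cycle: float) -> float:
    """Millions of operations per second: clock rate times operations per cycle."""
    return freq_mhz * ops_per_cycle


def break_even_ops_per_bundle(scalar_mhz: float, scalar_ipc: float,
                              vliw_mhz: float) -> float:
    """Average operations per VLIW bundle needed to match a scalar core."""
    return throughput_mops(scalar_mhz, scalar_ipc) / vliw_mhz


# Ideal case: every slot filled, same clock -> 4x a single-issue core.
ideal = throughput_mops(250, 4) / throughput_mops(250, 1)
# Hypothetical: 3-issue VLIW at 250 MHz vs. a scalar core at 500 MHz with IPC 0.8.
needed = break_even_ops_per_bundle(500, 0.8, 250)
print(f"ideal 4-issue speedup at equal clock: {ideal:.1f}x")
print(f"ops per bundle needed to match the 500 MHz core: {needed:.2f} of 3 slots")
```

Under these assumptions the VLIW core only needs to fill about half of its three slots on average to keep pace with a core clocked twice as fast, which is why compiler-extracted parallelism can substitute for clock rate.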
The high-performance Diamond Standard 570T
The Diamond Standard 570T is a RISC-based, 5-stage VLIW processor core that uses the Xtensa instruction set architecture (ISA). The Xtensa ISA uses 24-bit instructions with 16-bit narrow encodings. The VLIW instructions in the Diamond 570T are encoded in 64 bits, and the processor issues and executes 64-, 24-, and 16-bit instructions. The software development toolkit (SDK) for the Diamond 570T includes the Xtensa C/C++ Compiler (XCC), along with a complete GNU-based toolchain that includes the debugger, profiler, assembler, and linker. XCC is an advanced optimizing compiler that automatically extracts instruction-level parallelism from C/C++ code and bundles concurrent operations into VLIW instructions. The SDK also includes a cycle-accurate instruction-set simulator (ISS), a fast functional compiled simulator (TurboXim), and system models (SystemC and a C-based model) that enable fast modeling of the processor and the system around it.
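To illustrate what "bundling concurrent operations into VLIW instructions" means, here is a toy greedy list scheduler. It is a textbook technique, not XCC's actual algorithm: it assumes every operation takes one cycle and can go in any slot, and it ignores register pressure and slot restrictions that a real compiler must respect. The function `bundle_ops` and the operation names in `example_deps` are hypothetical.

```python
from typing import Dict, List, Set


def bundle_ops(deps: Dict[str, Set[str]], issue_width: int) -> List[List[str]]:
    """Greedy list scheduling: pack independent operations into fixed-width bundles.

    deps maps each operation to the set of operations whose results it reads.
    Every operation is assumed to take one cycle, so an operation may issue in
    the bundle after all of its producers have issued.
    """
    done: Set[str] = set()
    bundles: List[List[str]] = []
    remaining = list(deps)  # dict order is used as the scheduling priority
    while remaining:
        ready = [op for op in remaining if deps[op] <= done]
        if not ready:
            raise ValueError("unsatisfiable dependences (cycle or unknown operation)")
        bundle = ready[:issue_width]
        bundles.append(bundle)
        done.update(bundle)
        remaining = [op for op in remaining if op not in bundle]
    return bundles


# Hypothetical example: y = a*b + c*d as loads, two multiplies, and an add.
example_deps: Dict[str, Set[str]] = {
    "ld_a": set(), "ld_b": set(), "ld_c": set(), "ld_d": set(),
    "mul_ab": {"ld_a", "ld_b"},
    "mul_cd": {"ld_c", "ld_d"},
    "add": {"mul_ab", "mul_cd"},
}

for width in (1, 3):
    print(width, bundle_ops(example_deps, width))
```

With an issue width of 1 the seven operations need seven instructions; with three slots they fit in four bundles, which is the kind of instruction-count reduction a VLIW compiler aims for.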
Higher performance without higher area equals lower power
One of the benefits of using a lower-frequency, shallow-pipeline VLIW processor such as the Diamond Standard 570T, rather than an 8- or 9-stage high-frequency RISC processor, is that the Diamond Standard 570T occupies much less area.
Figure 2 compares the area and power of the ARM11, MIPS 24K, and Diamond Standard 570T. Even though the Diamond 570T delivers on average 2.5X the performance of an ARM11 and about 2.2X that of a MIPS 24K (based on EEMBC benchmarks), it is much smaller (almost 45% smaller) than both of these processors. This is also reflected in its power/MHz and in its absolute power at the same frequency: at 400 MHz, the Diamond 570T dissipates 43 mW, roughly one-sixth of the 240 mW an ARM11 dissipates.
Figure 2. Area and power comparisons between ARM11, MIPS 24K, and Diamond 570T in 0.13G. MIPS and ARM area and power numbers are based on the latest data published on their websites as of March 2007 (MIPS does not report 90nm data).
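The Figure 2 numbers can be combined into a rough performance-per-watt comparison. This quick check uses only figures from the text (43 mW vs. 240 mW at 400 MHz, and a 2.5X average EEMBC advantage over ARM11), plus one assumption the article does not state: that the 2.5X performance advantage holds at that same 400 MHz clock. The function name is hypothetical.

```python
def relative_efficiency(perf_ratio: float, power_a_mw: float, power_b_mw: float) -> float:
    """Performance per watt of core A relative to core B.

    perf_ratio is A's performance divided by B's, assumed at the same clock,
    so efficiency ratio = perf_ratio * (power_B / power_A).
    """
    return perf_ratio * power_b_mw / power_a_mw


power_ratio = 240 / 43  # ARM11 power / Diamond 570T power at 400 MHz (article figures)
# Assumption: the 2.5X average EEMBC advantage applies at 400 MHz as well.
efficiency = relative_efficiency(2.5, power_a_mw=43, power_b_mw=240)
print(f"power ratio {power_ratio:.2f}x, performance-per-watt ratio {efficiency:.1f}x")
```

If the performance figure was measured at a different clock rate, the efficiency ratio changes accordingly; the power ratio alone depends only on the published numbers.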
High performance, lower frequency, lower power and smaller footprint
For SOC designers choosing a processor core, the most important metrics are area, performance, power, and price. Traditionally, performance has been associated with higher frequency. The Diamond 570T shows that higher performance can be achieved even while running the processor at a lower frequency. This leads not only to lower power, because of the lower frequency, but also to better architectural efficiency and lower area. The lower area in turn yields further power savings compared with traditional deep-pipeline RISC processors.
As consumer demands continue to grow, power has become a dominant metric in choosing the underlying processor core. Furthermore, the use of multiple specialized processors for tasks such as video, audio, and baseband places an even higher demand on high performance without compromising area and power. We believe that this will fuel the movement towards application-customized processors such as Tensilica’s Xtensa configurable processor and general purpose processors such as the Diamond 570T that achieve high performance at a low frequency.
By Sumit Gupta
Sumit is a Product Marketing Manager at Tensilica, Inc. Previously, Sumit was an Entrepreneur-in-Residence at Tallwood Venture Capital, did post-doctoral research on reconfigurable computing at UCSD and UCI, did his graduate research in parallelizing compilers and behavioral synthesis at UC Irvine, ASIC design at S3 Inc., and software design at IBM and IMEC. Sumit has a Ph.D. in Computer Science from the University of California at Irvine and a B.Tech. in Electrical Engineering from the Indian Institute of Technology, Delhi. He has written one book, several book chapters, and more than 20 peer-reviewed conference and journal publications.