June 1, 2005 -- A new type of processor core has been getting a lot of attention lately: a processor you can tailor to a specific application. Configurable processor cores run targeted algorithms much faster than standard embedded microprocessor cores. Some can even be used in place of hand-coded RTL in IC designs, which greatly speeds SOC development.
What is a configurable processor? What can configurable processor cores do? Why would anyone want to use this type of processor? How can a configurable processor core replace RTL coding in the design of an SOC? These questions and more are answered in this article.
Most popular embedded microprocessor architectures-such as the ARM, MIPS, and PowerPC processors-were originally designed in the 1980s as stand-alone chips. These general-purpose architectures are broadly applicable because they execute a very wide range of algorithms reasonably well. However, SOC designers often have to speed up critical portions of their design in hardware because general-purpose processor cores execute some algorithms too slowly to meet performance goals. Even DSP architectures can't match the speed of custom-tailored hardware.
Hand-coding RTL to speed up designs
Because many applications just don't run fast enough on standard embedded microprocessor cores even with an auxiliary DSP, engineering teams hand-code parts of their SOC design in Verilog or VHDL to achieve performance goals. However, manually generated RTL blocks take a long time to verify, easily doubling the time required to design each block. In addition, fixed-function RTL blocks can't be easily changed once they're designed because of these same verification issues. Yet changes are often needed to accommodate new standards or product features.
Figure 1 shows a close-up look at a typical RTL block designed to execute a particular algorithm, with the RTL datapath on the left and the state machine on the right.
Figure 1. RTL Algorithmic Block = Data Path + State Machine
In nearly all RTL hardware designs, the datapath consumes most of the gates in the RTL block. A typical RTL datapath may be as narrow as 16 or 32 bits, or hundreds of bits wide. The datapath typically contains many data registers and will often include significant blocks of RAM or RAM interfaces that are shared with other RTL blocks.
By contrast, the RTL logic block's finite state machine (FSM) contains nothing but control details. All the nuances of sequencing data through the datapath, all the exception and error conditions, and all the handshakes with other blocks are captured in the FSM, which encapsulates most of the design risk due to its complexity. A late design change made to a manually written RTL block is much more likely to affect the FSM than the datapath because of this complexity.
Configurable processor cores reduce the risks associated with an algorithm's FSM design because they are programmable, which allows the FSM to be implemented with firmware. It is far easier to modify firmware than hardware, due in no small part to the much faster execution speed of instruction-set simulators compared to gate-level simulation. The speed difference can be as much as 1000x or more. In addition, the automatically generated configurable-processor hardware can be guaranteed correct-by-construction by the processor vendor.
What is a configurable processor core?
A configurable processor core is a complete, fully functional processor design that can be tailored or expanded to meet the performance needs of an application or set of algorithms. There are four general ways a processor can be configured:
- By selecting from standard configuration options, such as bus widths, interfaces, memories, floating-point units, etc.
- By adding custom instructions that describe new registers, register files and custom data types, such as 56-bit data for security processing or 256-bit data types for packet processing.
- By adding custom, high-performance interfaces that exceed the bandwidth abilities of the more common shared-bus architectures of conventional RISC and DSP cores.
- By using programs that automatically analyze the C code and determine the best processor configuration and extensions.
Configurable processors are delivered as synthesizable RTL code, ready to be stitched into an FPGA or SOC design. The best configurable processors also come with automatically tailored software-development tools (the compiler, assembler, debugger, linker, and profiler), EDA synthesis scripts, and verification test benches that reflect the designer-defined architectural extensions so that no additional effort is required to ready the configured core for SOC development.
The ability to add custom machine instructions of any width allows an SOC designer to use a configurable processor core to implement data-path operations that closely match the abilities of a manually designed RTL block. In a configurable processor core, the equivalent data paths are implemented using the base processor's integer pipeline, plus additional execution units, registers, and other functions added by the chip architect or SOC designer for a target algorithm(s).
For example, the Tensilica Instruction Extension language (TIE, which resembles a simplified version of Verilog) allows developers to extend Tensilica's Xtensa 32-bit microprocessor architecture for specific applications. TIE is optimized for high-level specification of new data-path functions in the form of added instructions and registers. A TIE description is both simpler and much more concise than RTL because it omits all sequential logic descriptions, including FSM descriptions, and initialization sequences. These complex items are actually-and more easily-developed in firmware.
The new instructions and registers described in TIE are available to the SOC firmware programmer via the same compiler and assembler that target the foundation processor's instructions and register set. Firmware controls the operational sequencing of the algorithm using the processor's base data path, its data-path extensions, and the processor core's existing instruction-fetch, -decode, and -execute mechanisms. Because the instruction extensions greatly accelerate the processor core's performance on the targeted algorithm, FSM firmware can usually be written in a high-level language (C or C++) and still meet performance goals.
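As a rough illustration of what this looks like to the programmer (the instruction name and its semantics are invented here for the example, not taken from Tensilica), a TIE-defined instruction appears in C as an ordinary intrinsic. The plain-C function below stands in for an imagined single-cycle saturating-add instruction, with the firmware sequencing the algorithm in C:

```c
#include <limits.h>

/* Reference model of a hypothetical SATADD16 custom instruction:
   a 16-bit add that clamps instead of wrapping on overflow. On a
   configured core, the generated compiler would map a call like
   this onto the single-cycle instruction itself. */
static short satadd16(short a, short b)
{
    int s = a + b;                      /* widen to avoid C overflow */
    if (s > SHRT_MAX) return SHRT_MAX;  /* clamp high */
    if (s < SHRT_MIN) return SHRT_MIN;  /* clamp low */
    return (short)s;
}

/* Firmware-level loop: sequencing stays in ordinary C control flow,
   while the hot operation maps onto the custom instruction. */
void mix(const short *a, const short *b, short *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = satadd16(a[i], b[i]);
}
```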
Using the automated approach to configurable processor design
Compilers are now available that can examine the C code for a particular algorithm and suggest processor extensions that will speed up that algorithm. Tensilica's XPRES Compiler can automatically analyze an algorithm, identify critical inner loops and other tuning opportunities, and create many trial processor configurations with TIE extensions that boost performance. The XPRES Compiler creates graphs of different execution-speed/gate-count trade-offs, as shown in Figure 2.
Figure 2. Tensilica's XPRES Compiler generates graphs that clearly show the trade-offs between added gates (area) and increased performance (cycle count).
Programs such as the XPRES Compiler can provide near-immediate feedback to an SOC design team and thus significantly shorten the SOC design cycle.
The configurable processor design cycle
The process for creating the RTL code for a configurable processor core varies from vendor to vendor. Many vendors allow designers to manually insert instructions (hand-coded in RTL) into the processor's RTL code, but this approach provides no mechanism to guarantee the operational correctness of the inserted instructions. This shortcoming can be of great concern to designers new to processor design. In addition, manually adding instructions to a processor's ISA means that the software-development tools, EDA scripts, and verification test benches know nothing about the new instructions and therefore can neither exploit nor test them. Programmers must use the new instructions by hand-writing assembly-language subroutines and functions, and SOC design engineers must develop their own ways to test them.
Other vendors, such as Tensilica, automate the process of designing a configurable processor. This automation allows Tensilica to guarantee that the results are correct by construction. Tensilica's process includes the following steps:
- Compile the original C/C++ application and run the program through the XPRES Compiler. The XPRES Compiler analyzes millions of possible combinations of instruction extensions and presents the designer with the best alternatives at various points along a gate-count/cycle-count curve.
- Select the "best" configuration for the SOC's target performance and gate count. Optionally, manually refine the generated configuration.
- Build the processor using Tensilica's standard automated processor-generator flow. The Xtensa Processor Generator creates the RTL version of the configurable processor and generates tailored versions of all necessary software-development tools, EDA scripts, and verification test benches.
- Compile the algorithm's original, unmodified C code to run on the extended processor core.
Note that if a designer uses the XPRES Compiler to create the processor, no modifications need to be made to the original C code or any other C code to take advantage of the new instructions in the processor's data path-the compiler exploits these new instructions automatically.
Figure 3 shows Tensilica's automated processor design process.
Figure 3. Tensilica's Automated Configurable Processor Design Process
Configurable processors as RTL alternatives
Configurable processors used to implement high-speed algorithms routinely use the same data-path structures as manually written RTL blocks: deep pipelines, parallel execution units, algorithm-specific state registers, and wide data buses to local and global memories. In addition, these extended processors can sustain the same high computation throughput and support the same data interfaces as typical RTL designs through the inclusion of designer-specified ports.
Control of configurable-processor data paths is very different from that of their RTL counterparts, however. Cycle-by-cycle control of a processor's data paths is not fixed in FSM state transitions but is embodied in firmware executed by the processor (see Figure 4). Control-flow decisions occur in branches; load and store operations implement memory references; and computations become explicit sequences of general-purpose and application-specific instructions.
Figure 4. Processor-Based Algorithm Block = Data Path + Processor + Software
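To make the contrast concrete, here is a toy sketch of our own (not from the article) of an FSM recast as firmware: the states of a simple byte-stream framer become cases in ordinary C control flow rather than hardwired RTL state transitions.

```c
#include <stddef.h>

/* The control an RTL block would encode as FSM state transitions
   becomes ordinary firmware. This toy "framer" scans a byte stream
   for a 0xA5 start marker, reads a one-byte length, then copies
   that many payload bytes into out[]. */
typedef enum { IDLE, LENGTH, PAYLOAD } state_t;

/* Returns the payload length on a complete frame, -1 otherwise. */
int parse_frame(const unsigned char *in, size_t n, unsigned char *out)
{
    state_t state = IDLE;        /* explicit state variable */
    size_t need = 0, got = 0;

    for (size_t i = 0; i < n; i++) {
        switch (state) {         /* each case = one FSM state */
        case IDLE:
            if (in[i] == 0xA5) state = LENGTH;  /* start marker */
            break;
        case LENGTH:
            need = in[i];
            got = 0;
            state = need ? PAYLOAD : IDLE;
            break;
        case PAYLOAD:
            out[got++] = in[i];
            if (got == need) return (int)need;  /* frame complete */
            break;
        }
    }
    return -1;   /* incomplete frame */
}
```

A late change to the framing protocol here is a firmware edit and a recompile, not an RTL redesign and reverification.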
Migrating SOC design methods from RTL data paths and FSMs to configurable processors with firmware control has many important implications:
- Added flexibility: changing the firmware changes an algorithmic block's function.
- Software-based development: Fast and low-cost software tools can be used to implement most SOC features.
- Faster, more complete system modeling: For a 10-megagate design, even the fastest RTL logic simulator may not exceed a few cycles per second. By contrast, firmware simulations for extended processors run on instruction-set simulators at hundreds of thousands or millions of cycles per second.
- Unification of control and data: No modern system consists solely of hardwired logic. There's always a processor running firmware on an SOC. Moving RTL-based algorithms into a configured processor core running firmware that drives a tailored data path removes the artificial separation between control and data processing.
- Time-to-market: Using configurable processors simplifies SOC design, accelerates system modeling, and speeds hardware finalization. Firmware-based state machines easily accommodate changes to standards because implementation details aren't "set in stone." They can be changed even after the SOC has been fabricated.
- Designer productivity: The engineering manpower needed for SOC development and verification is greatly reduced. A processor-based SOC design approach permits graceful recovery when (not if) a bug is discovered.
The benefits of being able to make changes in software rather than hardware with a processor-based approach cannot be overstated. Configurable processors reduce the risk of state-machine design by replacing hard-to-design, hard-to-verify RTL FSMs with automatically generated, correct-by-construction processor cores and firmware.
Fit the processor to the algorithm
Configurable, extensible processors allow developers to tailor the processor to the target algorithms. Designers can add special-purpose, variable-width registers; specialized execution units; and wide data buses to reach an optimum processor configuration for specific algorithms. Tools such as Tensilica's XPRES Compiler automatically determine the best extensions to the processor for a given algorithm.
The following examples show the performance improvements possible, using Tensilica's Xtensa processor as an example.
Accelerating the FFT
The heart of the decimation-in-frequency FFT algorithm is the "butterfly" operation, which resides at the innermost loop of the FFT. Each butterfly operation requires six additions and four multiplications to compute the real and imaginary components of a radix-2 butterfly result. Using the TIE language, it's possible to augment the Xtensa processor's pipeline with four adders and two multipliers so that the augmented processor core can compute half of an FFT butterfly in one cycle.
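A minimal C reference model of that radix-2 butterfly makes the operation count visible: six additions/subtractions and four multiplications per butterfly (the integer types and names here are illustrative, not Tensilica's).

```c
/* Radix-2 decimation-in-frequency butterfly on complex integers:
   A = a + b, B = (a - b) * w, where w is the twiddle factor. */
typedef struct { int re, im; } cplx;

void butterfly(cplx a, cplx b, cplx w, cplx *A, cplx *B)
{
    /* sum path: 2 additions */
    A->re = a.re + b.re;
    A->im = a.im + b.im;
    /* difference path: 2 subtractions */
    int tr = a.re - b.re;
    int ti = a.im - b.im;
    /* twiddle multiply: 4 multiplications + 2 additions */
    B->re = tr * w.re - ti * w.im;
    B->im = tr * w.im + ti * w.re;
}
```

The four multiplies and six adds above are exactly the resources the TIE extension commits to hardware, which is why two multipliers and four adders suffice for half a butterfly per cycle.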
The Xtensa processor's configurable data-memory bus interface can be as wide as 128 bits so that all four real and imaginary integer input terms for each butterfly can be loaded into special-purpose FFT input registers in one cycle. All four computed output components can be stored into data memory in one cycle as well.
Practically speaking, it's very hard to create single-cycle, synthesizable multipliers for SOCs that operate at clock rates of many hundreds of megahertz. Consequently, it's better to stretch the multiplication across two cycles so that the multiplier does not become the critical timing element in the SOC. The resulting additional multiplier latency does not affect throughput in this example and, if necessary, even longer latencies can be accommodated through additional state storage in the butterfly execution unit.
This approach adds a SIMD (single-instruction, multiple data) butterfly computation unit to the processor (using fewer than 35,000 gates including the two 24x24-bit multipliers). The performance improvements appear in Table 1. Adding the SIMD butterfly unit improves the performance by a factor of 300 to 400. The table also shows the code size of the FFT programs with and without the TIE extensions. Note that adding the new butterfly instructions greatly reduces code size.
Table 1. Acceleration results from processor augmentation with FFT instructions
Accelerating an MPEG-4 decoder
One of the most difficult parts of encoding MPEG-4 video data is an algorithm called "motion estimation," which searches adjacent video frames for similar pixel blocks so that the redundant pixels can be removed from the transmitted video stream. The motion-estimation search algorithm's inner loop contains a SAD (sum of absolute differences) operation consisting of a subtraction, an absolute value, and the addition of the resulting absolute value with a previously computed value.
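The SAD inner loop is easy to model in plain C. This sketch of ours performs serially, for one 16-pixel row, the subtract/absolute-value/accumulate work that a 16-pixel-wide SIMD SAD instruction collapses into a single cycle:

```c
/* Sum of absolute differences over one 16-pixel row: for each pixel,
   subtract, take the absolute value, and accumulate. */
unsigned sad16(const unsigned char *cur, const unsigned char *ref)
{
    unsigned sum = 0;
    for (int i = 0; i < 16; i++) {
        int d = cur[i] - ref[i];   /* subtraction */
        sum += (d < 0) ? -d : d;   /* absolute value + accumulate */
    }
    return sum;
}
```

In software this is 48 operations per row; the SIMD hardware described in the article retires all of them in one clock.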
For a QCIF (quarter common image format) image frame, a 15-Hz frame rate, and an exhaustive-search motion-estimation scheme, SAD operations require slightly more than 641 million operations/sec. As shown in Figure 5, it's possible to add SIMD SAD hardware capable of executing 16-pixel-wide SAD instructions per cycle to an Xtensa processor using TIE. (Note: 16 pixels' worth of data can be loaded in one instruction using the Xtensa processor's 128-bit maximum bus width.)
The combination of executing all three SAD component operations (subtraction, absolute value, addition) for each pixel in one cycle and the SIMD operation that simultaneously computes the values for 16 pixels per clock reduces the 641 million operations/sec to 14 million instructions/sec, a 98% reduction in the number of required clock cycles. This MPEG-4 motion-estimation accelerator is part of a complete MPEG-4 decoder developed by Tensilica as a demonstration vehicle. The decoder adds approximately 100,000 gates to the base Xtensa processor and implements a 2-way QCIF video codec (coder and decoder) operating at 15 frames/sec, or a QCIF MPEG-4 decoder operating at 30 frames/sec, using approximately 30 MIPS in either operational mode.
Figure 5. MPEG-4 SIMD SAD (sum of absolute differences) instruction execution hardware
Other MPEG-4 algorithms can also be accelerated, including variable-length decoding, iDCT, bit-stream processing, dequantization, AC/DC prediction, color conversion, and post filtering. When instructions are added to accelerate all of these MPEG-4 decoding tasks, creating an MPEG-4 SIMD engine within the tailored processor, the results can be quite surprising.
Table 2. MPEG-4 decoder acceleration results from processor augmentation with MPEG-4 decoding instructions
As Table 2 shows, the resulting SIMD-engine acceleration drops the number of cycles required to decode the MPEG-4 video clips from billions to millions, cutting the required processor operating frequency by roughly 30x to around 10MHz. Without the additional acceleration instructions, the processor would need to run at roughly 300MHz to perform the MPEG-4 decoding. There is a substantial difference in power dissipation and process-technology cost between a 10MHz and a 300MHz processor core. It's unlikely that any amount of assembly-language coding could produce a similarly large drop in clock rate.
As shown in these examples, it's possible to accelerate the performance of embedded algorithms using configurable microprocessor cores. Designers can add precisely the resources (special-purpose registers, execution units, and wide data buses) required to achieve the desired algorithmic performance instead of designing these functions in manually written, fixed-function RTL. With automated tools such as Tensilica's XPRES Compiler, these configurations can take less than a day to complete.
This SOC design approach can be pursued with an automated tool such as Tensilica's XPRES Compiler, or by manually profiling existing algorithm code to find the critical inner loops and then defining new processor instructions and registers that accelerate those loops. Either way, using configurable processors to implement embedded algorithms delivers large performance gains in far less time than manual RTL design requires. In most cases, designers can replace entire RTL-based algorithmic blocks with configurable processors tuned to the algorithm, saving valuable design and verification time and adding design flexibility through the inherent programmability of this approach.
By Steven Leibson, Technology Evangelist, Tensilica, Inc.
Go to the Tensilica, Inc. website to learn more.