December 10, 2009 -- Power consumption has become as important as area and performance as a quality metric of SOC designs. Electronic system-level (ESL) design methodologies enable power-consumption optimization opportunities unreachable for traditional RTL design methods. In addition to providing power-aware architecture-exploration capabilities, leading high-level synthesis tools employ several power-optimization methods to deliver designs with minimum power dissipation. One of the most important of these is advanced clock-gating optimization and analysis.
Clock gating requires going deep into the design, analyzing register by register which elements can be gated, and then building the appropriate logic to control the gated clocks of those registers. Most designers do not have the expertise or time to do this in the most effective and productive manner, if at all. Unfortunately, RTL synthesis tools do not have the "knowledge" of the design’s sequential properties to perform this optimization for them.
Basic clock gating
Clock gating is the most important and widely used technique for reducing power consumption of digital designs. This technique ensures that clock edges are applied to a register only in clock cycles in which the register actually needs updating, thus allowing substantial savings in dynamic power consumption. In addition, clock gating typically leads to an overall reduction in design area, which decreases the design’s leakage (or static) power.
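To make the idea concrete, here is a minimal behavioral sketch that counts how many clock edges reach a register with and without gating. The function name `count_clock_edges` and the activity pattern are assumptions made for illustration only. The edge count is a rough proxy for clock-related switching activity, not a power number.

```python
def count_clock_edges(enable_trace, gated):
    """Return how many rising clock edges reach the register.

    enable_trace: one boolean per clock cycle, True when the register
                  needs to load a new value in that cycle.
    gated:        if True, edges are delivered only in enabled cycles.
    """
    if gated:
        return sum(1 for en in enable_trace if en)
    return len(enable_trace)


# Hypothetical activity: the register is updated in 2 of every 8 cycles.
enable_trace = [True, False, False, False, True, False, False, False] * 4

edges_ungated = count_clock_edges(enable_trace, gated=False)
edges_gated = count_clock_edges(enable_trace, gated=True)
print(edges_ungated, edges_gated)  # 32 8
```

In this made-up trace the gated register sees a quarter of the clock edges. The actual saving depends on the design and on the test vectors.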
As a rule, gated-clock insertion is performed automatically by RTL synthesis tools. To enable these tools to find the registers to which clock gating can be applied, RTL designers use an "explicit enable signal" coding style in their VHDL or Verilog source code, as shown below:
Verilog
always @(posedge clk)
begin
  if (rst == 1'b1)
    q <= 1'b0;        // synchronous reset
  else if (en == 1'b1)
    q <= d;           // load only when enabled
end
VHDL
wait until clk'event and clk = '1';
if (rst = '1') then
  q <= '0';           -- synchronous reset
elsif (en = '1') then
  q <= d;             -- load only when enabled
end if;
This coding style does not explicitly force an RTL synthesis tool to use clock-gating techniques for the node q, but rather allows the tool to choose between two possible implementation options: a re-circulating register with a feedback multiplexer or an ICGC (Integrated Clock Gating Cell).
Figure 1. Explicit clock enable implementation in RTL synthesis.
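The two implementation options can be compared with a small cycle-based model. This is a sketch under stated assumptions, not a description of any particular cell library: the function names are hypothetical, the reset is synchronous, and the gated version is assumed to let the clock through whenever either `rst` or `en` is high so that the reset still takes effect.

```python
def recirculating_register(d_trace, en_trace, rst_trace):
    """Flop clocked every cycle; a feedback mux reloads q when en is low."""
    q, outputs, clock_edges = 0, [], 0
    for d, en, rst in zip(d_trace, en_trace, rst_trace):
        clock_edges += 1                      # clock always reaches the flop
        q = 0 if rst else (d if en else q)    # feedback multiplexer
        outputs.append(q)
    return outputs, clock_edges


def gated_register(d_trace, en_trace, rst_trace):
    """Flop behind a clock-gating cell; clocked only when rst or en is high."""
    q, outputs, clock_edges = 0, [], 0
    for d, en, rst in zip(d_trace, en_trace, rst_trace):
        if rst or en:                         # gating condition (assumed)
            clock_edges += 1
            q = 0 if rst else d
        outputs.append(q)                     # otherwise q simply holds
    return outputs, clock_edges


# Hypothetical stimulus, one entry per clock cycle.
d   = [1, 0, 1, 1, 0, 1]
en  = [1, 0, 0, 1, 0, 0]
rst = [0, 0, 0, 0, 1, 0]

mux_out, mux_edges = recirculating_register(d, en, rst)
icg_out, icg_edges = gated_register(d, en, rst)
print(mux_out == icg_out, mux_edges, icg_edges)  # True 6 3
```

Both models produce the same output sequence. They differ only in how often the flop is clocked, which is why the choice between them can be left to later synthesis and power analysis.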
Deciding which particular implementation will best suit a particular register of a design depends upon many factors. From the standpoint of minimizing area, an ICGC implementation typically delivers better results, especially for multi-bit registers. Choosing one particular implementation based on a power-consumption point of view is a more difficult task, which requires vector-based power analysis to be performed for each and every register in the design. This is the reason why registers that have an explicit enable signal are often called "clock gating candidates." This name simply reflects the fact that it is impossible to decide in advance of RTL synthesis and power analysis whether clock gating should be used for a particular register or not.
Clock-gating friendly netlists
Fortunately, a high-level synthesis (HLS) tool that analyzes and synthesizes untimed C/C++ algorithms can help. An advanced C synthesis tool that follows the best design practices used by RTL designers today neither directly instantiates integrated clock gating cells nor decides whether a clock-gating cell should be used in each particular case. Instead, it takes a more flexible approach to clock gating: it produces as many registers with explicit enable signals as possible, leaving power analysis and RTL synthesis tools to decide whether the ICGC implementation is required or not. In other words, registers generated with an explicit enable are simply clock gating candidates.
In order to produce the maximum possible number of clock-gating candidates, a special clock-gating optimization is applied, which examines every register in the design and extracts feedback datapaths out of the logic cones to generate logic conditions that may be used as explicit enable signals with the maximum possible efficiency.
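A toy version of this extraction can be sketched as follows. It is not the algorithm used by any specific tool. The tuple encoding `("mux", cond, if_true, if_false)` and the function names are assumptions chosen for illustration, and only the simple case where one mux branch feeds the register back to itself is handled.

```python
def conj(outer, inner):
    """AND two conditions, treating None as 'always true'."""
    return outer if inner is None else ("and", outer, inner)


def extract_enable(reg, expr):
    """Split a next-state expression into (enable, data).

    A branch equal to `reg` is a feedback path: the register holds its
    value there, so the mux condition can become an explicit enable.
    Returns (None, expr) when no feedback path is found.
    """
    if isinstance(expr, tuple) and expr[0] == "mux":
        _, cond, if_true, if_false = expr
        if if_false == reg:                   # holds when cond is false
            inner_en, data = extract_enable(reg, if_true)
            return conj(cond, inner_en), data
        if if_true == reg:                    # holds when cond is true
            inner_en, data = extract_enable(reg, if_false)
            return conj(("not", cond), inner_en), data
    return None, expr


# q_next = valid ? (sel ? d : q) : q   (hypothetical signal names)
next_q = ("mux", "valid", ("mux", "sel", "d", "q"), "q")
print(extract_enable("q", next_q))  # (('and', 'valid', 'sel'), 'd')
```

Once the feedback branches are peeled away, the register can be written in the explicit-enable style with `valid and sel` as its enable and `d` as its data input.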
This C-level clock-gating optimization and analysis technology yields a "clock-gating friendly" RTL netlist that produces the best achievable clock gating quality, regardless of the RTL synthesis tool in use or the number of levels in the clock gating scheme to infer in the gate-level netlist. The power savings due to clock gating are design- and test-vector-dependent. As a rule, designs with full handshaking at all of the I/Os are more likely to benefit from clock-gating optimization.
Saving area and power
As shown in Figure 2, the quality of results produced by clock-gating optimization is much better than what is potentially reachable if extraction of explicit enable signals is done by an RTL synthesis tool alone, simply because a high-level synthesis tool uses its "knowledge" of the design’s sequential properties to perform this optimization. This knowledge is usually unavailable to RTL synthesis tools or can be easily missed by designers hand-coding their RTL.
Figure 2. Comparing results of RTL synthesis without ("Solution_1") and with multi-level clock gating ("Solution_2"). Clock-gating optimization reduces both area and power consumption with identical timing and performance.
Figure 3 shows area and power improvements in a test suite that includes typical designs for wireless networking and video processing applications. These results have been produced with a 65-nm standard cell library. The area and power consumption numbers were reported for gate-level netlists by RTL Compiler after synthesis. The comparison was performed by turning on and off the clock-gating optimization control in the Catapult® C Synthesis tool from Mentor Graphics while all other high-level and RTL synthesis constraints remained the same in both cases.
| Test Case | RTL Compiler Area | RTL Compiler Power |
|---|---|---|
| OFDM Interleaver | -10.9% | -83.2% |
| OFDM De-Interleaver | -12.8% | -80.9% |
| OFDM Reed-Solomon Encoder | -5.5% | -79.6% |
| OFDM Reed-Solomon Decoder | -15.2% | -76.2% |
| OFDM Viterbi Decoder | -13.0% | -96.4% |
| H.264 De-blocking Filter | -5.8% | -69.5% |
| H.264 Inter-Chroma Predictor | -9.4% | -93.2% |
Figure 3. Improvements in area and power for a wireless networking and video-processing test suite.
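As a quick arithmetic check, the per-design numbers reported in Figure 3 can be averaged. The values below are copied from the figure; the variable names are illustrative. The result summarizes only these seven test cases and should not be read as a general expectation.

```python
# (area change %, power change %) per test case, as reported in Figure 3.
results = {
    "OFDM Interleaver":             (-10.9, -83.2),
    "OFDM De-Interleaver":          (-12.8, -80.9),
    "OFDM Reed-Solomon Encoder":    (-5.5,  -79.6),
    "OFDM Reed-Solomon Decoder":    (-15.2, -76.2),
    "OFDM Viterbi Decoder":         (-13.0, -96.4),
    "H.264 De-blocking Filter":     (-5.8,  -69.5),
    "H.264 Inter-Chroma Predictor": (-9.4,  -93.2),
}

mean_area = sum(area for area, _ in results.values()) / len(results)
mean_power = sum(power for _, power in results.values()) / len(results)
print(f"{mean_area:.1f}% {mean_power:.1f}%")  # -10.4% -82.7%
```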
Although improvements in area and power consumption due to clock-gating optimization may vary significantly from case to case, on average the optimization leads to a 10 to 15% reduction in design area and a 20 to 40% savings in dynamic power consumption. The latter number in some cases can reach 90% or more. Catapult C Synthesis is the only high-level synthesis tool that identifies all clock-gating candidates in the design. This assures that every register that can be gated will be gated, resulting in a design that is far more optimized than what can be done by hand.
By Maxim Smirnov
Maxim Smirnov is a Technical Marketing Engineer at Mentor Graphics Corp.
Go to the Mentor Graphics Corp. website to learn more.