June 9, 2008 -- One of the biggest challenges in pushing IC design beyond 65nm is the complexity inherent in effective power management. Whether the goal is to cut on-chip power dissipation in order to lower temperature and minimize cooling requirements, or to extend battery life in mobile and handheld devices, power is taking its place alongside timing as a critical dimension to be optimized during physical design. The problem is exacerbated by an explosion in the number of mode and corner scenarios, which can have conflicting power, timing, and signal integrity (SI) closure requirements. Current place and route solutions are severely limited by the inability of their core static timing analysis (STA) engines to represent more than a single mode/corner combination, resulting in either lost performance or missed time-to-market targets.
The basic low-power design techniques, such as clock gating for reducing dynamic power and multiple voltage thresholds (multi-Vt) to decrease leakage current, are well-established and supported by existing tools for a single mode/corner combination. However, designers are running into difficulty with more-advanced techniques such as designing for power in a multi-corner multi-mode (MCMM) context, multi-voltage flows, and designing power-efficient clock trees. With a multi-voltage supply (multi-Vdd) approach, some blocks use lower supply voltages than others, creating voltage "islands." This flow gets even more complex when dynamic voltage scaling is used to change the supply voltage level during operation. In a multi-corner multi-mode scenario, clock power consumption depends on various factors such as the circuit design style, architectural choice, clock distribution wiring, clock driver sizing, and the capability to disable part of the clocking network. In this article, we discuss some of these advanced low-power design techniques and outline the required solutions.
Multi-voltage design with MCMM
An increasingly common technique to reduce dynamic power is the use of multiple voltage islands (domains), which allows some blocks to use lower supply voltages than others, or to be completely shut off for certain modes of operation. This presents new challenges in physical design. Firstly, the tools need to correctly place and route across multiple domains and ensure that the timing and optimization engines honor the multi-voltage domain specifications. Secondly, the multi-VDD implementation system needs to ensure that the multi-mode multi-corner requirements are also satisfied in the same run. Basically, each additional voltage island causes the number of timing analysis mode/corner scenarios to double when all the min/max voltage combinations are considered (Figure 1).
Figure 1. Each additional voltage island causes the number of timing analysis mode/corner scenarios to double when all the min/max voltage combinations are considered.
The core architectures of incumbent place and route solutions in the market are at least 10-15 years old and were intended to handle at most one or two scenarios. Physical implementation for ultra-low-power designs must be capable of concurrently analyzing and optimizing for multi-voltage, multi-mode, multi-corner scenarios.
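To make the doubling concrete, here is a minimal Python sketch that enumerates the min/max supply combinations for a set of islands. The function name and the island names are hypothetical, and the sketch counts voltage combinations only; in a real design each one would be multiplied again by the functional modes and process/temperature corners.

```python
from itertools import product

def voltage_scenarios(islands):
    """Enumerate every min/max supply combination for the given islands."""
    return [dict(zip(islands, combo))
            for combo in product(("min", "max"), repeat=len(islands))]

# Each added island doubles the number of combinations: 2, 4, 8, 16.
for n in range(1, 5):
    islands = [f"island_{i}" for i in range(n)]
    print(n, "island(s) ->", len(voltage_scenarios(islands)), "voltage combinations")
```

With even a handful of islands, the scenario count quickly exceeds what an engine built for one or two scenarios can handle in a single run.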
Let’s now look at an ideal multi-voltage multi-mode multi-corner implementation system that includes significant architectural and algorithmic enhancements to the traditional place and route systems (Figure 2).
Figure 2. A full multi-voltage physical design flow.
- Multi-voltage setup. The physical design environment for multi-voltage designs needs to be set up carefully. A little extra time spent in setting up the Unified Power Format (UPF) can pay off by avoiding implementation and verification problems later in the flow.
- Floorplanning and placement. At this early stage in multi-voltage designs, it is important to correctly instantiate the voltage islands and insert special cells, such as isolation cells, level shifters, power switches, always-on connections and retention cells. The tool should have a fast prototyping placer to group cells into partitions and assign partition pins if needed. Power and ground routing can then create grids for each voltage island. The tool should provide an easy way to analyze always-on connections, level shifters, and isolation cells to ensure they are correctly placed before proceeding with routing.
- MCMM analysis and optimization. The UPF file, with the definitions for power domains and the power state table (PST), is fed into the tool database, along with the other library and design data. The PST includes combinations of voltages and power states, which are essentially operational modes in an MCMM environment. Tools like Mentor Olympus-SoC, which can perform true multi-corner multi-mode timing and power analysis, can concurrently analyze and optimize across these modes, corners, and voltage domains for single-pass closure. In related timing reports, voltages should be included as a column along with delay, skew, capacitance, and other pertinent data. You can then easily see paths crossing from one voltage island to another.
- Routing and optimization. The router should handle all the secondary power connections for retention flops and always-on buffers. It should respect voltage island boundaries, and change routing topology to meet other design constraints. For example, the router should detour around an island in order to buffer an SI violation on a non-critical net, but allow critical nets to cross an island. To do this, it must get constant updates on MCMM timing and RC, which it uses to find the optimal solution for meeting power, timing, SI, manufacturability, and area constraints. The advanced router is DFM-aware so it ‘sees’ and accounts for the manufacturing issues that affect power, especially leakage power, such as variations in on-chip temperature and thickness.
- Concurrent MCMM power and timing optimization. Final optimization should be done concurrently for leakage power, dynamic power, SI, timing, and area, and handle situations unique to multi-voltage designs, for example, buffering nets that cross voltage islands. Without concurrent MCMM leakage and timing optimization, the tool may never be able to resolve conflicting needs across different mode/corner combinations.
Optimizing for leakage power includes replacing low threshold-voltage (low-Vt) cells with high-Vt cells. High-Vt cells can increase delay, so it is crucial to have an accurate MCMM timing engine determining which paths can tolerate high-Vt cells and still meet timing under all corner-mode scenarios. The optimizer must also respect isolation cells, level shifters and retention registers, only resizing them with equivalent cells. Likewise, always-on buffers must be respected to avoid breaking connectivity and ending up with the dreaded ‘always-off’ condition.
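The Vt-swap decision can be sketched as a check against every scenario at once. The following Python fragment is illustrative only: the scenario names, slack values, and delay penalties are made-up numbers, and a real optimizer would take them from an incremental MCMM timing engine rather than from fixed dictionaries.

```python
def can_swap_to_high_vt(slack_by_scenario, penalty_by_scenario):
    """Return True only if the path still meets timing in every scenario
    after the extra delay of the high-Vt cell is subtracted from its slack."""
    return all(slack_by_scenario[s] - penalty_by_scenario[s] >= 0.0
               for s in slack_by_scenario)

# Hypothetical setup slacks (ns) for one path, and the delay added by the swap.
slack = {"func_slow": 0.12, "func_fast": 0.40, "test_slow": 0.03}
penalty = {"func_slow": 0.05, "func_fast": 0.02, "test_slow": 0.05}

# Looking at a single scenario, the swap appears safe...
single = can_swap_to_high_vt({"func_slow": slack["func_slow"]},
                             {"func_slow": penalty["func_slow"]})
# ...but checking all scenarios shows it would break timing in test_slow.
concurrent = can_swap_to_high_vt(slack, penalty)
print("single-scenario view:", single, "| all scenarios:", concurrent)
```

A tool that optimizes one scenario at a time would accept this swap and then have to undo it when the next scenario is analyzed, which is the iteration that concurrent optimization is meant to avoid.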
Ultra-low power MCMM clock tree synthesis
No low-power design strategy would be complete without serious attention to the clock tree. Clocks are the single largest source of dynamic power usage, and clock tree synthesis and optimization is a good place to achieve power saving in physical design.
Low-power clock tree synthesis (CTS) strategies include lowering overall capacitance and minimizing switching activity, both of which are discussed below. However, getting the best power results from CTS depends on the ability to synthesize the clocks for multiple corners and modes concurrently.
Concurrent MCMM CTS allows dynamic tradeoffs among all scenarios simultaneously. The experiences of customers using MCMM CTS show significant reductions in area, number of buffers, skew, total negative slack (TNS) and worst negative slack (WNS), in addition to lower dynamic power. The table below shows how, for a given mode, a single-corner CTS implementation compares with a 9-corner CTS implementation for a 9-corner design.
There are two opportunities for reducing capacitance in clock trees. One way is to optimize functional skew across multiple corners based on flop interactions. Most CTS tools balance global skew across all the flops regardless of which level of the clock tree they inhabit. Designers then have to manually drive CTS to balance the sinks correctly. If the CTS tool can analyze flop interactions, it can derive the exact skew balancing requirements at the different clock tree levels, and across different voltage islands. The result is a lower buffer count, shorter wire length, and lower power. It also simplifies CTS setup since designers no longer have to manually direct functional skew balancing.
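One simplified way to picture "analyzing flop interactions" is to group flops that are connected by timing paths, since only flops that launch to or capture from one another need their clock arrival times balanced. The Python sketch below uses a small union-find for this; the flop names and path list are hypothetical, and a production CTS engine would use far richer criteria (slack margins, corners, clock tree levels) than plain connectivity.

```python
def skew_groups(flops, timing_paths):
    """Group flops that interact through launch/capture timing paths."""
    parent = {f: f for f in flops}

    def find(f):
        # Follow parent links to the group representative, compressing the path.
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    for launch, capture in timing_paths:
        parent[find(launch)] = find(capture)

    groups = {}
    for f in flops:
        groups.setdefault(find(f), set()).add(f)
    return sorted(sorted(g) for g in groups.values())

flops = ["a", "b", "c", "d", "e"]
paths = [("a", "b"), ("b", "c"), ("d", "e")]
print(skew_groups(flops, paths))  # [['a', 'b', 'c'], ['d', 'e']]
```

In this toy case, flops d and e never exchange data with a, b, or c, so forcing all five to a single global skew target would spend buffers and wire on balancing that timing does not require.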
The other way to reduce capacitance, and therefore power, in the clock tree is to optimize the leaf clusters. Most of the capacitance in clock trees is found in the pin and wire capacitance of leaf cells. CTS can accomplish this by down-sizing registers, and placing registers to minimize leaf wire length. However, to perform these techniques accurately, CTS must have constantly updated information on the timing and parasitics. This requires a fast, incremental extraction engine and MCMM timing analysis to give accurate feedback on CTS decisions.
Clock gating is a very common technique for lowering clock tree power by shutting off the clock to unused sinks, preventing toggling that uses power but serves no useful function. How the clock gating is implemented, though, varies among CTS tools. Ideally, the tools should make dynamic tradeoffs between clock gate placements based on power consumption. As illustrated in Figure 4, if the probability of switching is equal on both sides of a clock gate, CTS should balance the tree for the best buffer count and lowest wire length. If the toggle rates on either side of a clock gate are different, CTS should minimize the wire length on the high-frequency wires to lower power, even at the expense of higher buffer count or wire length.
Figure 4. Comparison of equal probability of switching with different probability of switching.
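The tradeoff can be illustrated with a rough activity-weighted capacitance model, in which dynamic power scales with switching activity times wire capacitance. The sketch below is a simplification under stated assumptions: wire capacitance is taken as proportional to length, the wire upstream of the gate is assumed to toggle every cycle, the wire downstream toggles only when the gate is enabled, and all lengths and probabilities are invented for illustration.

```python
def relative_clock_power(upstream_len, downstream_len, enable_prob, cap_per_um=1.0):
    """Activity-weighted wire capacitance: the upstream wire toggles every
    cycle, the downstream wire only when the clock gate is enabled."""
    return cap_per_um * (upstream_len * 1.0 + downstream_len * enable_prob)

# Two hypothetical gate placements: (upstream wire, downstream wire) in um.
balanced = (100, 100)    # shortest total wire length (200 um)
near_root = (40, 180)    # longer total wire (220 um), short always-toggling segment

for enable_prob in (1.0, 0.2):
    p_bal = relative_clock_power(*balanced, enable_prob)
    p_root = relative_clock_power(*near_root, enable_prob)
    best = "balanced" if p_bal <= p_root else "near_root"
    print(f"enable probability {enable_prob}: best placement is {best}")
```

When both sides toggle equally, the placement with the least total wire wins. When the gated side is enabled only 20% of the time, the placement that shortens the always-toggling segment uses less power even though its total wire length is higher.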
Also, clock gating techniques are typically applied at the RTL level before placement and CTS, resulting in over-design of clock trees. Clock restructuring is a post-CTS optimization technique that identifies missed clock gating opportunities. It analyzes enable functions, identifies common sub-terms and can introduce new gates upstream, and intelligently determines the optimal number of gates that still meet setup constraints.
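As a toy illustration of finding common sub-terms, suppose each clock gate's enable is a simple AND of named signals. Signals shared by every gate in a group could then drive a single new gate further upstream. This representation and all names below are assumptions made for the sketch; real restructuring works on general Boolean enable functions and must also confirm that the new gate's enable path still meets setup.

```python
def common_enable_terms(enables):
    """enables maps each clock gate to the set of signals ANDed in its enable.
    Returns the signals shared by every gate (candidate upstream enable)."""
    term_sets = [set(terms) for terms in enables.values()]
    return set.intersection(*term_sets) if term_sets else set()

# Hypothetical enable functions for three clock gates in the same subtree.
enables = {
    "cg_a": {"block_en", "mode_x"},
    "cg_b": {"block_en", "mode_y"},
    "cg_c": {"block_en", "mode_x", "dbg_off"},
}
shared = common_enable_terms(enables)
print("candidate upstream enable:", sorted(shared))  # ['block_en']
```

Here a new upstream gate enabled by block_en would stop the whole subtree from toggling whenever the block is idle, rather than leaving the shared clock wiring active up to each individual gate.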
A final consideration in low-power physical design is tool capacity. Designers are often forced to implement their larger designs piecemeal, never able to see all the information in context at once. This is of particular concern when optimizing for timing and power in a full-chip context, and when you want to consider multiple corners and modes concurrently during CTS and other steps. Physical design tools should be capable of processing 100 million gates or more, hierarchical or flat, so that you can perform full chip-level optimizations without having to use blackbox abstractions. In addition to providing better design results, this would also greatly simplify data management and speed up the design turnaround time.
Traditional place and route solutions are no longer adequate for reaching design closure for complex, ultra-low power designs. Designs with multiple corners and modes that also use a variety of designed-in power reduction strategies require concurrent MCMM timing and optimization to prevent the classic “ping-pong battle” between power and timing closure. Design teams need a place and route system that provides full support for UPF, a variety of low-power design styles, and power closure based on MCMM analysis and optimization.
By Sudhakar Jilla
Sudhakar is the Marketing Director for Place & Route products at Mentor Graphics. Over the past 15 years, he has held various application engineering, marketing, and management roles in the EDA industry. He was previously responsible for the rollout of several market-leading products and initiatives, such as Pinnacle, Olympus-SoC, and Design-for-Variability at Sierra Design Automation, and Physical Compiler and Galaxy-SI at Synopsys.