Historically, signal processing algorithms were implemented in one of two ways: on a DSP processor or as a discrete hardware-based design. The DSP offers a simple hardware design process but requires specialized expertise to optimize the software to meet the system performance requirements. The discrete design could easily achieve the required performance but suffered from the time and expertise needed to create an efficient dataflow solution. FPGAs now offer an exciting platform on which to integrate processors and hardware in one programmable device.
With each new generation, the FPGA platform has aggressively increased in both capacity and speed, opening up a new class of applications that require higher levels of integration along with the signal processing capability demanded by most electronic systems today. To address these new requirements, FPGA suppliers have added embedded processors and high-speed communications interfaces, achieving integration previously seen only in ASIC implementations.
The embedded processing capability is provided both by soft processors, such as Xilinx's MicroBlaze and Altera's NIOS, and by diffused processors, such as the Xilinx PowerPC. The soft processors are highly optimized for gate count and offer a low-cost, flexible solution to FPGA designers. However, their clock speed is restricted by the fabric, and they often lack the processing power required to perform signal processing algorithms. The second class of processors is the diffused cores, such as Xilinx's PowerPC 405. This solution offers four to six times the processing performance of the soft processors. Both solutions give designers a great deal of flexibility in implementing effective solutions with high levels of integration and lower cost.
System analysis critical for FPGA-based DSP platforms
With the increased capacity, complexity and communication available in FPGA solutions comes a large increase in the challenges presented to the system designer. High-speed data is now flowing into the FPGA fabric, and greater demands are made on processing capability to meet new system requirements such as video, multimedia, imaging and security. To meet these requirements, the bus and memory architectures must support the data flow, not impede the free flow of data into and out of the FPGA, and sustain the processing of these higher-complexity algorithms.
Modern high-speed buses have many special burst and other access modes that can greatly increase data transfer speeds. Similarly, modern memories have high-speed access modes, such as sequential row accesses, that greatly increase the flow of data into the system. However, these modes come with certain restrictions and assumptions that must be met for the performance to be achieved. Each of these architectures is manageable when implemented by itself; when integrated in the same system, however, it becomes extremely difficult to analyze their interactions. Frequently a system is assembled and booted only for the designer to discover that performance is far below expectations because of unexpected contention, or because data accesses and transfers cannot take advantage of the high-speed operating modes. Without visibility into why the degradation occurs, it is impossible to remedy the situation quickly and predictably. Traditional FPGA hardware techniques such as hardware timers can help identify performance issues, but they do not give the designer an understanding of why the problem is occurring.
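To make this concrete, consider the two loops below, a minimal sketch assuming a generic SDRAM-backed system (the array dimensions and types are illustrative). Both compute the same sum, but only the first walks memory sequentially, letting the bus use burst transfers and the memory controller hit open rows; the second strides across rows and defeats both high-speed modes. A hardware timer would show the second loop running several times slower, but not why.

    #define ROWS 512
    #define COLS 512

    static int image[ROWS][COLS];

    /* Row-major traversal: consecutive addresses, so the bus can use
       burst transfers and the controller keeps hitting open SDRAM rows. */
    long sum_sequential(void)
    {
        long sum = 0;
        for (int r = 0; r < ROWS; r++)
            for (int c = 0; c < COLS; c++)
                sum += image[r][c];
        return sum;
    }

    /* Column-major traversal: each access jumps COLS * sizeof(int) bytes,
       defeating bursts and forcing frequent row re-activation. */
    long sum_strided(void)
    {
        long sum = 0;
        for (int c = 0; c < COLS; c++)
            for (int r = 0; r < ROWS; r++)
                sum += image[r][c];
        return sum;
    }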
Because of this lack of visibility, designers have traditionally over-designed the system architecture or underutilized the opportunities presented by these high-speed components. This approach works as long as there is room for over-design at every stage of the process, but it is wasteful and leads to an expensive and overly complex system design.
What designers need is a system analysis tool that can simulate the assembled system quickly and gather enough information about what is occurring for the designer to pinpoint the root cause of performance problems. This requires modeling and simulation of both the hardware and software components, along with a user interface that can link hardware events, such as cache misses, back to the software that caused them. Similarly, it requires simulation of the various data traffic patterns so the designer can observe why the high-speed modes of buses and memories are not being used effectively.
Algorithm acceleration in hardware
Another design challenge is effectively combining the power of the FPGA fabric and the processors. Traditionally, entire signal processing routines were placed in the fabric because processors could not handle the performance requirements and generating hardware acceleration for these routines was difficult. However, with the increased processing power of the new class of RISC processors, coupled with automated design of hardware acceleration, many algorithms can now be performed by the embedded processor, saving a great deal of design time and ensuring flexibility.
Generating C code for the target algorithm is straightforward; determining the proper mix of hardware and software is not. To develop an efficient partition, the designer needs to understand the time profile and data flow of each segment of the algorithm. With this information, the processing-intensive sections of the algorithm can be moved to hardware, leaving the less critical portions on the processor.
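A simple way to obtain that time profile, sketched below, is to wrap each stage of the algorithm in a timing macro during development. The stage names and trivial loop bodies here are hypothetical stand-ins for the real routines, and on an embedded target a hardware timer or the profiler built into the analysis tool would replace clock().

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical stages of the algorithm; the loops stand in for real work. */
    static volatile long sink;
    static void preprocess(void)    { for (long i = 0; i < 100000; i++)  sink = sink + i; }
    static void filter_kernel(void) { for (long i = 0; i < 5000000; i++) sink = sink + i; }
    static void postprocess(void)   { for (long i = 0; i < 200000; i++)  sink = sink + i; }

    /* Time one stage and print its share of the cycle budget. */
    #define PROFILE(stage)                               \
        do {                                             \
            clock_t t0 = clock();                        \
            stage();                                     \
            printf("%-14s %8ld ticks\n", #stage,         \
                   (long)(clock() - t0));                \
        } while (0)

    int main(void)
    {
        PROFILE(preprocess);
        PROFILE(filter_kernel);   /* expected hot spot -> hardware candidate */
        PROFILE(postprocess);
        return 0;
    }

A profile like this typically shows one or two kernels dominating the cycle count; those are the candidates to move into the fabric.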
An effective method of taking advantage of the FPGA hardware resources is to create special-purpose accelerators that offload large blocks of functionality from the processor. First, this reduces the required data transfers between the processor and the hardware, using the bus more effectively. Second, it enables concurrent execution of the processor and the accelerators.
Since portions of the algorithm run on either the processor or the fabric, there must be an efficient transfer of data from the system space of the processor to the accelerator and back. This communication path is a key design criterion: a large portion of the benefit achieved through hardware acceleration can be lost to an inefficient one.
Autonomous accelerators can process data without RISC processor intervention for thousands to millions of cycles. To exploit this capability further, the algorithm should be structured to enable concurrent processing: because autonomous accelerators run without the aid of the processor, the processor is free to perform other tasks, as the sketch below illustrates.
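The fragment below sketches this pattern from the software side. Every driver name here (acc_fir_start, acc_fir_wait, cache_flush and so on) is a hypothetical stand-in for whatever a generated driver would actually provide; the point is the shape of the interaction: flush the source buffer so the DMA engine sees coherent data, start the accelerator, overlap other work on the processor, and only then synchronize.

    #include <stddef.h>
    #include <string.h>

    /* Stub implementations so the sketch compiles standalone; on real
       hardware these come from the generated driver (all hypothetical). */
    static const short *acc_in; static short *acc_out; static size_t acc_n;
    static void cache_flush(const void *p, size_t len) { (void)p; (void)len; }
    static void cache_invalidate(void *p, size_t len)  { (void)p; (void)len; }
    static void acc_fir_start(const short *in, short *out, size_t n)
    { acc_in = in; acc_out = out; acc_n = n; }            /* would kick off the DMA    */
    static void acc_fir_wait(void)
    { memcpy(acc_out, acc_in, acc_n * sizeof *acc_out); } /* would poll or take an IRQ */
    static void update_display(void) { }                  /* unrelated CPU-side work   */

    void process_frame(const short *in, short *out, size_t n)
    {
        /* Make the input visible to the DMA engine before the burst begins. */
        cache_flush(in, n * sizeof *in);

        /* The autonomous accelerator now runs for many cycles on its own... */
        acc_fir_start(in, out, n);

        /* ...so the processor is free to do concurrent work meanwhile. */
        update_display();

        /* Synchronize, then discard stale cached copies of the output. */
        acc_fir_wait();
        cache_invalidate(out, n * sizeof *out);
    }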
The problem preventing designers from creating these accelerators has been the long and error-prone process of analyzing the architecture options, building the accelerator hardware, creating the drivers to control the accelerator, and then updating the application to use it. What designers need is an automated tool that can take a partitioning decision and a selection of architecture options, and then create the accelerator, drivers and application code.
Design flow for Platform FPGAs
To deal effectively with the complexities of these SOCs, designers need the assistance of system-level design tools such as the Triton tool suite from Poseidon Design Systems. These tools address the challenges of embedded DSP design by raising the level of abstraction and providing methodologies that simplify and enhance the process of system design. Of the many system-level design tools on the market today, two specific tool areas are key to the development of embedded designs: SystemC simulation environments and accelerator synthesis tools.
System analysis and optimization
Designers can use the SystemC environment to co-simulate the time-critical algorithms on the processor and system architecture. This provides key analysis capability during the architectural design phase of the program. Bus utilization and memory performance can be measured to ensure the specified architecture is adequate to meet the demands of the processor and other system components. Throughput and response times can be evaluated for each component, giving the designer a measure of how well the architecture supports the system communication requirements.
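The flavor of such an environment is suggested by the minimal SystemC sketch below: a monitor module samples a bus-busy signal on every clock edge and reports utilization at the end of simulation. This is generic SystemC, not any particular vendor's model; a real tool wraps far richer transaction-level models of the bus, memory and processor, but the principle of instrumenting the simulated architecture is the same.

    #include <systemc.h>

    // Minimal bus-utilization probe: count cycles in which the bus is busy.
    SC_MODULE(BusMonitor) {
        sc_in<bool> clk;
        sc_in<bool> bus_busy;        // driven by the bus model under test
        unsigned long total, busy;

        void sample() {
            ++total;
            if (bus_busy.read()) ++busy;
        }

        SC_CTOR(BusMonitor) : total(0), busy(0) {
            SC_METHOD(sample);
            sensitive << clk.pos();
            dont_initialize();
        }

        ~BusMonitor() {
            if (total)
                std::cout << "bus utilization: "
                          << (100.0 * busy / total) << "%" << std::endl;
        }
    };

    int sc_main(int, char*[]) {
        sc_clock clk("clk", 10, SC_NS);
        sc_signal<bool> busy;
        BusMonitor mon("mon");
        mon.clk(clk);
        mon.bus_busy(busy);

        busy = true;                 // toy stimulus: busy for 30% of the run
        sc_start(300, SC_NS);
        busy = false;
        sc_start(700, SC_NS);
        return 0;
    }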
Along with system design analysis, the SystemC simulation tool provides a powerful optimization aid for the application software. The developer can perform detailed profiling of the algorithms, measuring which portions consume the majority of the cycles. These routines can then be analyzed to reveal opportunities for optimization. The key performance data collected in this process are pipeline events, cache events and SW-HW coherency. The pipeline is critical to the performance of today's RISC processors: they are very efficient as long as the pipeline is running smoothly. It is therefore important to identify pipeline stalls and interlocks so remedies can be applied to keep time-critical routines running cleanly.
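Pipeline interlocks often show up directly in the C source as loop-carried dependencies. In the sketch below (sizes illustrative), the first loop stalls because every addition waits on the result of the previous one, while the second keeps the pipeline full with four independent accumulators; a profiler that counts stall events makes the difference visible long before hardware exists.

    #define LEN 1024
    static float v[LEN];

    /* One accumulator: each add depends on the previous result, so the
       pipeline stalls on the loop-carried dependency every iteration. */
    float sum_serial(void)
    {
        float s = 0.0f;
        for (int i = 0; i < LEN; i++)
            s += v[i];
        return s;
    }

    /* Four independent accumulators keep the execution pipeline full;
       the dependent adds happen only once, after the loop. */
    float sum_pipelined(void)
    {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i < LEN; i += 4) {
            s0 += v[i];     s1 += v[i + 1];
            s2 += v[i + 2]; s3 += v[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }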
An equally important consideration in software optimization is cache operation. The cache is a very efficient mechanism for accessing data in high-speed processors, but care must be taken to limit cache misses, especially in time-critical routines. Cache misses carry large time penalties and can destroy performance when they occur in high-performance loops.
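The classic remedy this kind of analysis points to is restructuring a loop nest for locality. In the sketch below (sizes illustrative), a matrix multiply is tiled so that small sub-blocks of the operands stay resident in the cache and are reused many times before eviction, instead of streaming the whole of B through the cache for every row of A.

    #define N    256
    #define TILE 32   /* chosen so three TILE x TILE blocks fit in the cache */

    static float A[N][N], B[N][N], C[N][N];

    /* Tiled multiply: each (i0, j0, k0) block reuses cached sub-blocks
       of A and B many times before they are evicted. */
    void matmul_tiled(void)
    {
        for (int i0 = 0; i0 < N; i0 += TILE)
            for (int j0 = 0; j0 < N; j0 += TILE)
                for (int k0 = 0; k0 < N; k0 += TILE)
                    for (int i = i0; i < i0 + TILE; i++)
                        for (int j = j0; j < j0 + TILE; j++) {
                            float acc = C[i][j];
                            for (int k = k0; k < k0 + TILE; k++)
                                acc += A[i][k] * B[k][j];
                            C[i][j] = acc;
                        }
    }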
The last major feature is SW-HW coherency: linking hardware events occurring both inside and outside the processor with the software. This is a powerful tool for evaluating cause and effect, and maintaining this link yields large savings in development time. Developers can select a critical software event and quickly identify the contributing hardware activity. This gives the designer a link between the software and hardware domains and a way to properly evaluate the functioning of both.
Another important analysis benefit of the SystemC tool is early performance verification. One of the benefits of an FPGA system is the ability to run real data through the system during the design phase of the program. While this is a key part of the verification flow, the SystemC environment provides a much earlier look at system functionality and performance. To get an adequate look at performance on the FPGA platform itself, the system must be operational, which, depending on complexity, may not happen until long after the system design is completed. Performing performance verification at the beginning can eliminate system redesign, saving months off the program schedule.
Hardware accelerator synthesis
Hardware accelerator generation tools are an excellent companion to the system-level simulation environment. These tools provide an automated flow for generating hardware acceleration for the more processing-intensive portions of DSP algorithms. Requirements differ from one DSP algorithm to another, so a flexible architecture is important for handling these routines effectively. Tradeoffs must be made between gate count and accelerator speed, and the tools should give the user the control to right-size the accelerator to fit the system requirements.
Beyond simply performing the computation of the selected algorithm, system issues play a major role. Multiple architectural solutions need to be available to make the best use of the resources consumed. For instance, in the Triton Builder-PowerPC the user can select either a bus-based unit, which uses a DMA engine to burst data into the accelerator, or an APU interface, which transfers the data under processor control. Each has its advantages depending on the size of the input data, the latency requirements and the size constraints of the solution. The design of these systems is complex and would take an engineer considerable time to implement. The automated tools let the designer select and implement an algorithm in either of these architectures with the push of a button.
C programming for FPGA hardware
Every automatic synthesis tool places restrictions on the C code that can be converted to hardware. C is a very versatile language and gives the programmer a large degree of freedom, which poses a problem for tools that generate hardware. C allows structures, pointers and program discontinuities that have no reasonable equivalent in the fixed hardware of a programmable fabric. Restrictions on the C code are therefore inevitable, but it is important to have a tool that supports a structure compatible with current software design methodology. It makes little sense to force a severe structure on programmers, since the resultant code starts looking a lot like the RTL it is meant to replace. Another feature found in these tools is a C-linting function, which identifies incompatible code and flags it with warnings. This is a very useful feature that can speed the generation of synthesizable code.
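The kinds of constructs involved are illustrated below; the coding rules of any particular tool will differ, so this is only representative. The first FIR routine uses fixed loop bounds and plain array accesses and maps naturally onto multipliers and registers; the second depends on heap allocation, a loop bound known only at run time and an indirect call through a function pointer, none of which has a sensible equivalent in fixed hardware.

    #include <stdlib.h>

    #define TAPS 16

    /* Synthesis-friendly: fixed bound, static storage, simple array
       indexing -- maps directly onto MAC units and registers. */
    int fir_ok(const short x[TAPS], const short h[TAPS])
    {
        int acc = 0;
        for (int i = 0; i < TAPS; i++)
            acc += x[i] * h[i];
        return acc;
    }

    /* Typically rejected by synthesis tools: */
    int fir_bad(const short *x, const short *h, int n, int (*post)(int))
    {
        int *prod = (int *)malloc(n * sizeof *prod);  /* no heap in the fabric */
        int acc = 0;
        for (int i = 0; i < n; i++)                   /* data-dependent bound  */
            prod[i] = x[i] * h[i];
        for (int i = 0; i < n; i++)
            acc += prod[i];
        free(prod);
        return post(acc);                             /* indirect call         */
    }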
The purpose of an automated partitioning tool is to increase the productivity of the design team. A quality tool needs to automate as many tasks as possible while giving the designer the flexibility to ensure a quality solution. The selection of the partition needs to remain under the control of the user; designers are in the best position to determine the proper mix of hardware and software, and this mix interacts with other portions of the system unknown to the tool. Once the partition is selected, the tool can determine the communication schedule, partition the existing code and generate the RTL for the new hardware. All drivers, test benches and device programming should also be handled by the tool. It is critical to develop a tight design loop between accelerator synthesis and performance verification; it is very inefficient to develop a critical piece of hardware and not verify its performance until the RTL stage of development. As previously mentioned, the SystemC simulation environment provides an excellent platform for this early performance verification.
As discussed, system-level design tools bring a powerful set of resources to the design of embedded SOCs using FPGAs. The simulation capability abstracts away the details, giving the designer a clear view of the performance of the target architecture. The accelerator generation tools open up new classes of applications that can benefit from embedded architectures. These tools not only greatly enhance the quality of the resultant solutions but also provide large savings in the time to market of the entire program. With them, system design, optimization and acceleration can be reduced from months to weeks.
By Bill Salefsky and Stephen Simon, Poseidon Design Systems, Inc.
Go to the Poseidon Design Systems, Inc. website to learn more.