Love Pulse
Posted August 31, 2019

The ability to automatically generate optimized hardware from software was one of the primary tenets of system-level design automation that was never fully achieved. The question now is whether that will ever happen, and whether it is just a matter of having the right technology or motivation to make it possible.

While high-level synthesis (HLS) did come out of this work and has proven to be highly valuable, its input is not software. Rather, it is a more abstract description of hardware. Most HLS tools can take C, SystemC or C++ as input.

“The common complaint about C synthesis has been and remains that it is not ‘really’ C or C++,” says Chris Jones, vice president of marketing for Codasip. “That the language is so instrumented with custom syntax that by the end of it, the user may as well be writing in Verilog.”

When talking about synthesis from C/C++, the term has two meanings. “Technically C++ is a ‘language,’ but in the context of ‘C/C++ to hardware’, it also means the algorithmic abstraction level,” says Dave Pursley, product management director at Cadence. “While HLS will synthesize the algorithms you throw at it, in order to get good hardware you need to write the algorithm from the hardware’s point of view.”

That means that the original C code often has to be modified. “Algorithms are usually written in C/C++ but are lacking the SystemC details necessary for hardware implementation,” says Rob van Blommestein, head of marketing at OneSpin Solutions. “Even the SystemC models are often too abstract for high-level synthesis and must be refined before they can be transformed into hardware. Such models, especially when written by software teams, rarely reflect important aspects of the hardware, such as pipelines, limited internal storage and bounded memory bandwidth.
”

And this is where the difficulties start, because hardware and software engineers have different outlooks and levels of experience.

“Consider a complex multi-stage image filter,” says Cadence’s Pursley. “In pure software, the C++ will likely be several functions, one per filter stage, communicating by copying frames or passing pointers to frames. However, that isn’t the algorithm as it will be performed by optimized hardware. Usually, hardware will attempt to minimize storage, avoiding costly frame buffers as much as possible. Instead, to get good QoR, the hardware filters will pass around pixels, perhaps via line buffers if it’s a 2D filter.”

One of the tricks to getting the best performance from the hardware implementation is optimizing the parallelism in the application, said Silexica’s Inkeles. But this requires insight into the algorithm to guide the HLS tools about how to implement the parallelism. By using a tool to detect the implicit parallelism in the design and refactor the code to make it explicit, a developer can maximize the performance of the algorithm when implemented in hardware.

Data storage and movement are critical. “I can have tremendous amounts of compute, but I must have the ability to have data that supports that compute,” says Johnson. “This means if I do not have sufficient bandwidth into memory or sufficient access into memory, then I will have a loop that can operate extremely efficiently and fast. But if I am stalled by my data access, then I will not be able to deliver that performance.”

Many algorithms are constrained by memory access performance. “It is usually not computational time that is the bottleneck but getting the data to the computational units that takes most of the time and energy,” adds Klein. “Again, this is the responsibility of designers and not yet compilers.”

But there is an even bigger barrier and problem for the software engineers. “High-level synthesis does not have global visibility,” he says.
“While it does a good job with the modules it is used on, it cannot take into account activity in other modules, such as those not using high-level synthesis, or in software.”

Pursley puts that in more concrete terms. “HLS is essentially a different type of compiler, so as a rule of thumb, the C++ must be fully resolved at compile time. It explicitly enforces that there are no hidden assumptions about timing synchronization between blocks. For example, you can’t have unbounded malloc()s, because in the end, you need to implement a memory of some physical size on actual silicon.”

That restriction does not sit well with software engineers. “If I am a software developer, what I want is to say malloc() on a variable, I want memory to get allocated, I want to point to that data and I want the data to return,” says Johnson. “With HLS I have to architect a memory structure based on what I am doing with the HLS tool. That is an extremely complex and difficult thing to do, and that is why it often takes 9 or 12 months to develop something. This is a tremendous burden for someone who writes software. They are unable to develop a memory structure.”

Others report similar hurdles. “There are several fundamental challenges to converting software into hardware using HLS. First, the code must be synthesizable by the HLS engine,” said Jordon Inkeles, vice president of product at Silexica. “This means that the code must be written or refactored into a hardware synthesizable format, which is not trivial for the software engineer that is accustomed to writing in standard C/C++. The coding synthesizability guidelines for HLS are substantial and an engineer must become familiar with these guidelines, covering hundreds of pages of documentation.”

“Once the code is synthesizable it also requires a degree of awareness of the underlying hardware,” Inkeles said, adding that “when you put something into hardware it is important to consider resource usage.
For example, when writing software to run on a processor you may just use floating point for all datatypes and not incur a significant penalty, but when implemented in hardware it leads to extra consumption of valuable hardware resources, which can be costly and also decrease the performance of the algorithm.”

So, is synthesis from software a fantasy? “High-level synthesis, be it from C++, SystemC, Matlab or other behavioral definition, has come a long way in the past decade,” says Stuart Biles, fellow and director of research architecture at Arm. “It’s a good candidate for a number of applications, particularly data processing pipelines that you might find in radio, audio or image processing domains. When coupled with an FPGA, it facilitates a rapid prototyping, verification and optimization design cycle with feedback from observed performance in development systems prior to deployment.”

It also has seen acceptance in other application spaces. “Other high-level description languages are being used, and gaining market acceptance, from established languages like NML from Synopsys or Cadence Tensilica’s TIE, to new languages such as Chisel and Codasip’s own CodAL,” says Jones. “These are best defined as ‘processor description languages’ and are used in a variety of ways, but mainly to customize instruction sets of established architectures or to design processors from the ground up.”

Market pressure is increasing. “The slowdown of Moore’s Law means that processors in general can no longer continue to drive performance like they have over the past 30 to 40 years,” says Clay Johnson, chief executive officer for CacheQ. “Other technologies are required. Those other technologies are GPUs, custom devices in areas such as machine learning and FPGAs. It is known that FPGAs can certainly accelerate things, but the development environment has traditionally, and is today, focused on hardware developers rather than software developers.
”

The makeup of many development teams is changing. “One customer told me how many people they had on staff who can write RTL, basically zero, and how many can write C/C++, around 30,000,” says Raymond Nijssen, vice president and chief technologist at Achronix. “There are dinosaurs among us who say they can do better than a tool, but they are a dying breed. They are being replaced by people who want to push the button and don’t care if the solution is perfect. What is important to them is that they get a result fairly quickly. The turnaround times are beginning to drive this more and more. So, you pay a price in terms of efficiency and let the tools give you a result much faster.”

CacheQ’s Johnson agrees. “If I am a hardcore FPGA person, will I be able to take a function and design it and get a smaller area and higher performance? Probably yes. If you look at large FPGAs, I would stipulate that the hardware person is not going to be able to sufficiently understand all of the algorithmic details enough to implement that at the detailed level.”

In the past, many co-processor solutions have been dogged by communications costs. “There was no good communication protocol between the accelerators that had low enough latency to justify moving the workload from one processing element to another,” points out Nijssen. “If it takes a millisecond to move some matrix from one place to another, then the CPU would have been able to complete the task in the same time with less energy due to the time and cost of moving things around.”

“Embedded FPGA is going to be interesting in this space,” says Russell Klein, HLS platform program director at Mentor, a Siemens Business. “If the FPGA is a separate device, going off-chip to the FPGA might nullify much of the benefit of moving from software to a hardware implementation. Keeping the FPGA on the SoC will mitigate those problems.”

Finding the right candidate

Not all software is suitable for mapping into hardware.
“It’s possible to map something as complex as a processor from a high-level specification, though currently the results will not be optimal,” says Arm’s Biles. “Concepts like forwarding and control flow aren’t mapped quite as well as they could be. Data processing is one area where there are fewer control decisions, and this makes it easier to map effectively. Applications without huge numbers of configuration parameters or multiple modes of operation are easier for the designer to be satisfied that an efficient implementation has been synthesized and common functionality not duplicated.”

Performance comes from taking what is a serial process and exploiting concurrency. “The way you get performance is accelerating loops,” says Johnson. “If you look at C code, the vast majority of the time for complex algorithms is spent in loops. The way to accelerate a loop is to make the loop fully pipelined. In a processor, if I have a loop and that loop is n iterations and it takes C cycles to complete, then roughly the amount of time that it takes in a processor is (n × C) / clock rate. If you fully pipeline a loop, then the amount of time is (n + C) / clock rate.”

This is one of the places where HLS adds value. “High-level synthesis tools do a pretty good job of finding and introducing concurrency,” says Mentor’s Klein. “In cases where it cannot extract concurrency, due to the nature of the algorithm description, it will identify the dependencies for the developer, enabling them to manually modify the code to increase concurrency.”
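
Johnson's rule of thumb for loop timing, (n × C) / clock rate for a processor versus (n + C) / clock rate for a fully pipelined loop, can be tried out with a short back-of-the-envelope script. This is only a sketch: the function names, the iteration count, the cycle count and both clock rates are made-up values for illustration, not figures from the article. It also assumes that "fully pipelined" means a new iteration can start every clock cycle, and that C is the number of cycles one iteration takes.

```python
def sequential_seconds(n, c, clock_hz):
    """Rough time when each of n iterations takes c cycles, one after another.

    Follows the quoted approximation (n x C) / clock rate.
    """
    return (n * c) / clock_hz


def pipelined_seconds(n, c, clock_hz):
    """Rough time when the loop is fully pipelined.

    Follows the quoted approximation (n + C) / clock rate, assuming a new
    iteration enters the pipeline every cycle.
    """
    return (n + c) / clock_hz


def cycle_speedup(n, c):
    """Ratio of sequential to pipelined cycle counts at the same clock rate."""
    return (n * c) / (n + c)


if __name__ == "__main__":
    # All numbers below are hypothetical, chosen only to show the trend.
    n_iterations = 1_000_000
    cycles_per_iteration = 20
    cpu_clock_hz = 3.0e9    # assumed processor clock
    fpga_clock_hz = 300e6   # assumed, slower, FPGA fabric clock

    t_cpu = sequential_seconds(n_iterations, cycles_per_iteration, cpu_clock_hz)
    t_fpga = pipelined_seconds(n_iterations, cycles_per_iteration, fpga_clock_hz)

    print(f"sequential on CPU : {t_cpu * 1e3:.3f} ms")
    print(f"pipelined on FPGA : {t_fpga * 1e3:.3f} ms")
    print(f"cycle-count ratio : {cycle_speedup(n_iterations, cycles_per_iteration):.2f}x")
```

For large n the cycle-count ratio approaches C, so the benefit of pipelining grows with the amount of work each iteration does. The model ignores memory stalls and data transfer time, which other parts of the article identify as the usual bottleneck.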