
[Hardware] How to deploy multi-hundred-gigabit networking on commodity hardware



How can packets be processed on commodity hardware at multi-hundred-gigabit-per-second rates? In collaboration with the KTH Royal Institute of Technology, Ericsson Research sought to optimize the performance of network functions running on commodity hardware, enabling them to run at multi-hundred-Gbps rates. Read their findings here.


The need for flexibility, faster time to market, and lower deployment costs is driving the trend towards Network Function Virtualization (NFV) and the realization of network functions on commodity hardware rather than on specialized, proprietary hardware. Further benefits of NFV include easier prototyping and better scaling and migration.

However, commodity hardware was not explicitly designed for packet processing. This can cause performance issues for NFV, particularly at high rates (such as 100/200/400 Gbps), where a new packet arrives every few nanoseconds. The challenge is compounded by the fact that processor performance is no longer increasing at its historical rate.
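To put the time budget in perspective, here is a back-of-the-envelope calculation (ours, not from the original study): a minimum-size 64-byte Ethernet frame occupies 84 bytes on the wire once the preamble, start-of-frame delimiter, and inter-frame gap are counted, which yields the per-packet budget at each link rate:

    #include <stdio.h>

    int main(void) {
        /* Minimum-size Ethernet frame: 64 B of frame, plus 7 B preamble,
         * 1 B start-of-frame delimiter, and 12 B inter-frame gap = 84 B on the wire. */
        const double wire_bytes = 84.0;
        const double rates_gbps[] = { 100.0, 200.0, 400.0 };

        for (int i = 0; i < 3; i++) {
            double pps = rates_gbps[i] * 1e9 / (wire_bytes * 8.0); /* packets per second */
            printf("%3.0f Gbps: %5.1f Mpps, %.2f ns per packet\n",
                   rates_gbps[i], pps / 1e6, 1e9 / pps);
        }
        return 0;
    }

At 100 Gbps this leaves roughly 6.7 ns per minimum-size packet, around 20 cycles on a 3 GHz core and well below the latency of a single main-memory access; at 400 Gbps the budget shrinks to about 1.7 ns.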

Through joint research between Ericsson and KTH Royal Institute of Technology, we looked for new ways to achieve high performance on top of commodity hardware and at really high rates. We concluded that optimizing both the software stack and the underlying hardware is necessary for realizing high performance network functions. Our research results show that for the first time, software-based network functions deployed on top of commodity hardware can process packets at more than 100 Gbps while only using one core.

More specifically, we have analyzed the impact of the following factors on the performance of high-speed network functions: 

efficient hardware I/O management
holistic software optimizations
The former, I/O management, focuses on optimizing the capabilities already available in current hardware, a necessary step before turning to software optimizations. The latter builds on efficient hardware I/O management and performs whole-stack optimization of network functions so that they use the underlying hardware more efficiently.

Optimized cache management for I/O intensive applications
In a previous blog post, we showed how to squeeze the maximum performance out of commercial off-the-shelf (COTS) hardware with careful memory management. Building on that work, we evaluated the impact of the processor's cache management on the performance of I/O-intensive applications when evolving towards multi-hundred-Gbps rates. We systematically studied the impact of data transfers between I/O devices and the processor's cache, also known as Direct Cache Access (DCA), when processing packets at high rates.

Direct Cache Access
DCA is a technique that enables I/O devices to send their data directly to the processor’s cache rather than main memory. The latest implementation of DCA in Intel processors is Data Direct I/O technology (DDIO), illustrated in the figure below. Using DDIO avoids expensive memory accesses and therefore improves performance.


Fig 1. Data Direct I/O Technology (DDIO) sends I/O data directly to the processor cache.

We identified and demonstrated that moving toward faster link rates makes DDIO less efficient. The reason is that DDIO can only use a limited portion of the processor's cache, since it has to share the cache with other applications, so at higher speeds DDIO cannot accommodate all I/O transfers in the cache. Finding a solution requires a full understanding of how applications interact with DDIO.

We investigated this issue in depth and compiled a set of guidelines that enable application developers to fine-tune DDIO for their application and, as a result, achieve suitable performance at multi-hundred-Gbps link rates. For instance, our investigation reveals that when an application is I/O intensive, meaning its communication overhead exceeds its computation overhead, it is necessary to carefully size the DDIO capacity to achieve low latency at high rates. Additionally, we argue that in some scenarios it is essential to bypass the cache, either by disabling DDIO or by directing I/O transfers straight to main memory.
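As a concrete illustration of the sizing knob: on recent Intel Xeon processors, the slice of the last-level cache that DDIO may fill is controlled by the IIO LLC WAYS register (MSR 0xC8B, as discussed in the ATC'20 paper; the address and availability vary across processor generations). Each set bit in the register grants DDIO one more cache way. The following sketch, which assumes root privileges and a loaded msr kernel module (modprobe msr), merely reads the current mask:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define IIO_LLC_WAYS 0xC8B /* DDIO way mask; address per the ATC'20 paper, Skylake-era Xeons */

    int main(void) {
        /* Requires root and the msr kernel module (modprobe msr). */
        int fd = open("/dev/cpu/0/msr", O_RDONLY);
        if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

        uint64_t mask;
        if (pread(fd, &mask, sizeof(mask), IIO_LLC_WAYS) != (ssize_t)sizeof(mask)) {
            perror("pread");
            close(fd);
            return 1;
        }
        /* Each set bit is one last-level-cache way that DDIO may fill with I/O data. */
        printf("IIO LLC WAYS = 0x%llx (%d ways available to DDIO)\n",
               (unsigned long long)mask, __builtin_popcountll(mask));
        close(fd);
        return 0;
    }

Writing a wider or narrower bitmask back (for example with wrmsr) raises or lowers DDIO's cache share, which is the capacity-sizing knob the guideline above refers to.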


Fig 2. The impact of the guidelines when processing packets in different scenarios. The total achieved throughput is written on the bars.

The baseline indicates that no optimization is applied to the system. Scenario I shows the results for a carefully sized DDIO, while scenarios II and III show the impact of bypassing the cache via two different methods.

Additionally, our results show that carefully tuned I/O transfers could reduce the packet processing latency by up to 30 percent when receiving traffic at 100 Gbps.
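One concrete way to tune I/O transfers, shown here as an illustrative sketch rather than the paper's exact setup, is to bound the number of in-flight receive buffers so that they fit within DDIO's share of the cache. In DPDK that amounts to choosing a modest RX descriptor ring size when setting up a queue; the port number, ring size, and pool dimensions below are placeholder values:

    #include <stdlib.h>
    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_lcore.h>
    #include <rte_mbuf.h>

    int main(int argc, char **argv) {
        if (rte_eal_init(argc, argv) < 0)
            rte_exit(EXIT_FAILURE, "EAL init failed\n");

        uint16_t port = 0;           /* placeholder: first DPDK-bound NIC */
        uint16_t rx_ring_size = 256; /* small ring bounds the in-flight DMA working set */

        struct rte_mempool *pool = rte_pktmbuf_pool_create(
            "rx_pool", 8192, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
        if (pool == NULL)
            rte_exit(EXIT_FAILURE, "mempool creation failed\n");

        struct rte_eth_conf port_conf = {0};
        if (rte_eth_dev_configure(port, 1 /* RX queues */, 0 /* RX-only sketch */, &port_conf) < 0)
            rte_exit(EXIT_FAILURE, "port configure failed\n");

        /* A 256-entry ring instead of, say, 4096 shrinks the buffer working set
         * that DDIO must keep cached, at the cost of less headroom for bursts. */
        if (rte_eth_rx_queue_setup(port, 0, rx_ring_size,
                                   rte_eth_dev_socket_id(port), NULL, pool) < 0)
            rte_exit(EXIT_FAILURE, "RX queue setup failed\n");

        if (rte_eth_dev_start(port) < 0)
            rte_exit(EXIT_FAILURE, "port start failed\n");

        /* The receive loop (rte_eth_rx_burst) is omitted from this sketch. */
        return 0;
    }

The trade-off is the one the guidelines capture: fewer in-flight buffers keep I/O data cache-resident and latency low, while larger rings tolerate bursts better but push DDIO toward evictions.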

For full details, please refer to our ATC’20 paper, Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks.

 

Whole-stack optimizations for software packet processing
Most network functions can be realized via general-purpose packet processing frameworks, which provide simplicity and flexibility. These frameworks use the same codebase to build different network functions, simply adapting the software dynamically at runtime. However, this way of realizing network functions comes at the cost of lower performance than specialized software.

Moreover, these frameworks fail to take full advantage of the underlying commodity hardware, as the code/binary executable is not specialized for a given network function and therefore contains many software inefficiencies. We have proposed a new system, called PacketMill, that optimizes general-purpose packet processing frameworks to achieve (near-)optimal performance while maintaining simplicity and flexibility.


PacketMill’s objective is to produce an optimized binary executable while maintaining high-level modularity and flexibility of general purpose packet processing frameworks. To do so, PacketMill grinds the whole packet processing stack, from the high-level network function configuration file to the low-level user space network drivers, to mitigate inefficiencies and produce a customized binary for a given network function. More specifically, PacketMill performs the following optimizations:

Introduces a new way to facilitate the delivery of packets to network functions from the driver.
Modifies the source code of the packet processing framework for a given network function (see the sketch below).
Exploits modern modular compiler toolchains to perform link-time optimizations and reorder the important data structures used in packet processing frameworks. 
Using these three optimizations, PacketMill can use the underlying hardware more efficiently, enabling per-core 100-Gbps networking. Moreover, it achieves better performance than other packet processing frameworks, such as FastClick, BESS, and VPP.
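To give a flavor of the second optimization, the hypothetical sketch below contrasts a generic framework, where the pipeline of elements is assembled at runtime and traversed through function pointers, with a source-specialized build where the element graph is baked in at compile time, so the compiler can inline across elements. The element names and packet layout are illustrative, not PacketMill's actual code:

    #include <stdbool.h>
    #include <stddef.h>

    struct packet { unsigned char *data; size_t len; };

    /* Generic framework: the pipeline is a runtime-built chain of elements,
     * so every hop is an indirect call the compiler cannot inline or
     * optimize across. */
    typedef bool (*element_fn)(struct packet *);

    static bool decap(struct packet *p)   { p->data += 14; p->len -= 14; return true; } /* strip Ethernet header */
    static bool filter(struct packet *p)  { return p->len >= 20; }                      /* drop short packets */
    static bool forward(struct packet *p) { (void)p; return true; }                     /* hand off to TX */

    static bool run_generic(struct packet *p, element_fn *chain, size_t n) {
        for (size_t i = 0; i < n; i++)
            if (!chain[i](p)) /* one indirect call per element, per packet */
                return false;
        return true;
    }

    /* Specialized build: the same pipeline with the element graph known at
     * compile time; direct calls collapse into one straight-line function. */
    static bool run_specialized(struct packet *p) {
        return decap(p) && filter(p) && forward(p);
    }

    int main(void) {
        unsigned char buf[64] = {0};
        struct packet p = { buf, sizeof buf };
        element_fn chain[] = { decap, filter, forward };
        bool generic_ok = run_generic(&p, chain, 3);

        struct packet q = { buf, sizeof buf };
        bool specialized_ok = run_specialized(&q);
        return (generic_ok && specialized_ok) ? 0 : 1;
    }

In the generic version, every packet pays for indirect calls and for the cross-element optimizations the compiler must forgo at each boundary; the specialized version removes both costs, which is analogous to what PacketMill achieves by regenerating and recompiling the framework's source for one specific configuration.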

packet-size-1310949294dcf6156ce9a402f613

Fig 3. Performance improvements achieved by PacketMill when applying whole-stack optimizations to software packet processing.

Full details can be found in our ASPLOS’21 paper PacketMill: Toward Per-Core 100-Gbps Networking.

 
