Editing Core Offloads (section)

== Appendix B: Validation ==

This section lists an optional testsuite that may help demonstrate conformance to the specification. The testsuite covers many, but not all, features. Contributions to extend and improve the testsuite are strongly welcomed.

The specification is operating system independent, but the software tests are not. All use Linux. All tools are open source and freely available.

<span id="configuration"></span>
=== Configuration ===

The following features can be configured on Linux using ethtool. The following setters should succeed and the getters should confirm the setting.

Reading back configuration settings does not verify that the behavior matches the stated configuration. This is a necessary, but not sufficient test. The subsequent section will present functional behavior tests.

<span id="queue-length"></span>
===== Queue Length =====

The device must support queue lengths between 512 and 4096 slots.

<pre>For VAL in 512 1024 4096:


    ethtool -G eth0 rx $VAL tx $VAL


    ethtool -g eth0</pre>
It should allow reconfiguration without bringing the link down. Link changes can be counted with

<pre>cat /sys/class/net/eth0/carrier_up_count</pre>
<span id="queue-count"></span>
===== Queue Count =====

The device must support up to 1K queues. It must support independent configuration of receive and transmit queue count.

<pre>For VAL in 1 128 1024:


    ethtool -L eth0 rx $VAL tx $VAL


    ethtool -l eth0


    ethtool -L eth0 rx 1 tx $VAL


    ethtool -l eth0


    ethtool -L eth0 rx $VAL tx 1


    ethtool -l eth0</pre>
<span id="interrupt-moderation"></span>
===== Interrupt Moderation =====

The device must support [2, 200] usec delay. It should support [2, 128] events batched.

<pre>For VAL in 1 2 20 32 200:


    ethtool -C eth0 rx-usecs $VAL


    ethtool -c


    ethtool -C eth0 tx-usecs $VAL


    ethtool -c




For VAL in 1 2 20 24 128:


    ethtool -C eth0 rx-frames $VAL


    ethtool -c


    ethtool -C eth0 tx-frames $VAL


    ethtool -c</pre>
<span id="receive-side-scaling-1"></span>
===== Receive Side Scaling =====

The device must support an indirection table with 1K slots. It may support non-equal weight load balancing

<pre>ethtool -L eth0 rx 1024


ethtool -X eth0 equal 1024


ethtool -x eth0


ethtool -X eth0 equal 2


ethtool -x eth0


ethtool -X eth0 weight 2 1


ethtool -x eth0</pre>
It must allow reconfiguration without bringing the link down. Link changes can be counted with

<pre>cat /sys/class/net/eth0/carrier_up_count</pre>
<span id="programmable-flow-steering-1"></span>
===== Programmable Flow Steering =====

Programmable flow steering is an optional feature. But if supported on Linux it must implement this standard interface

<pre>ethtool -X eth0 equal 1 # steer all non-matching traffic to queue 0

ethtool -K eth0 ntuple on

ethtool -N eth0 flow-type tcp6 src-port 8000 action 1</pre>
ethtool -N eth0 flow-type tcp4 src-port 8000 action 2

<span id="offloads"></span>
===== Offloads =====

The device must advertise the following features and they must be configurable:

<pre>For VAL in off on off:


ethtool -K eth0 rx $VAL # rx checksum offload


ethtool -K eth0 tx $VAL # tx checksum offload


ethtool -K eth0 tso $VAL


ethtool -K eth0 tx-udp-segmentation $VAL</pre>
The device should advertise these offloads as well:

<pre>For VAL in off on off:


ethtool -K eth0 rxhash $VAL


ethtool -K eth0 rx-gro-hw $VAL</pre>
The device may advertise flow steering (including ARFS) as well:

<pre>For VAL in off on off:


ethtool -K eth0 ntuple $VAL</pre>
In all cases the configuration must be tested to be successfully reflected in ethtool -k.

<span id="jumbogram-segmentation-offload-1"></span>
====== Jumbogram Segmentation Offload ======

In Linux, jumbogram segmentation offload is enabled by explicitly configuring a maximum size that exceeds 64KB:

ip link set dev eth0 gso_max_size 262144

<span id="link-layer"></span>
===== Link Layer =====

The device must support L2MTU up to 9216, which is L3MTU of 9198. It should optimize for 4KB payload: 4096 + 40B IPv6 + 20B TCP + 12B TCP options (RFC 7323 timestamp). TCP options are included in MSS calculation [ref_id:rfc_6691], so this gives an MSS of 4108 and L3MTU of 4168.

<pre>For VAL in 1500 4168 9198:


    ip link set dev eth0 mtu $VAL</pre>
The device must support 64 concurrent unicast and multicast MAC filters:

<pre>For BYTE in {1..64}:

    ip link add link eth0 dev eth0.$BYTE address 22:22:22:22:22:$BYTE type macvlan</pre>
The device must support promiscuous (all addresses) and allmulti (all multicast addresses) modes:

<pre>For VAL in off on off:


ip link set dev eth0 promisc $VAL


ip link set dev eth0 allmulticast $VAL</pre>
<span id="functional"></span>
=== Functional ===

Some features are covered by open source functional tests. This testsuite is partial. We strongly encourage organizations to open source their test suites and suggest these for inclusion here.

<span id="receive-header-split-1"></span>
===== Receive Header-Split =====

Header-split is not a user-facing feature. But, we can verify its implementation by testing a feature that is dependent on HS. Linux TCP_RECEIVE_ZEROCOPY, as described in the MTU section, depends on header-split to store payload page-aligned in page-sized buffers. TCP_ZEROCOPY_RECEIVE further requires the administrator to choose MTU such that MSS is 4KB plus room for expected TCP options, and by posting 4KB pages as buffers to the device receive queue.

Linux v6.3 '''tools/testing/selftests/net/tcp_mmap''' tests this feature. Limitations are that the test does not differentiate between header-split and fixed-prefix split, and does not cover all possible TCP options.

The '''hsplit_packets''' device counter must also match the expected packet rate of the test.

<span id="toeplitz-receive-hash"></span>
===== Toeplitz Receive Hash =====

Linux v6.3 '''tools/testing/selftests/net/toeplitz.sh '''verifies the Toeplitz hash implementation by receiving whole packets up to userspace (using PF_PACKET sockets), along with the hash as returned by the device. The test manually recomputes the hash in software and compares the two.

Microsoft RSS documentation~[ref_id:ms_rss] lists a set of example inputs with expected output hash values. The software implementation in Linux kernel tools/testing/selftests/toeplitz.c and the example code in this spec have been verified to pass that test. A vendor may test these exact packets, or test arbitrary packets against the toeplitz.c software implementation.

<span id="receive-side-scaling-2"></span>
===== Receive Side Scaling =====

Linux v6.3 '''tools/testing/selftests/net/toeplitz.sh''' also verifies RSS queue selection if argument -rss is passed. In this mode it uses Linux PACKET_FANOUT_CPU to detect the CPU on which packets arrive. Beyond the Toeplitz algorithm, this test also verifies the modulo operation that the device must use to convert the 32-bit Toeplitz hash value into a queue number.

<span id="checksum-offload"></span>
===== Checksum Offload =====

Linux v6.3 '''tools/testing/selftests/net/csum.c''' verifies receive and transmit checksum offload. On receive, it tests both correct and corrupted checksums. On transmit, it tests checksum offload, UDP zero checksum conversion, checksum disabled and transport mode encapsulation. The test covers IPv4 and IPv6, TCP and UDP. The process is started on two machines. See comments in the source file header for invocation details.

<span id="tcp-segmentation-offload-1"></span>
===== TCP Segmentation Offload =====

'''github.com/wdebruij/kerneltools/blob/master/tests/tso.c''' can deterministically craft TSO packets of specific size and content. A full testsuite that uses this to test all interesting cases is a work in progress.

Common case TSO packets can be tested by generating a TCP stream and measuring byte- and packet counts.

This requires care: packet count as reported by the device must report counts after segmentation, so cannot be used to report the number of TSO packets. On Linux, packets observed with tcpdump or similar tools are observed before TSO, but also before possible software segmentation, so cannot be trusted. The correct counter is the device '''lso_counter'''. Additionally it is possible to count the number of ndo_start_xmit invocations using ftrace.

<span id="udp-segmentation-offload-1"></span>
===== UDP Segmentation Offload =====

Linux v6.3 '''tools/testing/selftests/net/udpgso_bench_[rt]x.c''' implements both a sender and receiver side for a two-machine test. It can send UDP traffic with and without USO and thus can be used to exercise the USO hardware block.

Tests can be performed with various optional features at multiple segmentation sizes. We recommend testing boundary conditions along with common MSS (1500B, 4K, jumbo). The accompanying '''udpgso.sh''' and '''udpgso_bench.sh''' scripts list the relevant boundary test cases (but exercise the software implementation themselves).

<span id="receive-segment-coalescing"></span>
===== Receive Segment Coalescing =====

Linux v6.3 '''tools/testing/selftests/net/gro.c''' implements both a sender and receiver side for a two-machine test. The sender sends a train of packets in quick succession, the receiver verifies that they are coalesced or not, in accordance with the RSC rules. Test cases cover IP (e.g., ToS), TCP (e.g., sequence number) and more. RSC tests are inherently timing dependent. We recommend testing with the maximum RSC context timeout available.

<span id="earliest-departure-time-1"></span>
===== Earliest Departure Time =====

Linux v6.3 '''tools/testing/selftests/net/so_txtime.c '''implements both a sender and receiver side for a transmit test. The sender transmits UDP packets with an earliest delivery time configured with the SO_TXTIME socket option. The receiver validates arrival rate against expectations.

Testing delivery time requires tight clock synchronization. The test can be run on a single machine. Either with two devices, or with a single device by configuring two IPVLAN virtual devices in separate network namespaces. Aside from exercising physical hardware, the setup is similar to that in the accompanying '''so_txtime.sh''', which only differs by using entirely virtual devices to test software pacing.

Testing delivery time requires packets arriving with EDT timestamp unmodified at the driver. This requires EDT support from the Linux queueing discipline (qdisc). '''so_txtime.sh''' demonstrates software EDT with the FQ and ETF qdiscs. ETF also supports hardware offload in Linux v6.3. At the time of writing hardware offload for FQ is in development.

<span id="timestamping"></span>
===== Timestamping =====

Linux v6.3 '''tools/testing/selftests/net''' contains multiple tests that exercise the control and datapath Linux timestamping APIs, SIOC[GS]HWTSTAMP and SO_TIMESTAMPING. But it currently lacks a true regression test for hardware timestamps. This is a gap in this test suite that still needs to be completed.

'''github.com/wdebruij/kerneltools/blob/master/tests/tstamp.c''' requests and parses software and hardware, receive and transmit, IPv4 and IPv6, TCP and UDP timestamps. It needs to be run for all these cases.

<span id="correctness"></span>
====== Correctness ======

Failure to maintain causality can be detected, both for multiple packets at the same measurement point, and for a single packet at multiple measurement points. Clock drift can be measured. Open source conformance tests for these invariants are also future work.

<span id="telemetry"></span>
===== Telemetry =====

Packet and byte counters can be observed with <code>ip -s -s link show dev $DEV</code>. At a minimum, these counters must be verified to match the actual data transmission. One way is to compare this data with data obtained with ftrace instrumentation on ndo_start_xmit.

<span id="performance"></span>
=== Performance ===

<span id="benchmark-suite"></span>
==== Benchmark Suite ====

Hardware limits are demonstrated at the transport layer where possible, as this is the highest application independent layer.

This TCP/IP transport layer testsuite uses neper [ref_id:neper] as the benchmark tool. Neper is similar to other transport layer benchmarks tools, such as netperf and iperf. It differentiates itself by having native support for scaling threads and flows, aggregate statistics reporting including median and tail numbers, and epoll support for scalable socket processing. Neper supports IPv4 and IPv6, TCP and UDP and streaming (“tcp_stream”), echo request/response (“tcp_rr”) and connection establishment (“tcp_crr”) style workloads.

<span id="reproducible-results"></span>
==== Reproducible Results ====

Presented results must be the median of at least 5 runs, with interquartile range (Q3-Q1) less than 10% of the median (Q2). If the interquartile range exceeds this number, it must be listed explicitly. Especially single flow and latency tests may be noisy.

Many factors can add noise on modern systems. Evaluators are encouraged to apply the following system settings to minimize variance:

* Disable CPU sleep states (C-states), frequency scaling (P-states) and turbo modes.
* Disable hyperthreading
* Disable IOMMU
* Pin process threads
* Memory distance: pin threads and IRQ handlers to the same NUMA node or cache partition
** Select the NUMA node to which the NIC is connected
* Move other workloads away from selected cores (e.g., using isolcpus on Linux).
* Disable adaptive interrupt moderation and software load balancing (e.g., Linux RPS/RFS).

<span id="isolating-hardware-and-software"></span>
==== Isolating Hardware and Software ====

Transport and application layer tests exercise both software on the host and the device hardware. One strategy to help disambiguate bottlenecks between the two environments is to run the host at multiple (fixed) clock rates. For latency tests, this will result in measurements

<pre>RTT1 = cpu_cycles * freq1 + hw_pipeline_latency


RTT2 = cpu_cycles * freq2 + hw_pipeline_latency


RTT3 = cpu_cycles * freq3 + hw_pipeline_latency


etc.</pre>
Solving this system of equations generates an estimate of hardware pipeline latency.

<span id="metrics"></span>
=== Metrics ===

<span id="bitrate-1"></span>
==== Bitrate ====

The performance section suggests a standard configuration to demonstrate reaching advertised line rate. This configuration expressed as a neper command is

<pre>    tcp_stream -6 -T 10 -F 10 -B 65536 [-r -w] [-c -H $SERVER]</pre>
Transport layer metrics will report goodput without protocol header overhead. A 100 Gbps device should report approximately 94 Gbps of goodput.

<span id="transaction-rate"></span>
==== Transaction Rate ====

A request/response workload on an otherwise idle system can both test latency at the application level (with a single flow), and small packet throughput (when run with parallel flows and threads).

A single flow avoids all contention and queue build-up. But idle systems may enter low power modes from which wake-up adds latency. Report both single flow and a setup that gives maximum throughput, for instance.

<pre>    tcp_rr [-c -H $SERVER]
    tcp_rr -T $NUM_CPU -F 10000 [-c -H $SERVER]</pre>
Neper reports 50/90/99% application-level latency.

<span id="packet-rate-1"></span>
==== Packet Rate ====

Packet rates at possibly hundreds of Mpps expose software bottlenecks. It is unlikely that advertised minimum packet line rate (PPS) can be demonstrated with transport layer benchmarks like tcp_rr.

A userspace network stack optimized for packet processing, such as DPDK, is reasonable to stress this hardware limit.

A pure Linux solution for packet processing can be built using eXpress Data Path (XDP). Packets must be generated on the host as close to the device as possible. A device that supports AF_XDP, in native driver mode, with copy avoidance and busy polling, has been shown to reach 30 Mpps on a 40 Gbps NIC using the rx_drop benchmark that ships with the Linux kernel. Over 100 Mpps has been demonstrated on 100 Gbps NICs, but these results are not publicly published.

<span id="connection-rate"></span>
==== Connection Rate ====

Neper tcp_crr (“connect-request-response”) can demonstrate connection establishment and termination rate. The expressed target is 100K TCP/IP connections per second, with no more than 100 CPU cores. tcp_crr is invoked similar to tcp_rr, but created a separate connection for each request. Demonstrate with the boundary number of CPUs or fewer. Ideal

<pre>    tcp_crr -T $NUM_CPU -F $NUM_FLOWS [-c -H $SERVER]</pre>
'''Connection Count'''

There is no current test to demonstrate reaching 10M concurrent connection count

<span id="latency-1"></span>
==== Latency ====

Pipeline latency can be measured at the link layer with an echo request/response (“ping”) workload. An upper bound of pipeline latency can be established by measuring RTT between two devices that are directly connected (i.e., without an intermediate switch). In a two-machine test, full RTT is recorded to avoid requiring clocks synchronized at sub microsecond precision. A single machine setup with two devices (or ports on a single device) can report half-RTT. The reported value is an upper bound. 50/90/99 percentiles should be reported.

Low event rate tests can be sensitive to cache and power scaling effects, as discussed in the section on reproducible results. This test alone cannot differentiate Rx from Tx latency.

Hardware timestamps can be used to compute the time spent in the cable (and switch or wider fabric), to reduce the overestimation in the L2 ping test, and to disambiguate Rx from Tx latency.

<span id="appendix-c-revision-history"></span>