Joseph Heenan's blog: 2020

On a project I've been working on we recently replaced various test servers with Intel 10th gen NUCs - or to be exact, the BXNUC10I7FNH1 - though the moral of this story appears to apply to many Intel NUCs and in fact also to a wide range of motherboards that use Intel ethernet controllers.

The servers run Linux, Ubuntu Server 20.04 - though this problem persists across a wide range of Linux distributions and versions.

Shortly after we migrated onto the new servers, we discovered weird networking issues - sometimes the database backup (which copies to a remote host over ssh) would fail with 'Received disconnect from 192.168.1.64: 2: Packet corrupt'.

The running test scripts were sometimes behaving oddly too - they load files over NFS, and sometimes those files would load with one block missing, or would fail to load, or would load but take 30 seconds or more longer than usual.

Investigation revealed messages like this in /var/log/kern.log on the host machine:

Sep  5 05:10:32 atsnuc1 kernel: [1333685.771574] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Sep  5 05:10:32 atsnuc1 kernel: [1333685.771574]   TDH                  <58>
Sep  5 05:10:32 atsnuc1 kernel: [1333685.771574]   TDT                  <6b>
Sep  5 05:10:32 atsnuc1 kernel: [1333685.771574]   next_to_use          <6b>
Sep  5 05:10:32 atsnuc1 kernel: [1333685.771574]   next_to_clean        <57>
Sep  5 05:10:32 atsnuc1 kernel: [1333685.771574] buffer_info[next_to_clean]:
Sep  5 05:10:32 atsnuc1 kernel: [1333685.771574]   time_stamp           <113de7945>
Sep  5 05:10:32 atsnuc1 kernel: [1333685.771574]   next_to_watch        <58>
Sep  5 05:10:32 atsnuc1 kernel: [1333685.771574]   jiffies              <113de8248>
Sep  5 05:10:32 atsnuc1 kernel: [1333685.771574]   next_to_watch.status <0>
Sep  5 05:10:32 atsnuc1 kernel: [1333685.771574] MAC Status             <40080083>
Sep  5 05:10:32 atsnuc1 kernel: [1333685.771574] PHY Status             <796d>
Sep  5 05:10:32 atsnuc1 kernel: [1333685.771574] PHY 1000BASE-T Status  <3c00>
Sep  5 05:10:32 atsnuc1 kernel: [1333685.771574] PHY Extended Status    <3000>
Sep  5 05:10:32 atsnuc1 kernel: [1333685.771574] PCI Status             <10>
Sep  5 05:10:33 atsnuc1 kernel: [1333686.763389] e1000e 0000:00:1f.6 eno1: Reset adapter unexpectedly
Sep  5 05:10:38 atsnuc1 kernel: [1333692.507952] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

which coincided with the time the problems occurred.
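If you want to check whether a machine is hitting the same thing, something like the following should do. It's only a minimal sketch: it assumes syslog-style lines like the ones above, and the function names are just ones I've made up for illustration.

```shell
#!/bin/sh
# Sketch: find e1000e "Detected Hardware Unit Hang" events in a kernel log.
# Assumes syslog-style lines such as those in /var/log/kern.log.

# Print the number of hang events in the given log file.
count_unit_hangs() {
    # grep -c prints the count but exits non-zero when the count is 0,
    # so "|| true" keeps the function from reporting a failure in that case.
    grep -c 'e1000e .*Detected Hardware Unit Hang' "$1" || true
}

# Print the timestamp (first three syslog fields) of each hang event.
list_unit_hang_times() {
    grep 'e1000e .*Detected Hardware Unit Hang' "$1" | awk '{ print $1, $2, $3 }'
}
```

For example, `count_unit_hangs /var/log/kern.log` gives a quick count, and `list_unit_hang_times /var/log/kern.log` gives times you can line up against failed backups or slow NFS loads.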

Further investigation showed this is far from an isolated problem - a quick google revealed thousands of posts about similar issues, dating back over about 10 years, often culminating in a bug report that (if you were lucky) referenced a particular fix for a particular chipset - some of which were allegedly included in kernels newer than the 5.4 one Ubuntu Server 20.04 ships with. (For completeness, I'm sure many of these problems were caused by faulty hardware like bad cables, but from the sheer number it's also clear that many weren't).

We tried a range of kernel versions, right up to the very bleeding edge 5.6.x, without any change in behaviour. We tried changing out various pieces of hardware. Nothing helped.

The eventual conclusion of many of these posts is that a workaround is to disable the offloading of checksums to the network hardware - the problem is fairly well explained in a blog post by Michael Mulqueen.

I've been unable to figure out if this is a hardware or software problem - but if it's a hardware problem, it's a bug that affects a number of entire chipset lines.

The workaround that seems to work for most people is to disable checksum offloading for both transmit and receive by running:

ethtool -K eno1 tx off rx off

(replacing 'eno1' with the relevant interface name)
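To confirm the setting has taken effect you can look at the output of `ethtool -k eno1` (lowercase k). Here's a rough sketch of a check. It assumes the output contains top-level lines of the form `rx-checksumming: off` and `tx-checksumming: off`, and the function name is my own invention.

```shell
#!/bin/sh
# Sketch: read `ethtool -k <iface>` output on stdin and succeed only if
# both rx-checksumming and tx-checksumming are reported as off.
checksum_offload_disabled() {
    awk -F': ' '
        # Only the top-level (unindented) feature lines match exactly;
        # indented sub-features such as tx-checksum-ipv4 are ignored.
        $1 == "rx-checksumming" { rx = $2 }
        $1 == "tx-checksumming" { tx = $2 }
        END { exit !(rx ~ /^off/ && tx ~ /^off/) }
    '
}
```

Usage would be along the lines of `ethtool -k eno1 | checksum_offload_disabled && echo "checksum offload is off"`.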

This is lost on reboot; to apply it automatically on each boot (on Ubuntu Server 20.04), create an executable file /etc/networkd-dispatcher/routable.d/10-disable-offloading with contents:

#!/bin/sh

# disable checksum offload (tx and rx) and log the result
logger $0 -- "Running: ethtool -K eno1 tx off rx off"
logger $0 -- $(ethtool -K eno1 tx off rx off 2>&1)
logger $0 -- done

(This actually runs more than once during boot etc, but it's a no-op if it's already run.)
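If the repeated runs bother you, a variant that only acts when the hook fires for the interface in question might look like the sketch below. It assumes networkd-dispatcher exports the interface name to hook scripts in the `IFACE` environment variable. I haven't used this version on the servers described here, and `TARGET_IFACE`, `TAG` and the function names are just illustrative.

```shell
#!/bin/sh
# Sketch of a networkd-dispatcher hook limited to a single interface.
# Assumption: networkd-dispatcher sets IFACE to the interface that changed state.

TARGET_IFACE="eno1"        # the interface used in this post
TAG="disable-offloading"   # hypothetical syslog tag

# Succeed only when the hook was triggered for the target interface.
is_target_iface() {
    [ "${IFACE:-}" = "$TARGET_IFACE" ]
}

disable_checksum_offload() {
    is_target_iface || return 0   # another interface (or IFACE unset): do nothing
    logger -t "$TAG" "Running: ethtool -K $TARGET_IFACE tx off rx off"
    out=$(ethtool -K "$TARGET_IFACE" tx off rx off 2>&1)
    logger -t "$TAG" "ethtool exited with status $?: ${out:-no output}"
}

disable_checksum_offload
```

Note that with this version, running the script by hand does nothing unless you set the variable yourself, e.g. `IFACE=eno1 ./10-disable-offloading`.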

So far this is solving it for me - tracking all this down wasted a whole bunch of time for my colleagues and me; I'm frankly really quite shocked that there seems to be very little sign of Intel putting any effort into solving this problem properly. For future projects I'm going to be avoiding Intel chipset hardware.

Here are two more links with a bit more background:

https://serverfault.com/questions/421995/disable-tcp-offloading-completely-generically-and-easily

https://www.freedesktop.org/software/systemd/man/networkctl.html


Monday, 7 September 2020

Ubuntu Server 20.04: Intel NUCs / linux / e1000e driver, checksum offloading and 'Detected Hardware Unit Hang'

Wednesday, 29 April 2020

Fixing smart mailboxes showing the wrong messages / erasing spotlight indexes on macOS Catalina