The Curious Case of Debian 12 and the SSH Failure

2nd Jan 2024

This site is served out of a DigitalOcean droplet that I've had since perhaps 2013, which was set up with Debian 7.0 Wheezy; the same machine is my mailserver. The system has been upgraded over the years, all the way up to Debian 11 (Bullseye), and has always been fun to try to keep up to date.

It starts so innocently...

It was just after Halloween 2023, and I was signing up to some SaaS provider's service. The site popped up a message saying I'd be receiving a verification email with a thing to click on, and then I could log in. Fairly normal signup behaviour for a site at this point.

Except the email never arrived. I headed over to mail.log to see what Postfix was saying...

Excerpt from /var/log/mail.log postfix/smtpd[1141504]: connect from m206-43.eu.mailgun.net postfix/smtpd[1141504]: SSL_accept error from m206-43.eu.mailgun.net: -1 postfix/smtpd[1141504]: warning: TLS library problem: error:14094412:SSL routines:ssl3_read_bytes:sslv3 alert bad certificate:../ssl/record/rec_layer_s3.c:1543: SSL alert number 42: postfix/smtpd[1141504]: lost connection after STARTTLS from m206-43.eu.mailgun.net

Well, that's weird. A little searching around reveals a thread on StackExchange where a helpful dave_thompson_085 has this advice:

Receiving alert bad certificate (code 42) means the server demands you authenticate with a certificate, and you did not do so, and that caused the handshake failure...

Find a certificate issued by a CA in the 'acceptable' list...

So there's some kind of problem with the other end's CA. But it's Mailgun, you'd expect them to have a handle on keeping things up to date; let's check what their certificate looks like.

$ openssl s_client -showcerts -connect api.mailgun.net:443 CONNECTED(00000005) depth=2 C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root G2 verify return:1 depth=1 C = US, O = DigiCert Inc, CN = DigiCert Global G2 TLS RSA SHA256 2020 CA1 verify return:1 depth=0 C = US, ST = Texas, L = San Antonio, O = "MAILGUN TECHNOLOGIES, INC", CN = *.api.mailgun.net verify return:1 ...

This isn't the certificate used by Mailgun when attempting to send that email, but it is a certificate used by Mailgun, and it's signed by the DigiCert CA. If we check with DigiCert, we see that the 2020 CA certificate has been superceded as of March 2023, so that's probably our problem: our copy of DigiCert's CA is too old.

No problem, we just update it:

# apt search ca-certificates Sorting... Done Full Text Search... Done ca-certificates/stable,now 20210119 all [installed] Common CA certificates

Ah, right, yes. Bullseye has certificates up to 2021, and any CAs newer than that are in Debian 12 (Bookworm). And this, as I tooted at the time, is where things go awry:

The state of affairs as of 5pm on Nov 9th

And now we have a problem

After the upgrade to Bookworm, connections over ssh would drop immediately:

$ ssh imrannazar.com Connection closed $ ssh -v imrannazar.com OpenSSH_8.1p1, LibreSSL 2.7.3 debug1: Reading configuration data /etc/ssh/ssh_config debug1: /etc/ssh/ssh_config line 47: Applying options for * ... debug1: SSH2_MSG_KEXINIT sent Connection closed $

So what's going on here? Evidently something about the server's new version of ssh has broken incoming connections... Let's get into the VM's console to start a debug copy of ssh on a different port, and see what happens.

Running a debug sshd on the server # /usr/sbin/sshd -D -d -o Port=9001 ... Missing privilege separation directory: /run/sshd # mkdir /run/sshd # /usr/sbin/sshd -D -d -o Port=9001 Connecting to the debug sshd $ ssh -p9001 imrannazar.com Connection closed $

At least it's consistently crashing. Flipping back to the VM console, we get:

VM console with crashed sshd — VM console with output of sshd having crashed

Aha, a lead! This error looks suspiciously like it would cause things to fall over in a heap:

Fatal glibc error: cannot get entropy for arc4random

arc4random and the getrandom syscall

Why would this suddenly become a problem in Debian 12, when it wasn't a thing before? As it turns out, arc4random is a method of generating random numbers that's been in OpenBSD for years, but only made it to glibc (and thus to OpenSSH on Linux) in July 2022. My previous Bullseye kernel was up to date as of Jan 2021, so it makes sense that my previously installed OpenSSH didn't have the support required.

Now, what is the ssh daemon doing to get this error? A useful way of finding out which system calls are being made by a given program is to run it through strace:

# strace -f /usr/sbin/sshd -D -d -o Port=9001 ... [pid 1402] poll([{fd=7, events=POLLIN}, {fd=8, events=POLLIN}], 2, -1 <unfinished ...> [pid 1640] <... write resumed>) = 81 [pid 1640] getrandom(0x7f721688bd90, 16, 0) = -1 ENOSYS (Function not implemented) [pid 1640] openat(AT_FDCWD, "/dev/urandom", O_RDONLY|O_NOCTTY|O_CLOEXEC) = -1 EMFILE (Too many open files) [pid 1640] writev(2, [{iov_base="Fatal glibc error: cannot get en"..., iov_len=53}], 1) = -1 EFBIG (File too large) [pid 1640] --- SIGXFSZ {si_signo=SIGXFSZ, si_code=SI_USER, si_pid=1640, si_uid=104} --- [pid 1402] <... poll resumed>) = 1 ([{fd=7, revents=POLLIN|POLLHUP}]) ...

Here we see the ssh daemon start up as process 1402; a few hundred lines of strace's call log have been omitted here, as we're interested in what happens when it blocks waiting for connections. As we connect with an ssh client, the sshd wakes up process 1640 which was spawned as a handler.

The first thing the handler does is try to generate a random block of bytes to use for key exchange. According to the Phoronix page linked earlier:

The implementation is based on scalar Chacha20 with per-thread cache. It uses getrandom or /dev/urandom as fallback to get the initial entropy...

We see here that both the getrandom syscall and the urandom fallback failed. Without random data to start the key exchange, the handler crashes out. Now we know why we can't connect over ssh, what can we do?

Kernel versions

A good overview of the history of random number generators in Linux is this LWN article by Jake Edge, which states that the getrandom syscall was added in Linux 3.17; when combined with this quote from the manual page for the corresponding C function, we're led to a particular conclusion:

Errors

ENOSYS: The glibc wrapper function for getrandom() determined that the underlying kernel does not implement this system call.

So the kernel doesn't support getrandom. But I just upgraded to Bookworm, Debian 12, and I specifically saw linux-image-6.1.0 get installed...

# uname -a Linux 3.2.0-4-amd64 #1 SMP Debian 3.2.41-2+deb7u2 x86_64 GNU/Linux

In what can only be seen as a testament to the stability of Linux's syscall API, I've been running the original Wheezy kernel from ten years ago, through all the upgrades. But I've rebooted more than once in the intervening time, so even though I haven't rebooted since the upgrade, I should be running the 5.10 kernel bundled in with Bullseye?

In a stroke of genius from the admins at DigitalOcean, it's possible to set your VM to use a bootloader which isn't the system's default grub installation. As it turns out, that's how my VM was set up:

DO admin panel showing Debian 7.0 as the boot kernel — DigitalOcean admin panel, showing 'Debian 7.0' as the boot kernel

Swap it over to the recommended GrubLoader, reboot, and ssh suddenly started working again.

Another day set on fire

And so we lose another day to the vagaries of amateur systems administration: after six hours, lots of cursing, and at least two instances of "wait, what" being uttered, I got my mailserver back to the state it was in that morning, and received that verification email I was waiting for.

Good thing Nov 9th wasn't a workday... Oh wait.