This site is served out of a DigitalOcean droplet that I've had since perhaps 2013, which was set up with Debian 7.0 Wheezy; the same machine is my mailserver. The system has been upgraded over the years, all the way up to Debian 11 (Bullseye), and has always been fun to try to keep up to date.
It starts so innocently...
It was just after Halloween 2023, and I was signing up to some SaaS provider's service. The site popped up a message saying I'd be receiving a verification email with a thing to click on, and then I could log in. Fairly normal signup behaviour for a site at this point.
Except the email never arrived. I headed over to mail.log
to see what Postfix was saying...
Excerpt from /var/log/mail.log
postfix/smtpd[1141504]: connect from m206-43.eu.mailgun.net postfix/smtpd[1141504]: SSL_accept error from m206-43.eu.mailgun.net: -1 postfix/smtpd[1141504]: warning: TLS library problem: error:14094412:SSL routines:ssl3_read_bytes:sslv3 alert bad certificate:../ssl/record/rec_layer_s3.c:1543: SSL alert number 42: postfix/smtpd[1141504]: lost connection after STARTTLS from m206-43.eu.mailgun.net
Well, that's weird. A little searching around reveals a thread on StackExchange where a helpful dave_thompson_085
has this advice:
Receiving alert bad certificate (code 42) means the server demands you authenticate with a certificate, and you did not do so, and that caused the handshake failure...
Find a certificate issued by a CA in the 'acceptable' list...
So there's some kind of problem with the other end's CA. But it's Mailgun, you'd expect them to have a handle on keeping things up to date; let's check what their certificate looks like.
$ openssl s_client -showcerts -connect api.mailgun.net:443 CONNECTED(00000005) depth=2 C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root G2 verify return:1 depth=1 C = US, O = DigiCert Inc, CN = DigiCert Global G2 TLS RSA SHA256 2020 CA1 verify return:1 depth=0 C = US, ST = Texas, L = San Antonio, O = "MAILGUN TECHNOLOGIES, INC", CN = *.api.mailgun.net verify return:1 ...
This isn't the certificate used by Mailgun when attempting to send that email, but it is a certificate used by Mailgun, and it's signed by the DigiCert CA. If we check with DigiCert, we see that the 2020 CA certificate has been superceded as of March 2023, so that's probably our problem: our copy of DigiCert's CA is too old.
No problem, we just update it:
# apt search ca-certificates Sorting... Done Full Text Search... Done ca-certificates/stable,now 20210119 all [installed] Common CA certificates
Ah, right, yes. Bullseye has certificates up to 2021, and any CAs newer than that are in Debian 12 (Bookworm). And this, as I tooted at the time, is where things go awry:
And now we have a problem
After the upgrade to Bookworm, connections over ssh would drop immediately:
$ ssh imrannazar.com Connection closed $ ssh -v imrannazar.com OpenSSH_8.1p1, LibreSSL 2.7.3 debug1: Reading configuration data /etc/ssh/ssh_config debug1: /etc/ssh/ssh_config line 47: Applying options for * ... debug1: SSH2_MSG_KEXINIT sent Connection closed $
So what's going on here? Evidently something about the server's new version of ssh has broken incoming connections... Let's get into the VM's console to start a debug copy of ssh on a different port, and see what happens.
Running a debug sshd on the server
# /usr/sbin/sshd -D -d -o Port=9001 ... Missing privilege separation directory: /run/sshd # mkdir /run/sshd # /usr/sbin/sshd -D -d -o Port=9001
Connecting to the debug sshd
$ ssh -p9001 imrannazar.com Connection closed $
At least it's consistently crashing. Flipping back to the VM console, we get:
Aha, a lead! This error looks suspiciously like it would cause things to fall over in a heap:
Fatal glibc error: cannot get entropy for arc4random
arc4random and the getrandom syscall
Why would this suddenly become a problem in Debian 12, when it wasn't a thing before? As it turns out, arc4random
is a method of generating random numbers that's been in OpenBSD for years, but only made it to glibc (and thus to OpenSSH on Linux) in July 2022. My previous Bullseye kernel was up to date as of Jan 2021, so it makes sense that my previously installed OpenSSH didn't have the support required.
Now, what is the ssh daemon doing to get this error? A useful way of finding out which system calls are being made by a given program is to run it through strace
:
# strace -f /usr/sbin/sshd -D -d -o Port=9001 ... [pid 1402] poll([{fd=7, events=POLLIN}, {fd=8, events=POLLIN}], 2, -1 <unfinished ...> [pid 1640] <... write resumed>) = 81 [pid 1640] getrandom(0x7f721688bd90, 16, 0) = -1 ENOSYS (Function not implemented) [pid 1640] openat(AT_FDCWD, "/dev/urandom", O_RDONLY|O_NOCTTY|O_CLOEXEC) = -1 EMFILE (Too many open files) [pid 1640] writev(2, [{iov_base="Fatal glibc error: cannot get en"..., iov_len=53}], 1) = -1 EFBIG (File too large) [pid 1640] --- SIGXFSZ {si_signo=SIGXFSZ, si_code=SI_USER, si_pid=1640, si_uid=104} --- [pid 1402] <... poll resumed>) = 1 ([{fd=7, revents=POLLIN|POLLHUP}]) ...
Here we see the ssh daemon start up as process 1402; a few hundred lines of strace's call log have been omitted here, as we're interested in what happens when it blocks waiting for connections. As we connect with an ssh client, the sshd wakes up process 1640 which was spawned as a handler.
The first thing the handler does is try to generate a random block of bytes to use for key exchange. According to the Phoronix page linked earlier:
The implementation is based on scalar Chacha20 with per-thread cache. It uses getrandom or /dev/urandom as fallback to get the initial entropy...
We see here that both the getrandom
syscall and the urandom
fallback failed. Without random data to start the key exchange, the handler crashes out. Now we know why we can't connect over ssh, what can we do?
Kernel versions
A good overview of the history of random number generators in Linux is this LWN article by Jake Edge, which states that the getrandom
syscall was added in Linux 3.17; when combined with this quote from the manual page for the corresponding C function, we're led to a particular conclusion:
Errors
ENOSYS: The glibc wrapper function for getrandom() determined that the underlying kernel does not implement this system call.
So the kernel doesn't support getrandom
. But I just upgraded to Bookworm, Debian 12, and I specifically saw linux-image-6.1.0
get installed...
# uname -a Linux 3.2.0-4-amd64 #1 SMP Debian 3.2.41-2+deb7u2 x86_64 GNU/Linux
In what can only be seen as a testament to the stability of Linux's syscall API, I've been running the original Wheezy kernel from ten years ago, through all the upgrades. But I've rebooted more than once in the intervening time, so even though I haven't rebooted since the upgrade, I should be running the 5.10 kernel bundled in with Bullseye?
In a stroke of genius from the admins at DigitalOcean, it's possible to set your VM to use a bootloader which isn't the system's default grub
installation. As it turns out, that's how my VM was set up:
Swap it over to the recommended GrubLoader, reboot, and ssh suddenly started working again.
Another day set on fire
And so we lose another day to the vagaries of amateur systems administration: after six hours, lots of cursing, and at least two instances of "wait, what" being uttered, I got my mailserver back to the state it was in that morning, and received that verification email I was waiting for.
Good thing Nov 9th wasn't a workday... Oh wait.