RPi 5 with NVMe Base & Team MP33 NVMe performance problems

I am having an issue similar to this post - Rpi 5 with Pimoroni NVMe Base poor performance. I recently received 4 of the Pimoroni NVMe Bases and assembled all 4 of them with my Raspberry Pi 5s for a cluster build. When booting off the NVMe, I get terrible boot times and consistent I/O errors in syslog, and the system hiccups constantly, taking a long time to respond to commands.

Part Details

  • 4x Raspberry Pi 5 8GB w/ Active Cooler
  • 4x Power Supplies - CanaKit 45W USB-C Power Supply with PD for Raspberry Pi 5 (27W @ 5A)
  • 4x Pimoroni NVMe Base
  • 4x NVMe drives - Team MP33

Running Raspberry Pi OS Lite (64-bit) - Bookworm

Config (current)

pi@node1:~ $ rpi-eeprom-config
[all]
BOOT_UART=1
BOOT_ORDER=0xf146
POWER_OFF_ON_HALT=0
PCIE_PROBE=1

pi@node1:~ $ tail -2 /boot/firmware/config.txt
[all]
dtparam=nvme

I have tried dtparam=pciex1, both with and without dtparam=pciex1_gen=3, but every combination yields the same results.
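
For reference, the combinations I tried look like this in /boot/firmware/config.txt (one at a time, not stacked); as I understand it, dtparam=pciex1 enables the external PCIe connector just like dtparam=nvme, and pciex1_gen=3 forces the link to Gen 3 instead of the default Gen 2:

[all]
dtparam=pciex1
dtparam=pciex1_gen=3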

Errors / Logs

Boot Times

pi@node1:~ $ systemd-analyze
Startup finished in 34.026s (kernel) + 43.463s (userspace) = 1min 17.489s
multi-user.target reached after 43.451s in userspace.

pi@node1:~ $ systemd-analyze blame
34.944s dev-nvme0n1p2.device
 7.080s NetworkManager-wait-online.service
 4.526s man-db.service
 2.230s bthelper@hci0.service
 2.226s dpkg-db-backup.service
 2.157s logrotate.service
 2.142s systemd-tmpfiles-clean.service
 1.113s raspi-config.service
  216ms sys-kernel-tracing.mount
...
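
The standout there is dev-nvme0n1p2.device. For anyone who wants to dig further, standard systemd tooling should show where that unit sits in the boot dependency chain, e.g.:

systemd-analyze critical-chain dev-nvme0n1p2.device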

Syslog Errors

Feb 23 06:40:06 node1 kernel: nvme nvme0: I/O 632 (I/O Cmd) QID 3 timeout, aborting
Feb 23 06:40:08 node1 kernel: nvme nvme0: I/O 704 (I/O Cmd) QID 4 timeout, aborting
Feb 23 06:40:08 node1 kernel: nvme nvme0: I/O 705 (I/O Cmd) QID 4 timeout, aborting
Feb 23 06:40:08 node1 kernel: nvme nvme0: I/O 706 (I/O Cmd) QID 4 timeout, aborting
Feb 23 06:40:08 node1 kernel: nvme nvme0: I/O 707 (I/O Cmd) QID 4 timeout, aborting
Feb 23 06:40:08 node1 kernel: nvme nvme0: I/O 708 (I/O Cmd) QID 4 timeout, aborting
Feb 23 06:40:08 node1 kernel: nvme nvme0: I/O 709 (I/O Cmd) QID 4 timeout, aborting
Feb 23 06:40:08 node1 kernel: nvme nvme0: I/O 710 (I/O Cmd) QID 4 timeout, aborting
Feb 23 06:40:08 node1 kernel: nvme nvme0: Abort status: 0x0
Feb 23 06:40:08 node1 kernel: nvme nvme0: Abort status: 0x0
Feb 23 06:40:08 node1 kernel: nvme nvme0: Abort status: 0x0
Feb 23 06:40:08 node1 kernel: nvme nvme0: Abort status: 0x0
Feb 23 06:40:08 node1 kernel: nvme nvme0: Abort status: 0x0
Feb 23 06:40:08 node1 kernel: nvme nvme0: Abort status: 0x0
Feb 23 06:40:08 node1 kernel: nvme nvme0: Abort status: 0x0
Feb 23 06:40:08 node1 kernel: nvme nvme0: Abort status: 0x0
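
For anyone trying to reproduce this, the errors can be watched live from the kernel log with something like:

sudo dmesg --follow | grep -i nvme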

Troubleshooting

I have tried:

  • all logical combinations of the boot config with dtparam
  • updating firmware with rpi-update (on node1 only)
  • updating packages with apt-get update / apt-get upgrade
  • setting the boot order manually with rpi-eeprom-config --edit (example after this list)
  • setting the boot order via raspi-config
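
For completeness, the manual boot-order edit just opens the EEPROM config in an editor (the BOOT_ORDER digits are read right to left: 0x6 = NVMe, 0x4 = USB, 0x1 = SD card, 0xf = restart), and these are the values I ended up with:

pi@node1:~ $ sudo rpi-eeprom-config --edit
BOOT_ORDER=0xf146
PCIE_PROBE=1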

Later today I plan on taking the top node off and re-seating the ribbon cable and the NVMe, but I don't believe that is the issue, since I am experiencing it on all 4 of my nodes (and I would like to think I didn't screw up 4 times).

Seeking advice for other things to try today to hopefully resolve this.

After reading some additional posts, I also updated my bootloader to the "latest" release channel via raspi-config, which significantly improved my boot times (still not great), and it is no longer waiting 30+ seconds for the device.
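
For reference, the menu path was Advanced Options -> Bootloader Version -> Latest. If I understand the tooling correctly, the non-interactive equivalent is setting FIRMWARE_RELEASE_STATUS="latest" in /etc/default/rpi-eeprom-update and then running:

sudo rpi-eeprom-update -a
sudo reboot

The new numbers after that change: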

pi@node1:~ $ systemd-analyze
Startup finished in 34.569s (kernel) + 12.287s (userspace) = 46.857s
multi-user.target reached after 12.275s in userspace.

pi@node1:~ $ systemd-analyze blame
10.224s NetworkManager-wait-online.service
 2.178s systemd-tmpfiles-clean.service
 1.091s raspi-config.service
  765ms dev-nvme0n1p2.device
  496ms systemd-binfmt.service
  442ms systemd-timesyncd.service
  238ms networking.service

And I am still seeing the I/O errors in the logs.

I removed the NVMe Base from one of the nodes (node4), plugged the NVMe into a USB adapter, and booted off of that. It performs great:


pi@node4:~ $ systemd-analyze
Startup finished in 4.914s (kernel) + 6.645s (userspace) = 11.560s
multi-user.target reached after 6.634s in userspace.
pi@node4:~ $ systemd-analyze blame
4.233s NetworkManager-wait-online.service
1.290s dev-sda2.device
1.085s raspi-config.service
 811ms bthelper@hci0.service
 265ms systemd-timesyncd.service

Also, I am no longer seeing any of the I/O errors.

I believe I have figured out the problem. It appears these hangups / disconnects were caused by the NVMe's Autonomous Power State Transition (APST) feature. I was able to see details about it after installing nvme-cli:

pi@node2:~ $ sudo nvme get-feature /dev/nvme0 -f 0x0c -H
get-feature:0x0c (Autonomous Power State Transition), Current value:00000000
	Autonomous Power State Transition Enable (APSTE): Enabled
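
The drive's advertised power states (which APST transitions between) can also be pulled out of the identify-controller data; something like this should list them along with the apsta capability field:

sudo nvme id-ctrl /dev/nvme0 | grep -E "apsta|^ps "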

To disable this feature, I added the following two parameters to the end of the single line in my /boot/firmware/cmdline.txt file:

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
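
(Both parameters go on the same single line as everything else in cmdline.txt. Assuming the stock single-line Bookworm file, a one-liner like this appends them:)

sudo sed -i '1 s/$/ nvme_core.default_ps_max_latency_us=0 pcie_aspm=off/' /boot/firmware/cmdline.txt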

And when I rebooted, everything was working normally:

pi@node2:~ $ systemd-analyze
Startup finished in 3.189s (kernel) + 7.438s (userspace) = 10.628s
multi-user.target reached after 7.396s in userspace.

And all of the I/O errors disappeared.
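
After the reboot it is worth re-running the same nvme-cli check on each node; with the latency cap set to 0 the kernel should leave APST disabled, and the kernel log should stay free of the timeout/abort messages:

sudo nvme get-feature /dev/nvme0 -f 0x0c -H
sudo dmesg | grep -iE "nvme.*(timeout|abort)"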