NVMe Base weird issues / crash?

Hello,

I’ve started running into some strange issues with my setup, and I can’t reliably trace them back to a specific cause. My suspicion is that it’s related to storage, but I’m not entirely sure, so I’m trying to shed more light on the problem.

My setup:

  • Raspberry Pi 5, 8GB model (soon upgrading to 16GB)

  • Official active cooler

  • Pimoroni NVMe Base

  • Lexar NM710 500GB NVMe

  • Official USB-C power supply

This setup has been stable for about a year and a half. A few months ago, I moved it to a different environment. The only real change was temperature (smaller room, rack-mounted, less airflow), but it ran fine for about 2–3 months after the move.

A few weeks ago, I noticed the Pi started acting up. When I investigated, I found that I couldn’t log into it. Most of my services were in a semi-crashed state, and it seemed like only the kernel and a few core components were working. For example, the Pi would still respond to pings and route WireGuard connections, but SSH logins would always fail.

I also saw that the CPU usage was abnormally high. From the partial monitoring data (via Netdata), I caught a glimpse of some stats: NFS client activity, page faults, and a large spike in I/O writes. CPU temperature was around 65 °C and the NVMe was at about 45 °C.

My theory is that the NVMe drive freezes or otherwise becomes unreadable, which would explain why services crash or return errors (e.g., 404s) and why SSH might fail (perhaps the Pi can’t access the authorized keys). I haven’t been able to gather proper logs, since I can’t connect once it gets into this state.

At this point, my best guesses are either temperature-related or power-related issues. But 45 °C doesn’t seem too high, and the official power supply should be sufficient. I’m running out of ideas, so any help or suggestions would be greatly appreciated.

I don’t think you have a temperature problem. But you should read this lengthy post: NVMe useful commands and disabling APST. - Raspberry Pi Forums

There have been problems with the Lexar NM710 before, so it might be worth searching explicitly for this device. It could be that it goes into power-save mode and the OS cannot wake it up.

I would also recommend turning on persistent journals, so you don’t lose this important information at reboot. And/or send the logs automatically to a second machine, or write them to an SD card or USB device. Without logs, everything is guesswork.

Thanks for the response. I have disabled APST and enabled persistent journals. I will keep an eye on it and see how it behaves. I will also read through everything again and try to find device-specific problems. I will post an update if it happens again, or at the end of the week if nobody replies with other ideas.

Even with APST disabled, it happened again.

For more context, I was looking at the old log. It just stopped (got cut off) after this line, with nothing unusual before it:

Sep 28 18:31:55 raspberrypi dhcpcd[753]: eth0: requesting DHCPv6 information

Which does not tell me anything. Same behavior as last time. Everything remained the same as before. Only two notable things have happened: I closed the rack door and safely exposed port 443 only to Cloudflare IPs. My web services are well secured and the logs also don’t show anything suspicious. I looked into the partial data and the NVMe temperature was reported at 53 °C.

Looking through the journal I found only this:

Sep 22 19:08:07 raspberrypi kernel: Kernel command line: reboot=w coherent_pool=1M 8250.nr_uarts=1 pci=pcie_bus_safe cgroup_disable=memory numa_policy=interleave numa=fake=8 system_heap.max_order=0 smsc95xx.macaddr=<redac> vc_mem.mem_base=0x3fc00000 vc_mem.mem_size=0x40000000 console=ttyAMA10,115200 console=tty1 root=PARTUUID=74560cdb-02 rootfstype=ext4 fsck.repair=yes rootwait cfg80211.ieee80211_regdom=CZ nvme_core.default_ps_max_latency_us=0
Sep 22 19:08:07 raspberrypi kernel: nvme nvme0: pci function 0001:01:00.0
Sep 22 19:08:07 raspberrypi kernel: nvme 0001:01:00.0: enabling device (0000 -> 0002)
Sep 22 19:08:07 raspberrypi kernel: nvme nvme0: missing or invalid SUBNQN field.
Sep 22 19:08:07 raspberrypi kernel: nvme nvme0: failed to allocate host memory buffer.
Sep 22 19:08:07 raspberrypi kernel: nvme nvme0: 4/0/0 default/read/poll queues
Sep 22 19:08:07 raspberrypi kernel: nvme0n1: p1 p2
Sep 22 19:08:07 raspberrypi kernel: EXT4-fs (nvme0n1p2): mounted filesystem 93c89e92-8f2e-4522-ad32-68faed883d2f ro with ordered data mode. Quota mode: none.
Sep 22 19:08:07 raspberrypi kernel: EXT4-fs (nvme0n1p2): re-mounted 93c89e92-8f2e-4522-ad32-68faed883d2f r/w. Quota mode: none.
Sep 22 19:08:07 raspberrypi systemd-fsck[395]: /dev/nvme0n1p1: 400 files, 32849/261115 clusters
Sep 22 19:08:08 raspberrypi systemd[1]: nvmefc-boot-connections.service - Auto-connect to subsystems on FC-NVME devices found during boot was skipped because of an unmet condition check (ConditionPathExists=/sys/class/fc/fc_udev_device/nvme_discovery).
Sep 22 19:08:08 raspberrypi sensors[688]: nvme-pci-10100
Sep 22 19:08:14 raspberrypi systemd[1]: Starting nvmf-autoconnect.service - Connect NVMe-oF subsystems automatically during boot…
Sep 22 19:08:14 raspberrypi modprobe[1107]: modprobe: FATAL: Module nvme-fabrics not found in directory /lib/modules/6.12.20+rpt-rpi-2712
Sep 22 19:08:14 raspberrypi systemd[1]: nvmf-autoconnect.service: Control process exited, code=exited, status=1/FAILURE
Sep 22 19:08:14 raspberrypi systemd[1]: nvmf-autoconnect.service: Failed with result 'exit-code'.
Sep 22 19:08:14 raspberrypi systemd[1]: Failed to start nvmf-autoconnect.service - Connect NVMe-oF subsystems automatically during boot.
Sep 22 19:08:18 raspberrypi kernel: nvme nvme0: using unchecked data buffer
Sep 22 19:09:05 raspberrypi sudo[7788]: depstr : TTY=pts/0 ; PWD=/home/depstr ; USER=root ; COMMAND=/usr/sbin/nvme get-feature /dev/nvme0 -f 0x0c -H

Have you already read the error log? The nvme tool has an error-log subcommand, and you could also try smartctl.

Right after writing the post above, it happened again. I was still logged in over SSH, so here is some output, but I don’t think it’s going to be much help. It happened within 15 minutes of a reboot.

bash: /usr/bin/journalctl: Input/output error
bash: /usr/bin/dmesg: Input/output error
bash: /usr/bin/htop: Input/output error

I am going to reboot now and keep live dmesg and journal output on the second monitor. I am also going to check smartctl.

I have an NM620 which I planned to use for a Pi 5 home server. I will set up a test system and let it run 24/7. Maybe I can reproduce your problem, but I am not sure how similar these devices really are. And this will take a while, since I have a number of more urgent tasks to do.

What I can suggest: add a SD-card (or USB-drive), and write the logs there instead of writing to the NVMe. In this case maybe there is more information. You could also increase the verbosity level of the kernel.
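One way to act on this suggestion is a small script that copies kernel messages to a file on a different device and forces every line to disk, so the last messages before a hang survive. The following is only a minimal sketch in Python. It assumes the SD card or USB stick is mounted at the hypothetical path /mnt/logs and that dmesg supports --follow and --ctime (util-linux); neither detail comes from this thread.

```python
import os
import subprocess
from typing import Iterable


def copy_lines_durably(lines: Iterable[str], path: str) -> int:
    """Append each line to `path` and fsync after every line.

    Returns the number of lines written. Syncing per line is slow, but it
    means the final messages before a storage hang are already on the
    other device.
    """
    count = 0
    with open(path, "a", encoding="utf-8") as out:
        for line in lines:
            out.write(line if line.endswith("\n") else line + "\n")
            out.flush()              # push Python's buffer to the OS
            os.fsync(out.fileno())   # ask the OS to write it to the device
            count += 1
    return count


def follow_dmesg(path: str) -> None:
    """Stream `dmesg --follow` into `path` until interrupted.

    Reading the kernel log usually needs root. The path should point at a
    device other than the NVMe (assumption: e.g. /mnt/logs/dmesg.log).
    """
    proc = subprocess.Popen(
        ["dmesg", "--follow", "--ctime"],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        copy_lines_durably(proc.stdout, path)
    finally:
        proc.terminate()
```

Calling follow_dmesg("/mnt/logs/dmesg.log") as root would have to happen before the failure, because new programs cannot be loaded from the NVMe once it stops responding (see the Input/output error messages earlier in the thread).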

Just as I expected, it happened again, and this time I stayed up recording dmesg.
There is much more info now.

[58893.722355] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[58893.722364] nvme nvme0: Does your device have a faulty power saving mode enabled?
[58893.722367] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
[58893.806373] nvme 0001:01:00.0: enabling device (0000 -> 0002)
[58893.812291] nvme nvme0: failed to allocate host memory buffer.
[58893.827332] nvme nvme0: 4/0/0 default/read/poll queues

[58912.209409] nvme0n1: I/O Cmd(0x1) @ LBA 35395952, 32 blocks, I/O Error (sct 0x0 / sc 0x6)
[58912.209422] I/O error, dev nvme0n1, sector 35395952 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
[58912.209429] EXT4-fs warning (device nvme0n1p2): ext4_end_bio:342: I/O error 10 writing to inode 4132659 starting block 4424494)
[58912.209438] Buffer I/O error on device nvme0n1p2, logical block 4292398
[58912.209441] Buffer I/O error on device nvme0n1p2, logical block 4292399
[58912.209454] Buffer I/O error on device nvme0n1p2, logical block 4292400
[58912.209457] Buffer I/O error on device nvme0n1p2, logical block 4292401
[58912.513101] nvme0n1: I/O Cmd(0x1) @ LBA 3262616, 112 blocks, I/O Error (sct 0x0 / sc 0x6)
[58912.513116] I/O error, dev nvme0n1, sector 3262616 op 0x1:(WRITE) flags 0x9800 phys_seg 14 prio class 2
[58912.513180] Aborting journal on device nvme0n1p2-8.
[58912.513288] EXT4-fs error (device nvme0n1p2) in ext4_reserve_inode_write:5801: Journal has aborted
[58912.513348] EXT4-fs error (device nvme0n1p2): ext4_journal_check_start:84: comm containerd-shim: Detected aborted journal
[58912.513555] EXT4-fs error (device nvme0n1p2): ext4_journal_check_start:84: comm systemd-journal: Detected aborted journal
[58912.556406] nvme0n1: I/O Cmd(0x1) @ LBA 1056784, 32 blocks, I/O Error (sct 0x0 / sc 0x6)

[58913.380393] EXT4-fs warning (device nvme0n1p2): ext4_end_bio:342: I/O error 10 writing to inode 148022 starting block 417963)
[58913.466728] EXT4-fs warning (device nvme0n1p2): ext4_end_bio:342: I/O error 10 writing to inode 148022 starting block 421614)
[58917.586170] nvme_log_error: 50 callbacks suppressed
[58917.586177] nvme0n1: I/O Cmd(0x1) @ LBA 3153920, 8 blocks, I/O Error (sct 0x0 / sc 0x6)
[58917.586185] blk_print_req_error: 50 callbacks suppressed
[58917.586187] I/O error, dev nvme0n1, sector 3153920 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 2
[58917.586193] buffer_io_error: 9 callbacks suppressed

[58918.243321] EXT4-fs (nvme0n1p2): I/O error while writing superblock
[58918.243328] EXT4-fs (nvme0n1p2): Remounting filesystem read-only
[58918.243395] EXT4-fs (nvme0n1p2): Remounting filesystem read-only

Full log:

The first three lines are relevant. The others are just errors resulting from the nvme0 controller reset.
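To pick that first line out of a long dmesg capture automatically, a small parser can help. This is a rough sketch in Python; the function name and the returned keys are made up for illustration. The all_ones flag rests on the general observation that a register read of 0xffffffff usually means the device did not answer on the PCIe bus at all, rather than the controller reporting a real status value. Treat that as a hint, not a diagnosis.

```python
import re
from typing import Optional

# Matches the kernel message format seen in this thread.
CONTROLLER_DOWN = re.compile(
    r"nvme (?P<ctrl>nvme\d+): controller is down; will reset: "
    r"CSTS=(?P<csts>0x[0-9a-fA-F]+), PCI_STATUS=(?P<pci>0x[0-9a-fA-F]+)"
)


def parse_controller_down(line: str) -> Optional[dict]:
    """Return the fields of a 'controller is down' line, or None if no match."""
    m = CONTROLLER_DOWN.search(line)
    if m is None:
        return None
    csts = int(m.group("csts"), 16)
    return {
        "controller": m.group("ctrl"),
        "csts": csts,
        "pci_status": int(m.group("pci"), 16),
        # All bits set: the read itself most likely failed (device unreachable).
        "all_ones": csts == 0xFFFFFFFF,
    }


def find_resets(lines) -> list:
    """Collect every controller-down event from an iterable of log lines."""
    events = []
    for line in lines:
        info = parse_controller_down(line)
        if info is not None:
            events.append(info)
    return events
```

Feeding it a saved log, for example find_resets(open("dmesg.log")), gives one entry per reset, which makes it easier to count how often the controller dropped out between reboots.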

This seems to be a very common error, especially (but not only) with Lexar SSDs. Have you tried adding all the kernel tweaks that the message suggests?
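For anyone wanting to add those three parameters without breaking the single-line format of the kernel command line, here is a small sketch in Python. The parameter names are taken from the kernel message above; the function name is hypothetical, and the file location (/boot/firmware/cmdline.txt on current Raspberry Pi OS) is an assumption that is not stated in this thread.

```python
from typing import Sequence

# Parameters suggested by the kernel message in this thread.
SUGGESTED = (
    "nvme_core.default_ps_max_latency_us=0",
    "pcie_aspm=off",
    "pcie_port_pm=off",
)


def merge_cmdline(cmdline: str, wanted: Sequence[str] = SUGGESTED) -> str:
    """Return `cmdline` with every parameter in `wanted` present.

    If a parameter with the same key (the part before '=') already exists,
    its first occurrence is replaced in place; otherwise the parameter is
    appended. The result is a single space-separated line.
    """
    tokens = cmdline.split()
    for param in wanted:
        key = param.split("=", 1)[0]
        for i, tok in enumerate(tokens):
            if tok.split("=", 1)[0] == key:
                tokens[i] = param
                break
        else:
            tokens.append(param)
    return " ".join(tokens)
```

The idea would be to read the existing file, pass its content through merge_cmdline, write the result back as root, reboot, and then confirm the parameters in /proc/cmdline. Keeping a backup copy of the original file first is a sensible precaution.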

Independent of this, run sudo smartctl -a /dev/nvme0. You can run it any time. There is also a self-test mode which you can start, it will run asynchronously.

Are you doing IO on a regular basis, or is the disk idle for a longer time period? In the former case, it is hard to believe that it is a power-management issue. But in the end, it might just be a bug.

Other things to try: update the kernel, and maybe update the SSD firmware. Or issue commands to the controller at regular intervals in the hope that this prevents the controller from going down, something like sudo nvme fw-log /dev/nvme0.
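The regular-interval idea could be tried with a small polling loop. The sketch below is in Python and uses the nvme fw-log command mentioned above; the function names and the injectable run and sleep parameters are illustrative only. Whether polling actually keeps the controller alive is an open question, so this is an experiment rather than a fix.

```python
import subprocess
import time
from typing import Callable, List, Sequence

# Command suggested in this thread; it needs root and the nvme-cli package.
POLL_COMMAND = ("nvme", "fw-log", "/dev/nvme0")


def run_quietly(cmd: Sequence[str]) -> int:
    """Run `cmd`, discard its output, and return the exit code."""
    return subprocess.run(list(cmd), capture_output=True).returncode


def poll_controller(
    rounds: int,
    interval_s: float,
    run: Callable[[Sequence[str]], int] = run_quietly,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Issue POLL_COMMAND `rounds` times, waiting `interval_s` between calls.

    Returns the exit codes in order; a non-zero code means the command
    failed, which may indicate that the controller has stopped responding.
    """
    codes = []
    for i in range(rounds):
        codes.append(run(POLL_COMMAND))
        if i + 1 < rounds:
            sleep(interval_s)
    return codes
```

Something like poll_controller(rounds=1440, interval_s=60) would poll once a minute for a day. Logging the first non-zero exit code together with a timestamp would also show roughly when the drive dropped out.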

Despite adding the kernel tweaks, the issue came back quite fast. Here is the smartctl output.

=== START OF INFORMATION SECTION ===
Model Number:                       Lexar SSD NM710 500GB
Serial Number:                      NFS401R003900P2200
Firmware Version:                   9742
PCI Vendor/Subsystem ID:            0x1d97
IEEE OUI Identifier:                0xcaf25b
Total NVM Capacity:                 500,107,862,016 [500 GB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500,107,862,016 [500 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            caf25b 031000003c
Local Time is:                      Sat Oct  4 19:16:37 2025 CEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x02):         Cmd_Eff_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     90 Celsius
Critical Comp. Temp. Threshold:     95 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.50W       -        -    0  0  0  0        0       0
 1 +     5.80W       -        -    1  1  1  1        0       0
 2 +     3.60W       -        -    2  2  2  2        0       0
 3 -   0.0500W       -        -    3  3  3  3     5000   10000
 4 -   0.0025W       -        -    4  4  4  4     8000   45000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        46 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    4,104,889 [2.10 TB]
Data Units Written:                 4,242,903 [2.17 TB]
Host Read Commands:                 49,245,139
Host Write Commands:                122,582,328
Controller Busy Time:               271
Power Cycles:                       117
Power On Hours:                     7,886
Unsafe Shutdowns:                   39
Media and Data Integrity Errors:    0
Error Information Log Entries:      31
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               46 Celsius
Temperature Sensor 2:               42 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
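To compare these counters between runs (for example before and after a crash), the text output can be reduced to a few numbers. The following Python sketch parses the "Name: value" lines of smartctl -a output; the function names and the choice of watched fields are assumptions made for illustration, not anything smartctl itself provides.

```python
import re

# Counters worth tracking over time (an illustrative selection).
WATCHED = (
    "Power Cycles",
    "Unsafe Shutdowns",
    "Media and Data Integrity Errors",
    "Error Information Log Entries",
)


def parse_smart_fields(text: str) -> dict:
    """Collect 'Name: value' pairs from smartctl -a text output."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip():
            fields[name.strip()] = value.strip()
    return fields


def leading_int(value: str) -> int:
    """Parse the leading integer of a value such as '7,886' or '46 Celsius'."""
    m = re.match(r"[\d,]+", value)
    if m is None:
        raise ValueError(f"no leading number in {value!r}")
    return int(m.group().replace(",", ""))


def health_counters(text: str) -> dict:
    """Return the watched counters that are present, as integers."""
    fields = parse_smart_fields(text)
    return {name: leading_int(fields[name]) for name in WATCHED if name in fields}
```

Saving the result of health_counters after each boot and comparing it with the previous one shows which counters moved. In the output above, 39 of 117 power cycles were unsafe shutdowns, which would fit repeated forced reboots, although the thread does not confirm that this is the cause.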

I am not sure how I would describe the I/O activity, but I would say it’s regular. The Pi isn’t used very much and has very few apps running. The notable ones are Nextcloud and Plex, but they both read from the NAS, meaning it’s probably only the DBs, plus a few other things. Nothing big. During the night I would say it’s not used at all. As for power: it’s behind a surge-protected UPS and PDU, with the official power supply.

Right now, I have updated everything I could (OS, kernel), except the SSD firmware.
I will see how it behaves for now.
My other options are:

  1. Try another SSD; I have a small one lying around.
  2. Swap the Pi itself, which will happen sooner or later anyway when I move to the 16GB model.

In the log there is nothing obvious. So I think you tried everything.

One problem remains: there are other possible error sources, like the FPC cable or the NVMe Base itself. PCIe was not designed for this kind of setup. So another step would be to either put the SSD in a USB enclosure, boot from there and see what happens, or to swap the NVMe Base.
But I am only speculating…

I will try to reseat the FPC cable, since it happened again. Next I will probably swap it to a different Pi and then just replace the SSD. Nothing else I can think of.

Please keep me posted. This bug/problem could also hit me once I set up my Lexar. Thanks and good luck!

I cleaned the whole system, as it was a bit dusty, and in the process I reseated the flex cable. But it appears that made it worse. Could it be the cable?

Could be; the connectors are another possible source. Pimoroni has replacement cables, but shipping is probably ten times the cost of the cable. At least to Germany.

I have ordered it, cost me 4 euro. There is a verified supplier in my country so it’s cheap and fast.
I will also clone the SSD to a spare one. After that, I’ll test everything next weekend to see how it goes. We will hopefully finally know where the issue lies.

Currently running on a cloned M.2 SSD with the same data as the Lexar. It’s an SK Hynix drive. Uptime is currently 3 days, and it seems stable so far. I’ll continue to monitor it. I haven’t even swapped the PCIe cable yet.

Thanks, sounds good.

I’m back with over one month of successful uptime (1 month, 10 days, and 3 hours). So far, there have been absolutely no issues. I’d say that replacing the Lexar with a different drive (SK Hynix in my case) seems to be the solution. (Not sure if I mentioned it before, but the Lexar drive was new, and so is the SK Hynix.)