
Thread: Two Complete System Freezes Within One Hour (Leap 15.2)

  1. #31
    Join Date
    Dec 2008
    Location
    FL, USA
    Posts
    2,601
    Blog Entries
    1

    Default Re: Two Complete System Freezes Within One Hour (Leap 15.2)

    Quote Originally Posted by dickhack View Post
    No lightning strikes. No power failures recently, although they have happened in the past - can't remember the last one. I have a surge protector between the wall and the machine, but no UPS.
    Ordinary surge protectors sacrifice themselves to protect the equipment. That sacrifice commonly terminates the protection they had when new, without notice. A UPS is much better protection, besides keeping the machine running for some time when power goes out longer than mere seconds. A power surge long ago could have triggered eventual power supply instability that only recently manifested. If your PS is out of warranty, open it up for inspection for any obviously bad caps.

    At least RAM is easy to test. http://www.memtest86.com/ is what I recommend over memtest86+ for non-antique PCs, for at least the 4 iteration default, if not longer. Overnight (10+ hours) is better assurance for such a random failure as you're having.
    Reg. Linux User #211409 *** multibooting since 1992
    Primary: 15.1, TW, 15.2 & 13.1 on Haswell w/ RAID
    Secondary: eComStation (OS/2)&15.1 on i965P/Radeon
    Tertiary: TW,15.2,15.1,Fedora,Debian,more on Kaby Lake,iQ45,iQ43,iG41,iG3X,i965G,AMD,NVidia&&&

  2. #32

    Default Re: Two Complete System Freezes Within One Hour (Leap 15.2)

    Quote Originally Posted by mrmazda View Post
    Ordinary surge protectors sacrifice themselves to protect the equipment. That sacrifice commonly terminates the protection they had when new, without notice. A UPS is much better protection, besides keeping the machine running for some time when power goes out longer than mere seconds. A power surge long ago could have triggered eventual power supply instability that only recently manifested. If your PS is out of warranty, open it up for inspection for any obviously bad caps.

    At least RAM is easy to test. http://www.memtest86.com/ is what I recommend over memtest86+ for non-antique PCs, for at least the 4 iteration default, if not longer. Overnight (10+ hours) is better assurance for such a random failure as you're having.
    I've considered using a UPS in the past, but decided not to because 1) they're expensive, 2) they need battery replacement, 3) they need to be disposed of properly when they wear out, and 4) most importantly, when I was maintaining them for my clients they proved highly unreliable at actually keeping a PC running through a power failure.

    I just downloaded the latest memtest86 and will run a test in a minute. If it shows nothing I'll try an overnight run.

  3. #33
    Join Date
    Dec 2008
    Location
    FL, USA
    Posts
    2,601
    Blog Entries
    1

    Default Re: Two Complete System Freezes Within One Hour (Leap 15.2)

    Replacing batteries periodically is less hassle, and potentially cheaper, than replacing motherboards (& RAM & CPU to match if aged), power supplies, data, and time lost troubleshooting whatever's causing random lockups. I've been using them for three decades, and not just for computers.

  4. #34

    Default Re: Two Complete System Freezes Within One Hour (Leap 15.2)

    Quote Originally Posted by mrmazda View Post
    Replacing batteries periodically is less hassle, and potentially cheaper, than replacing motherboards (& RAM & CPU to match if aged), power supplies, data, and time lost troubleshooting whatever's causing random lockups. I've been using them for three decades, and not just for computers.
    Nonetheless, I prefer not to. I just can't justify the expense (pretty much a minimum of $200) and the hassle of dealing with them for an event which really is fairly rare unless one has a lot of crappy power. Most people get by nicely on surge protectors, and I have for the last two decades. I'm not spending $200 on insurance for a $900 box when I can replace ninety percent of the main parts for the same or twice that money if *maybe* there's an issue *someday*. And then I have to replace the batteries every couple of years at considerable expense. And again, my experience with my clients in the past was that the things are just not reliable. They're like "external NAS" boxes - complete **** that die within a year because the heat dissipation sucks and the electronics burn out.

    What I will do, however, is replace my current surge protector on the off chance that it may be old enough and worn out to be allowing surges to get through periodically. That precaution makes sense to me.

    Beyond that, I just ran Memtest86 for four full passes and no errors were found. I'm not going to bother running it overnight, as I view the chances of it picking up a memory error that only occurs every few days or a week as slight. Four passes is supposed to catch the vast majority of errors. At worst, I'll just replace the memory with new modules.

    I've also reset all the BIOS options back to what ASUS calls "optimized defaults". I went back in and reset my aggressive fan profile to keep the temperatures down, but otherwise the BIOS, CPU and RAM speeds are all at factory settings. I did that before and it didn't stop the freezes, but we'll see. Maybe it will make the freezes even less frequent, which would help.

    I'm also going to open the box, blow out the dust, including the memory slots, and check all the cables - the usual stuff I probably should have done before now.

    So my next step is to replace the surge protector. After that, if the freezes continue, I'll have to decide either to replace the power supply (a main suspect in hardware issues), replace the video card (another main suspect), do another complete reinstall, or just live with the freezes until I have to buy a new machine. I really don't want to spend the next stimulus check on buying a new machine (or a UPS).

    I still think the issue is something between the OS, the video card, and/or the AMD/ASUS hardware. Once again, I had *desktop* crashes relatively frequently under 15.1, which I doubt were hardware related but more likely video driver issues, but never hard system freezes. That started with 15.2 - which points the finger squarely at 15.2's interaction with my hardware. Which is why I'm really considering another total reinstall, PITA though that is.

    In general, living with the freezes if they're infrequent enough is better for me than trying to rip the machine apart on the off chance that there is some hardware failure happening. It's just annoying because it always happens when I'm in the middle of something.

  5. #35

    Default Re: Two Complete System Freezes Within One Hour (Leap 15.2)

    OK, I think I found something.

    I had a freeze tonight at 2:18 AM Pacific time. I had just turned on my three docking stations which hold my six backup drives. The Most Recent Device popup just came up to show the drives mounted and the system froze solid as usual.

    This time I found a general protection fault in the logs. I'm not sure if this issue is the same as the freezing I've been having. Some research suggests that the Linux kernel has had issues with external USB 3 docking stations in the past - something to do with the XHCI support, whatever that is - so it might be related to that.

    This stuff started happening between 02:17:32 and 02:18:02 when the freeze happened:

    Code:
    Jan 23 02:17:25 localhost.localdomain udisksd[2273]: Mounted /dev/sdk1 at /run/media/rhack/Backup7 on behalf of uid 1000
    Jan 23 02:17:25 localhost.localdomain kernel: EXT4-fs (sdk1): mounted filesystem with ordered data mode. Opts: (null)
    Jan 23 02:17:32 localhost.localdomain kernel: usb 6-3: new SuperSpeed Gen 1 USB device number 17 using xhci_hcd
    Jan 23 02:17:32 localhost.localdomain kernel: usb 6-3: New USB device found, idVendor=174c, idProduct=55aa, bcdDevice= 1.00
    Jan 23 02:17:32 localhost.localdomain kernel: usb 6-3: New USB device strings: Mfr=2, Product=3, SerialNumber=1
    Jan 23 02:17:32 localhost.localdomain kernel: usb 6-3: Product: ASMT1153e
    Jan 23 02:17:32 localhost.localdomain kernel: usb 6-3: Manufacturer: asmedia
    Jan 23 02:17:32 localhost.localdomain kernel: usb 6-3: SerialNumber: 1234567897F6
    Jan 23 02:17:32 localhost.localdomain kernel: scsi host14: uas
    Jan 23 02:17:32 localhost.localdomain mtp-probe[8633]: checking bus 6, device 17: "/sys/devices/pci0000:00/0000:00:07.1/0000:0c:00.3/usb6/6-3"
    Jan 23 02:17:32 localhost.localdomain mtp-probe[8633]: bus: 6, device: 17 was not an MTP device
    Jan 23 02:17:32 localhost.localdomain kernel: scsi 14:0:0:0: Direct-Access     TOSHIBA  HDWE140          0    PQ: 0 ANSI: 6
    Jan 23 02:17:32 localhost.localdomain kernel: scsi 14:0:0:1: Direct-Access     TOSHIBA  MG03ACA400       0    PQ: 0 ANSI: 6
    Jan 23 02:17:32 localhost.localdomain kernel: sd 14:0:0:0: Attached scsi generic sg13 type 0
    Jan 23 02:17:32 localhost.localdomain kernel: sd 14:0:0:1: Attached scsi generic sg14 type 0
    Jan 23 02:17:32 localhost.localdomain kernel: sd 14:0:0:0: [sdl] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
    Jan 23 02:17:32 localhost.localdomain kernel: sd 14:0:0:0: [sdl] 4096-byte physical blocks
    Jan 23 02:17:32 localhost.localdomain kernel: sd 14:0:0:1: [sdm] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
    Jan 23 02:17:32 localhost.localdomain kernel: sd 14:0:0:0: [sdl] Write Protect is off
    Jan 23 02:17:32 localhost.localdomain kernel: sd 14:0:0:0: [sdl] Mode Sense: 43 00 00 00
    Jan 23 02:17:32 localhost.localdomain kernel: sd 14:0:0:1: [sdm] Write Protect is off
    Jan 23 02:17:32 localhost.localdomain kernel: sd 14:0:0:1: [sdm] Mode Sense: 43 00 00 00
    Jan 23 02:17:32 localhost.localdomain kernel: sd 14:0:0:0: [sdl] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    Jan 23 02:17:32 localhost.localdomain kernel: sd 14:0:0:1: [sdm] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    Jan 23 02:17:32 localhost.localdomain kernel: sd 14:0:0:0: [sdl] Optimal transfer size 33553920 bytes not a multiple of physical block size (4096 bytes)
    Jan 23 02:17:32 localhost.localdomain kernel: xhci_hcd 0000:0c:00.3: bad transfer trb length 504 in event trb
    Jan 23 02:17:32 localhost.localdomain kernel: sd 14:0:0:1: [sdm] Optimal transfer size 33553920 bytes
    Jan 23 02:17:32 localhost.localdomain kernel: general protection fault: 0000 [#1] SMP NOPTI
    Jan 23 02:17:32 localhost.localdomain kernel: CPU: 9 PID: 8166 Comm: kworker/u32:3 Tainted: G           O       5.3.18-lp152.60-default #1 openSUSE Leap 15.2
    Jan 23 02:17:32 localhost.localdomain kernel: Hardware name: System manufacturer System Product Name/ROG STRIX X470-F GAMING, BIOS 4207 12/07/2018
    Jan 23 02:17:32 localhost.localdomain kernel: Workqueue: events_unbound async_run_entry_fn
    Jan 23 02:17:32 localhost.localdomain kernel: RIP: 0010:kmem_cache_alloc_trace+0x90/0x270
    Jan 23 02:17:32 localhost.localdomain kernel: Code: 00 00 4d 8b 06 65 4d 8b 50 08 65 4c 03 05 70 4e 97 60 4d 8b 38 4d 85 ff 0f 84 a4 01 00 00 0f 1f 44 00 00 41 8b 5e 20 4c 01 fb <48> 33 1b 49 33 9e 70 01 00 00 49 8b 3e 49 8d 4a 01 4c 89 d2 4c 89
    Jan 23 02:17:32 localhost.localdomain kernel: RSP: 0018:ffffac4dc3073b50 EFLAGS: 00010202
    Jan 23 02:17:32 localhost.localdomain kernel: RAX: 0000000000000000 RBX: 2499d754c12f30a7 RCX: 0000000000000000
    Jan 23 02:17:32 localhost.localdomain kernel: RDX: 0000000000000200 RSI: 0000000000000cc0 RDI: ffff900707c06d80
    Jan 23 02:17:32 localhost.localdomain kernel: RBP: 0000000000000cc0 R08: ffff900a0ec72140 R09: 00000008ffffffff
    Jan 23 02:17:32 localhost.localdomain kernel: R10: 0000000001792e8f R11: 0000000000017580 R12: ffff900707c06d80
    Jan 23 02:17:32 localhost.localdomain kernel: R13: 0000000000000200 R14: ffff900707c06d80 R15: 2499d754c12f30a7
    Jan 23 02:17:32 localhost.localdomain kernel: FS:  0000000000000000(0000) GS:ffff900a0ec40000(0000) knlGS:0000000000000000
    Jan 23 02:17:32 localhost.localdomain kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Jan 23 02:17:32 localhost.localdomain kernel: CR2: 000055ffbacc8270 CR3: 00000001a040a000 CR4: 00000000003406e0
    Jan 23 02:17:32 localhost.localdomain kernel: Call Trace:
    Jan 23 02:17:32 localhost.localdomain kernel:  ? sd_revalidate_disk+0x97/0x1730
    Jan 23 02:17:32 localhost.localdomain kernel:  sd_revalidate_disk+0x97/0x1730
    Jan 23 02:17:32 localhost.localdomain kernel:  rescan_partitions+0x88/0x2c0
    Jan 23 02:17:32 localhost.localdomain kernel:  __blkdev_get+0x3c8/0x7f0
    Jan 23 02:17:32 localhost.localdomain kernel:  blkdev_get+0xd3/0x120
    Jan 23 02:17:32 localhost.localdomain kernel:  __device_add_disk+0x3e9/0x510
    Jan 23 02:17:32 localhost.localdomain kernel:  sd_probe+0x322/0x4a0
    Jan 23 02:17:32 localhost.localdomain kernel:  really_probe+0xef/0x430
    Jan 23 02:17:32 localhost.localdomain kernel:  ? driver_allows_async_probing+0x50/0x50
    Jan 23 02:17:32 localhost.localdomain kernel:  driver_probe_device+0x110/0x120
    Jan 23 02:17:32 localhost.localdomain kernel:  ? driver_allows_async_probing+0x50/0x50
    Jan 23 02:17:32 localhost.localdomain kernel:  bus_for_each_drv+0x69/0xb0
    Jan 23 02:17:32 localhost.localdomain kernel:  __device_attach_async_helper+0xad/0x100
    Jan 23 02:17:32 localhost.localdomain kernel:  async_run_entry_fn+0x37/0x140
    Jan 23 02:17:32 localhost.localdomain kernel:  process_one_work+0x1f4/0x3e0
    Jan 23 02:17:32 localhost.localdomain kernel:  worker_thread+0x2d/0x3e0
    Jan 23 02:17:32 localhost.localdomain kernel:  ? process_one_work+0x3e0/0x3e0
    Jan 23 02:17:32 localhost.localdomain kernel:  kthread+0x10d/0x130
    Jan 23 02:17:32 localhost.localdomain kernel:  ? kthread_park+0xa0/0xa0
    Jan 23 02:17:32 localhost.localdomain kernel:  ret_from_fork+0x22/0x40
    Jan 23 02:17:32 localhost.localdomain kernel: Modules linked in: nls_utf8 isofs loop snd_seq_dummy snd_hrtimer snd_seq snd_seq_device tun binfmt_misc fuse af_packet xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables vboxnetadp(O) x_tables vboxnetflt(O) bpfilter dmi_sysfs vboxdrv(O) msr nls_iso8859_1 snd_hda_codec_realtek nls_cp437 vfat edac_mce_amd snd_hda_codec_generic fat ledtrig_audio snd_hda_codec_hdmi kvm irqbypass crct10dif_pclmul snd_hda_intel crc32_pclmul ghash_clmulni_intel aesni_intel snd_hda_codec snd_hda_core aes_x86_64 snd_hwdep crypto_simd cryptd glue_helper snd_pcm eeepc_wmi asus_wmi sparse_keymap snd_timer rfkill igb video snd wmi_bmof ses pcspkr enclosure sp5100_tco mxm_wmi
    Jan 23 02:17:32 localhost.localdomain kernel:  k10temp ccp i2c_piix4 scsi_transport_sas soundcore dca gpio_amdpt gpio_generic button acpi_cpufreq sr_mod cdrom hid_logitech_hidpp hid_logitech_dj hid_generic usbhid uas usb_storage amdgpu amd_iommu_v2 gpu_sched i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm crc32c_intel xhci_pci xhci_hcd sata_sil24 usbcore nvme nvme_core wmi sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua
    Jan 23 02:17:32 localhost.localdomain kernel: ---[ end trace 617bcac25e37fdad ]---
    This repeats about ten times with the number in brackets in the "general protection fault: 0000 [#1] " line incrementing by 1 each time until it reaches 10. The PID referenced on the CPU line below that changes in each case, referencing various executing processes.

    I notice it references "asmedia", which I believe is the docking station controller that was associated with some of the Internet reports I looked at in the last hour which were related to kernel panics associated with docking stations.

    I also noticed this line:
    Jan 23 02:17:32 localhost.localdomain kernel: xhci_hcd 0000:0c:00.3: bad transfer trb length 504 in event trb

    Which doesn't look good and also seems to indicate a data transfer failure.
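
    To see how often these two messages show up in a saved log, a small filter along the following lines might help. This is only a sketch: summarize_faults is a hypothetical helper name, and it assumes the log text arrives on standard input in the same format as the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): read journal text on stdin
# and count kernel general protection faults and xHCI bad-transfer events.
summarize_faults() {
    awk '
        # Matches lines like "kernel: general protection fault: 0000 [#1] SMP NOPTI"
        /general protection fault:/ { gpf++ }
        # Matches lines like "xhci_hcd 0000:0c:00.3: bad transfer trb length 504 in event trb"
        /xhci_hcd .*bad transfer trb length/ { xhci++ }
        END {
            printf "general protection faults: %d\n", gpf
            printf "xhci bad-transfer events: %d\n", xhci
        }
    '
}
```

    It could be fed from the journal, for example with "journalctl -k -b -1 | summarize_faults", assuming the journal from the boot that froze is still available.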

    Again, I don't know if this is related to my previous freezes as I don't recall seeing anything like this previously in the logs after my previous freezes. I suspect this may be a random glitch associated with the docking stations. But I've never had it happen before that the system froze on booting the external drives.

    Anyone got any ideas?

    Oh, BTW, I installed a new Tripp-Lite surge protector. I also opened the box, blew out any dust, and re-seated the memory modules. The BIOS remains at defaults except for my aggressive fan profile.

  6. #36
    Join Date
    Jan 2014
    Location
    Erlangen
    Posts
    2,294
    Blog Entries
    1

    Default Re: Two Complete System Freezes Within One Hour (Leap 15.2)

    Quote Originally Posted by dickhack View Post
    OK, I think I found something.

    I had a freeze tonight at 2:18 AM Pacific time. I had just turned on my three docking stations which hold my six backup drives. The Most Recent Device popup just came up to show the drives mounted and the system froze solid as usual.

    This time I found a general protection fault in the logs. I'm not sure if this issue is the same as the freezing I've been having. Some research suggests that the Linux kernel has had issues with external USB 3 docking stations in the past - something to do with the XHCI support, whatever that is - so it might be related to that.

    ...

    This repeats about ten times with the number in brackets in the "general protection fault: 0000 [#1] " line incrementing by 1 each time until it reaches 10. The PID referenced on the CPU line below that changes in each case, referencing various executing processes.

    I notice it references "asmedia", which I believe is the docking station controller that was associated with some of the Internet reports I looked at in the last hour which were related to kernel panics associated with docking stations.

    I also noticed this line:
    Jan 23 02:17:32 localhost.localdomain kernel: xhci_hcd 0000:0c:00.3: bad transfer trb length 504 in event trb

    Which doesn't look good and also seems to indicate a data transfer failure.

    Again, I don't know if this is related to my previous freezes as I don't recall seeing anything like this previously in the logs after my previous freezes. I suspect this may be a random glitch associated with the docking stations. But I've never had it happen before that the system froze on booting the external drives.

    Anyone got any ideas?
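    You can enable persistent journalling, so that the log survives a freeze and reboot, and then filter the journal by priority (-p3 shows priority "err" and worse), for example: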

    Code:
    erlangen:~ # journalctl -p3 |grep drkonqi 
    Jan 03 14:53:05 erlangen drkonqi[17740]: This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem. 
    Jan 09 09:20:53 erlangen drkonqi[32594]: This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem. 
    Jan 11 13:39:18 erlangen drkonqi[14275]: This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem. 
    Jan 12 12:37:19 erlangen drkonqi[25678]: This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem. 
    Jan 15 06:02:25 erlangen drkonqi[17867]: This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem. 
    Jan 16 16:28:43 erlangen drkonqi[24316]: This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem. 
    Jan 21 12:34:30 erlangen drkonqi[7984]: This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem. 
    Jan 23 06:43:57 erlangen drkonqi[23198]: This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem. 
    Jan 23 06:45:30 erlangen drkonqi[2001]: This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem. 
    Jan 23 06:50:01 erlangen drkonqi[2019]: This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem. 
    erlangen:~ #
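
    For the kernel messages from the boot that froze, something like the following might be a starting point. It is a sketch under assumptions: the function names are made up for illustration, journald is assumed to run with its default Storage=auto setting (where the existence of /var/log/journal decides whether logs are kept across reboots), and the grep is used instead of -p3 because the fault lines are not necessarily logged at "err" priority.

```shell
#!/bin/sh
# journal_is_persistent [DIR]: succeed if DIR (default /var/log/journal) exists.
# Assumption: journald uses Storage=auto, so this directory controls persistence.
journal_is_persistent() {
    [ -d "${1:-/var/log/journal}" ]
}

# Hypothetical helper: create the journal directory so logs survive a reboot.
# Needs root; defined here but deliberately not called.
enable_persistent_journal() {
    mkdir -p /var/log/journal &&
        systemd-tmpfiles --create --prefix /var/log/journal &&
        systemctl restart systemd-journald
}

# Hypothetical helper: show fault-related kernel messages from the previous boot.
# -k selects kernel messages, -b -1 selects the boot before the current one.
prev_boot_kernel_faults() {
    journalctl -k -b -1 --no-pager |
        grep -E 'general protection fault|xhci_hcd|Call Trace'
}
```

    After a freeze and reboot, running prev_boot_kernel_faults as root would list any matching lines from the boot that froze, provided journal_is_persistent succeeded before the freeze happened.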
    
    
    AMD Athlon 4850e (2009), openSUSE 13.1, KDE 4, Intel i3-4130 (2014), i7-6700K (2016), i5-8250U (2018), AMD Ryzen 5 3400G (2020), openSUSE Tumbleweed, KDE Plasma 5
