[PATCH 21.02] ipq806x: backport cpufreq changes to 5.4
Shane Synan
digitalcircuit36939 at gmail.com
Tue Sep 7 17:11:25 PDT 2021
On 8/24/21 7:21 PM, Shane Synan wrote:
> The fix hasn't been found, but progress has been made!
>
> After further testing, I think I've found a way to recreate this issue
> with just the router itself, no external USB HDD, no Déjà Dup backup
> over SFTP, and possibly no extra changes beyond a stock NBG6817
> OpenWRT build (not confirmed as this router runs my home network,
> including SQM QoS, VLANs with another WiFi AP, etc).
So far, I've attempted all three suggested fixes, but I had trouble
implementing one and I'm unsure if I tried the other two correctly.
Additionally, pinning both cores to 1.75 GHz with the "performance"
governor does not solve the issue either - more on that near the end.
I've put all of my commits into one branch for easier reference:
https://github.com/openwrt/openwrt/compare/master...digitalcircuit:ft-fix-ipq8065-reset
And I've used my simplified automatic QA script for verification:
https://github.com/digitalcircuit/openwrt-ipq806x-qa-cpu-reset#readme
(In theory, anyone should be able to reproduce the issue with this
script on a stock OpenWRT build. I'll still do testing with the
Déjà Dup SFTP backup workload.)
Suggestions and results in order of attempt:
(Ignore "ipq8065: force CPUs to share DVFS scaling", wrong method.)
1. Raising clock latency (commits with "clock latency" in subject)
I've tried raising the clock-latency-ns in the ipq8065 DTS by 1000000
nanoseconds (1 ms), a deliberately excessive value chosen so that any
effect on the crashes would be easy to notice.
I've tried this for...
* 1.4 GHz and 1.75 GHz
(ipq8065: raise 1.4 & 1.75 GHz clock latency)
* All CPU frequencies
(previous + ipq8065: raise all clock latency)
* All CPU frequencies and L2 cache latency
(previous + ipq8065: raise L2 cache, CPU core clock latency)
Unfortunately, as noted in the revert commit, this seemed to have no
impact on the results from the QA script.
I don't know if I've correctly implemented this suggestion.
QA script log on b1870c2 (.tar.xz due to 12.2 MiB uncompressed size):
https://zorro.casa/sync/Hosting/Utilities/Development/OpenWRT/mailing-list/ipq806x_%20backport%20cpufreq%20changes%20to%205.4/debug-cpufreq%20-%20clock%20default%20test%20case1%20-%202021-08-30%2022-37-50%20-%20r17395-b1870c2530-branch-ft-fix-ipq8065-reset%20-%20date%20segfault%2C%20reboot%20-%20public.tar.xz
2. Run both cores at the same frequency (most promising?)
I tried to do this (ipq806x: Force CPU cores to share frequency), but
I think I didn't modify the cpufreq driver in the correct way.
As noted in the revert commit, this didn't appear to force CPUs to
share frequency, whether manually using the performance governor or
periodically observing the ondemand governor - the CPU cores were at
different frequencies.
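For anyone wanting to repeat the "are the cores at the same frequency"
check, here is a minimal sketch of how it can be done from a shell.
It assumes the standard cpufreq sysfs layout; the CPU_SYSFS variable
and the cores_share_freq function name are made up for illustration
and are not part of the QA script.

```shell
#!/bin/sh
# Sketch only: print each core's current frequency and report whether
# all cores match. CPU_SYSFS is a hypothetical override (handy for
# testing); by default it points at the usual cpufreq sysfs location.
CPU_SYSFS="${CPU_SYSFS:-/sys/devices/system/cpu}"

cores_share_freq() {
    first=""
    status=0
    for f in "$CPU_SYSFS"/cpu[0-9]*/cpufreq/scaling_cur_freq; do
        # Skip cores without a readable cpufreq entry
        [ -r "$f" ] || continue
        cur="$(cat "$f")"
        echo "$f: $cur kHz"
        if [ -z "$first" ]; then
            first="$cur"
        elif [ "$cur" != "$first" ]; then
            status=1
        fi
    done
    if [ "$status" -eq 0 ]; then
        echo "MATCH"
    else
        echo "MISMATCH"
    fi
    return "$status"
}
```

Calling cores_share_freq once a second in a loop gives a rough view
only, since it samples the frequencies and can miss short-lived states.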
I'll need help figuring out how to implement this in the cpufreq
driver correctly. It seems promising given that in the past,
dual-core bursty workloads didn't seem to trigger the crash.
NOTE: Before diving into implementing this, read the conclusion below
as I've noticed reboots happen without changing CPU frequency as well.
I'm also not sure how to debug the cpufreq driver in general. With
dynamic debugging, I can turn on messages about the cpufreq governor,
but I'm not sure of the right way to add dynamic debugging print
messages to the cpufreq driver.
Example of dynamic debugging:
echo "file drivers/cpufreq/* =p" > /sys/kernel/debug/dynamic_debug/control
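As far as I understand it, dynamic debug only controls pr_debug() and
dev_dbg() call sites compiled into a kernel with CONFIG_DYNAMIC_DEBUG,
so messages added to the driver would need to use one of those. A
small sketch for listing the registered call sites and enabling them
per file follows; the DDEBUG_CONTROL variable and both function names
are hypothetical helpers, and debugfs is assumed to be mounted at
/sys/kernel/debug.

```shell
#!/bin/sh
# Sketch only: helpers around the dynamic debug control file.
# DDEBUG_CONTROL is a hypothetical override; the default is the usual
# debugfs path.
DDEBUG_CONTROL="${DDEBUG_CONTROL:-/sys/kernel/debug/dynamic_debug/control}"

# List registered debug call sites whose entry matches a pattern.
ddebug_list() {
    grep -- "$1" "$DDEBUG_CONTROL"
}

# Enable printing (+p) for every call site in the given source file
# or wildcard, e.g. 'drivers/cpufreq/*'.
ddebug_enable_file() {
    echo "file $1 +p" > "$DDEBUG_CONTROL"
}
```

If ddebug_list cpufreq shows no lines for the driver source file, the
driver has no debug call sites for dynamic debug to switch on.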
QA script log on 1fdabd9 (.tar.xz due to 4.4 MiB uncompressed size):
https://zorro.casa/sync/Hosting/Utilities/Development/OpenWRT/mailing-list/ipq806x_%20backport%20cpufreq%20changes%20to%205.4/debug-cpufreq%20-%20clock%20default%20test%20case1%20-%202021-08-31%2020-38-04%20-%20r17397-1fdabd95db-branch-ft-fix-ipq8065-reset%20-%20date%20segfault%2C%20reboot%20-%20public.tar.xz
3. Add forced frequency transitions between 1.0 GHz and 1.75 GHz
I'm not sure if I implemented this correctly. I made a first attempt
(ipq806x: Add transitions to 1.0 <> 1.4 <> 1.75 GHz), but if the
frequency transitions happen, they're too fast to observe. And as
noted above, I'm not yet sure of the right way to add dynamic
debugging messages.
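One possible way to tell whether transitions happen at all, without
debug messages, is the cpufreq statistics in sysfs. This is only a
sketch and assumes the kernel was built with CONFIG_CPU_FREQ_STAT,
which I have not checked for the stock build; CPU_SYSFS and
show_total_trans are hypothetical names.

```shell
#!/bin/sh
# Sketch only: print "<cpu> <total_trans>" for every core that exposes
# cpufreq statistics. CPU_SYSFS is a hypothetical override for testing.
CPU_SYSFS="${CPU_SYSFS:-/sys/devices/system/cpu}"

show_total_trans() {
    for d in "$CPU_SYSFS"/cpu[0-9]*/cpufreq/stats; do
        # Skip cores without statistics (e.g. CONFIG_CPU_FREQ_STAT off)
        [ -r "$d/total_trans" ] || continue
        cpu="${d#"$CPU_SYSFS"/}"   # strip the sysfs prefix
        cpu="${cpu%%/*}"           # keep only the "cpuN" component
        echo "$cpu $(cat "$d/total_trans")"
    done
}
```

Comparing the counts before and after a test run shows how many
frequency changes occurred; the time_in_state file in the same
directory shows how long each frequency was in use.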
Running the QA script in "case1" (toggle 800 MHz to 1.75 GHz) still
crashes.
QA script log on 52f4f77 (.tar.xz due to 471.8 KiB uncompressed size):
https://zorro.casa/sync/Hosting/Utilities/Development/OpenWRT/mailing-list/ipq806x_%20backport%20cpufreq%20changes%20to%205.4/debug-cpufreq%20-%20clock%20default%20test%20case1%20-%202021-09-07%2019-58-07%20-%20r17399-52f4f77518-branch-ft-fix-ipq8065-reset%20-%20reboot%20-%20public.tar.xz
Separately, I updated the QA script to add a "ramp1" case which
smoothly ramps the CPU core frequency up/down from 600 MHz to
1.75 GHz, stopping at every frequency in between. Unfortunately,
this still crashes.
Interestingly, the crash again happens when CPU core frequencies are
distant from each other (1.75 GHz and 800 MHz). This lends credence
to the idea of locking CPU frequencies together for 1.4 and 1.75 GHz.
QA script log on 11e9380 (.tar.xz due to 5.8 MiB uncompressed size):
(11e938030d is from https://github.com/openwrt/openwrt/pull/4464 )
https://zorro.casa/sync/Hosting/Utilities/Development/OpenWRT/mailing-list/ipq806x_%20backport%20cpufreq%20changes%20to%205.4/debug-cpufreq%20-%20clock%20default%20test%20ramp1%20-%202021-09-01%2000-53-25%20-%20r17372-11e938030d-branch-fix-ipq8065-dts-opp-order%20-%20date%20segfault%2C%20reboot%20-%20public.tar.xz
Observations and thoughts:
My best guess involves having the CPUs match frequency at 1.4 GHz and
above. However, using the "performance" governor at 1.75 GHz does
NOT allow a full Déjà Dup backup to successfully complete - the
router still hard-reboots after up to 8 hours of the intermittent
single-core workload with both core frequencies pinned to 1.75 GHz.
There may be a combination of issues - CPU frequency shifting and CPU
voltage, perhaps?
I may need to revisit raising the CPU voltage. I had increased it by
20000 microvolts (20 mV) in the past for all frequencies without
success, but perhaps it needs to be raised even higher for 1.4 and/or
1.75 GHz..?
Context for CPU voltage tests:
https://github.com/openwrt/openwrt/compare/openwrt-21.02...digitalcircuit:openwrt-21.02-cpufreq-dtsivolt-cache-fix-opp-order
(It's in the last two commits; the earlier commits are backporting.)
I'll need to expand the QA test script to provide a simulated
single-core bursty workload to see if I can make this aspect of the
issue easier for others to reproduce. I planned to do this before
sending this email, but other plans got in the way.
A wild guess: instability triggered by having one CPU core draw power
for a single-core workload while the other CPU core is idle..?
With max frequency set to 1.0 GHz, I haven't observed instability
when jumping between 600 MHz and 1.0 GHz. Limiting 'scaling_max_freq'
to 1.0 GHz is a workaround (at the cost of speed) on stock firmware
for anyone impacted by this in the way I am (e.g. not being able to
run a complete backup).
This is NOT a fix, just sharing in case others would have use for a
workaround in the meantime.
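A minimal sketch of applying that limit to every core, assuming the
standard cpufreq sysfs layout (values are in kHz). CPU_SYSFS and
cap_max_freq are hypothetical names, and the setting does not survive
a reboot unless it is reapplied at startup.

```shell
#!/bin/sh
# Sketch only: cap the maximum cpufreq of every core at the given
# value in kHz. CPU_SYSFS is a hypothetical override for testing.
CPU_SYSFS="${CPU_SYSFS:-/sys/devices/system/cpu}"

cap_max_freq() {
    max_khz="$1"
    for f in "$CPU_SYSFS"/cpu[0-9]*/cpufreq/scaling_max_freq; do
        # Skip cores whose limit cannot be written (e.g. not root)
        [ -w "$f" ] || continue
        echo "$max_khz" > "$f"
    done
}

# Example (assumption: 1.0 GHz is listed as 1000000 in
# scaling_available_frequencies on this device):
# cap_max_freq 1000000
```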
I'm unsure if I implemented the clock latency change or the forced
frequency transitions correctly, and I know I didn't implement CPU
frequency matching correctly.
Once again, thank you for looking into this! I'll continue
researching and tinkering meanwhile - I've not given up on this saga
yet :)
Regards,
Shane Synan