[PATCH 21.02] ipq806x: backport cpufreq changes to 5.4
Ansuel Smith
ansuelsmth at gmail.com
Wed Sep 8 04:19:52 PDT 2021
On Wed, 8 Sep 2021 at 02:11, Shane Synan
<digitalcircuit36939 at gmail.com> wrote:
>
> On 8/24/21 7:21 PM, Shane Synan wrote:
> > The fix hasn't been found, but progress has been made!
> >
> > After further testing, I think I've found a way to recreate this issue
> > with just the router itself, no external USB HDD, no Déjà Dup backup
> > over SFTP, and possibly no extra changes beyond a stock NBG6817
> > OpenWRT build (not confirmed as this router runs my home network,
> > including SQM QoS, VLANs with another WiFi AP, etc).
>
> So far, I've attempted all three suggested fixes, but I had trouble
> implementing one and I'm unsure if I tried the other two correctly.
> Additionally, pinning to "performance" for 1.75 GHz does not solve
> the issue either - more on that near the end.
>
> I've put all of my commits into one branch for easier reference:
> https://github.com/openwrt/openwrt/compare/master...digitalcircuit:ft-fix-ipq8065-reset
>
> And I've used my simplified automatic QA script for verification:
> https://github.com/digitalcircuit/openwrt-ipq806x-qa-cpu-reset#readme
>
> (In theory, anyone should be able to reproduce the issue with this
> script on a stock OpenWRT build. I'll still do testing with the
> Déjà Dup SFTP backup workload.)
>
>
> Suggestions and results in order of attempt:
> (Ignore "ipq8065: force CPUs to share DVFS scaling", wrong method.)
>
>
> 1. Raising clock latency (commits with "clock latency" in subject)
>
> I've tried raising the clock-latency-ns in the ipq8065 DTS by 1000000
> nanoseconds, a deliberately excessive value in the hopes of it being
> enough to notice any issues.
>
> I've tried this for...
>
> * 1.4 GHz and 1.75 GHz
> (ipq8065: raise 1.4 & 1.75 GHz clock latency)
> * All CPU frequencies
> (previous + ipq8065: raise all clock latency)
> * All CPU frequencies and L2 cache latency
> (previous + ipq8065: raise L2 cache, CPU core clock latency)
>
> Unfortunately, as noted in the revert commit, this seemed to have no
> impact on the results from the QA script.
>
> I don't know if I've correctly implemented this suggestion.
>
> QA script log on b1870c2 (.tar.xz due to 12.2 MiB uncompressed size):
> https://zorro.casa/sync/Hosting/Utilities/Development/OpenWRT/mailing-list/ipq806x_%20backport%20cpufreq%20changes%20to%205.4/debug-cpufreq%20-%20clock%20default%20test%20case1%20-%202021-08-30%2022-37-50%20-%20r17395-b1870c2530-branch-ft-fix-ipq8065-reset%20-%20date%20segfault%2C%20reboot%20-%20public.tar.xz
>
>
> 2. Run both cores at the same frequency (most promising?)
>
> I tried to do this (ipq806x: Force CPU cores to share frequency), but
> I think I didn't modify the cpufreq driver in the correct way.
>
> As noted in the revert commit, this didn't appear to force CPUs to
> share frequency, whether manually using the performance governor or
> periodically observing the ondemand governor - the CPU cores were at
> different frequencies.
>
> I'll need help figuring out how to implement this in the cpufreq
> driver correctly. It seems promising given that in the past,
> dual-core bursty workloads didn't seem to trigger the crash.
>
> NOTE: Before diving into implementing this, read the conclusion below
> as I've noticed reboots happen without changing CPU frequency as well.
>
> I'm also not sure how to debug the cpufreq driver in general. With
> dynamic debugging, I can turn on messages about the cpufreq governor,
> but I'm not sure of the right way to add dynamic debugging print
> messages to the cpufreq driver.
>
> Example of dynamic debugging:
> echo "file drivers/cpufreq/* =p" > /sys/kernel/debug/dynamic_debug/control
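A note on the command above: dynamic debug only toggles pr_debug() and
dev_dbg() call sites that are already compiled into the kernel, so new
messages in the cpufreq driver would first have to be added to its source
as such calls. Below is a minimal sketch for limiting the query to a single
file pattern. The helper name and the overridable control path are
illustrative assumptions, not part of the QA script.

```shell
#!/bin/sh
# Hypothetical helper: enable dynamic-debug printing for one file pattern.
# $1: file pattern as used in a dynamic debug "file" query
# $2: control file (defaults to the standard debugfs path; overridable
#     so the helper can be dry-run against an ordinary file)
dyndbg_enable_file() {
    pattern="$1"
    control="${2:-/sys/kernel/debug/dynamic_debug/control}"
    [ -w "$control" ] || { echo "cannot write $control" >&2; return 1; }
    # "+p" adds the print flag; "=p" (as in the example above) sets the
    # flags to exactly "p"
    echo "file $pattern +p" > "$control"
}
```

Usage would look like dyndbg_enable_file 'drivers/cpufreq/*', assuming
debugfs is mounted and the kernel was built with CONFIG_DYNAMIC_DEBUG.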
>
> QA script log on 1fdabd9 (.tar.xz due to 4.4 MiB uncompressed size):
> https://zorro.casa/sync/Hosting/Utilities/Development/OpenWRT/mailing-list/ipq806x_%20backport%20cpufreq%20changes%20to%205.4/debug-cpufreq%20-%20clock%20default%20test%20case1%20-%202021-08-31%2020-38-04%20-%20r17397-1fdabd95db-branch-ft-fix-ipq8065-reset%20-%20date%20segfault%2C%20reboot%20-%20public.tar.xz
>
>
> 3. Add forced frequency transitions between 1.0 GHz and 1.75 GHz
>
> I'm not sure if I implemented this correctly. I made a first attempt
> (ipq806x: Add transitions to 1.0 <> 1.4 <> 1.75 GHz), but if the
> frequency transitions happen, they're too fast to observe. And as
> noted above, I'm not yet sure of the right way to add dynamic
> debugging messages.
>
> Running the QA script in "case1" (toggle 800 MHz to 1.75 GHz) still
> crashes.
>
> QA script log on 52f4f77 (.tar.xz due to 471.8 KiB uncompressed size):
> https://zorro.casa/sync/Hosting/Utilities/Development/OpenWRT/mailing-list/ipq806x_%20backport%20cpufreq%20changes%20to%205.4/debug-cpufreq%20-%20clock%20default%20test%20case1%20-%202021-09-07%2019-58-07%20-%20r17399-52f4f77518-branch-ft-fix-ipq8065-reset%20-%20reboot%20-%20public.tar.xz
>
> Separately, I updated the QA script to add a "ramp1" case which
> smoothly ramps the CPU core frequency up/down from 600 MHz to
> 1.75 GHz, stopping at every frequency in between. Unfortunately,
> this still crashes.
>
> Interestingly, the crash again happens when CPU core frequencies are
> distant from each other (1.75 GHz and 800 MHz). This lends credence
> to the idea of locking CPU frequencies together for 1.4 and 1.75 GHz.
>
> QA script log on 11e9380 (.tar.xz due to 5.8 MiB uncompressed size):
> (11e938030d is from https://github.com/openwrt/openwrt/pull/4464 )
> https://zorro.casa/sync/Hosting/Utilities/Development/OpenWRT/mailing-list/ipq806x_%20backport%20cpufreq%20changes%20to%205.4/debug-cpufreq%20-%20clock%20default%20test%20ramp1%20-%202021-09-01%2000-53-25%20-%20r17372-11e938030d-branch-fix-ipq8065-dts-opp-order%20-%20date%20segfault%2C%20reboot%20-%20public.tar.xz
>
>
> Observations and thoughts:
>
> My best guess involves having the CPUs match frequency at 1.4 GHz and
> above. However, using the "performance" governor at 1.75 GHz does
> NOT allow a full Déjà Dup backup to successfully complete - the
> router still hard-reboots after up to 8 hours of the intermittent
> single-core workload with both core frequencies pinned to 1.75 GHz.
>
> There may be a combination of issues - CPU frequency shifting and CPU
> voltage, perhaps?
>
> I may need to revisit raising the CPU voltage. I had increased it by
> 20000 microvolts in the past for all frequencies without success, but
> perhaps it needs to be raised even higher for 1.4 and/or 1.75 GHz..?
>
> Context for CPU voltage tests:
> https://github.com/openwrt/openwrt/compare/openwrt-21.02...digitalcircuit:openwrt-21.02-cpufreq-dtsivolt-cache-fix-opp-order
> (It's in the last two commits; the earlier commits are backporting.)
>
> I'll need to expand the QA test script to provide a simulated
> single-core bursty workload to see if I can make this aspect of the
> issue easier for others to reproduce. I planned to do this before
> sending this email, but other plans got in the way.
>
> A wild guess: instability triggered by having one CPU core draw power
> for a single-core workload while the other CPU core is idle..?
>
>
> With max frequency set to 1.0 GHz, I haven't observed instability
> jumping between 600 MHz and 1.0 GHz. Limiting 'scaling_max_freq' to
> 1.0 GHz is a slow workaround on stock firmware for anyone impacted by
> this in the way I am (e.g. not being able to run a complete backup).
> This is NOT a fix, just sharing in case others would have use for a
> workaround meanwhile.
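The workaround described above can be sketched as a small helper that
writes the cap to every CPU's cpufreq policy. This is only a sketch under
assumptions: the helper name is made up, and the value 1000000 (kHz) in the
usage note assumes that the 1.0 GHz step is listed that way in
scaling_available_frequencies on the device.

```shell
#!/bin/sh
# Hypothetical helper: cap scaling_max_freq (in kHz) for every CPU.
# $1: maximum frequency in kHz
# $2: CPU sysfs root (defaults to the standard path; overridable so the
#     helper can be dry-run against a fake directory tree)
cap_max_freq() {
    khz="$1"
    root="${2:-/sys/devices/system/cpu}"
    count=0
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_max_freq; do
        [ -w "$f" ] || continue   # skip CPUs without a writable policy
        echo "$khz" > "$f" && count=$((count + 1))
    done
    # Fail if nothing was written, so a wrong path is not silent
    [ "$count" -gt 0 ]
}
```

Usage would be cap_max_freq 1000000, run as root. The setting does not
survive a reboot, so it would have to be reapplied at boot.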
>
> I'm unsure if I implemented clock latency or CPU transition
> frequencies correctly, and I know I didn't implement CPU frequency
> matching correctly.
>
>
> Once again, thank you for looking into this! I'll continue
> researching and tinkering meanwhile - I've not given up on this saga
> yet :)
>
> Regards,
> Shane Synan
Can you try all of the above tests with:
- L2 scaling disabled
- L2 freq set to max
- CPU idle 300 MHz never enabled?
Also, a good thing to check would be the divider clks (get the
regulator_summary output and put it in a file).