[OpenWrt-Devel] [PATCH] ramips: gsw_mt7621: disable PORT 5 MAC RX/TX flow control by default
Kristian Evensen
kristian.evensen at gmail.com
Fri Feb 14 05:13:25 EST 2020
Hi everyone,
I am sorry for my late reply to this thread. My email provider flagged
it as spam, so I only saw the conversation now. It seems that you have
reached a conclusion on how to proceed, but I thought I should share
my notes/observations on this issue anyway (in case they can be
useful).
My employer has a large number of Mediatek-based (mt7620 and mt7621)
routers in production. Most routers have a minimum of two internet
connections - one fixed and one using mobile broadband. Some time in
2017 we started receiving reports from a few customers that the switch
would stop working. The link was up, but no data would go through.
Looking at the logs, we could always see the "TX timeout" error
message and we started to look for a cause.
We quickly ruled out any kind of crash, as the LTE was still up and
wifi worked fine. After getting a few of these reports, we started
looking for things that were common between the different
installations. We struggled to find any; there were all sorts of
devices connected to the different ports on the routers. The only
thing the different cases had in common was that the problem
disappeared when whatever was connected to the WAN port was
disconnected. However, again, the equipment that provided the fixed
connection came from all sorts of vendors.
After scratching our heads for a while and not getting anywhere, I
asked here on the mailing list and was told that restarting networking
should at least make the switch work fine again. We added a watchdog
that does exactly that whenever the TX timeout message appears.
Restarting networking improved the situation considerably, but the
switch would still sometimes get stuck and never recover.
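A watchdog of this kind can be sketched roughly as follows. This is
not our actual script, only a minimal illustration in Python (which is
not installed on OpenWrt by default). It assumes the kernel log can be
read with dmesg and that matching the substring "TX timeout" is
enough; the names count_tx_timeouts and watchdog are made up for the
example.

```python
import subprocess
import time

# Substring taken from the error described above; the full log line may differ.
TX_TIMEOUT_MARKER = "TX timeout"


def count_tx_timeouts(log_text: str) -> int:
    """Count the log lines that contain the TX timeout marker."""
    return sum(1 for line in log_text.splitlines() if TX_TIMEOUT_MARKER in line)


def watchdog(poll_seconds: float = 10.0) -> None:
    """Poll the kernel log and restart networking when new TX timeouts appear."""
    seen = 0
    while True:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=False
        )
        current = count_tx_timeouts(result.stdout)
        if current > seen:
            # Standard OpenWrt way of restarting networking.
            subprocess.run(["/etc/init.d/network", "restart"], check=False)
        # Track the latest count; it can shrink when the ring buffer wraps.
        seen = current
        time.sleep(poll_seconds)
```

Calling watchdog() from a service started at boot would give the
behaviour described above. Comparing counts rather than just testing
for the marker avoids restarting networking over and over for an old
message that is still in the log buffer.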
This prompted us to make a second attempt at recreating the problem.
Our test was the same as what Rene described. We assumed the problem
had something to do with sending large amounts of traffic at high
speed, so we used iperf3 as a traffic generator and sent traffic
between different machines connected to the switch. One of these
machines was quite unstable and prone to crashing, and we noticed that
whenever that machine crashed, the TX timeout issue would trigger
and no traffic would pass through the switch.
A normal packet capture didn't reveal anything interesting, but
connecting a network tap did. Looking at the packets captured from the
tap, we could see a flood of pause frames from the crashed machine.
When this flood occurred, the switch stopped transmitting packets on
all the ports and not just the one that the crashed machine was
connected to. This caught us by surprise, but after doing some
research, it seems to be common behavior among "normal" switches.
Also, even when we waited a long time, the switch would not recover.
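For anyone who wants to look for the same pattern in their own tap
captures, here is a small Python sketch of how a pause frame can be
recognised from the raw Ethernet bytes. It relies on the standard
802.3x layout (EtherType 0x8808, opcode 0x0001, followed by a 16-bit
pause time in quanta of 512 bit times). The function names are made up
for the example, and it assumes the frame starts at the destination
MAC with no VLAN tag.

```python
import struct
from typing import Optional

MAC_CONTROL_ETHERTYPE = 0x8808  # EtherType used by MAC control frames
PAUSE_OPCODE = 0x0001           # MAC control opcode for PAUSE


def pause_quanta(frame: bytes) -> Optional[int]:
    """Return the pause time in quanta, or None if this is not a pause frame."""
    if len(frame) < 18:
        return None
    # Bytes 12..17: EtherType, opcode, pause time (all big-endian 16-bit).
    ethertype, opcode, quanta = struct.unpack("!HHH", frame[12:18])
    if ethertype != MAC_CONTROL_ETHERTYPE or opcode != PAUSE_OPCODE:
        return None
    return quanta


def pause_seconds(quanta: int, link_bits_per_second: int) -> float:
    """Convert pause quanta (512 bit times each) to seconds for a link speed."""
    return quanta * 512 / link_bits_per_second


# Example: a pause frame asking for the maximum pause time.
example = (
    bytes.fromhex("0180c2000001")    # reserved multicast destination
    + bytes.fromhex("020000000001")  # made-up source MAC
    + struct.pack("!HHH", MAC_CONTROL_ETHERTYPE, PAUSE_OPCODE, 0xFFFF)
    + bytes(42)                      # padding to minimum frame size
)
print(pause_quanta(example), pause_seconds(0xFFFF, 10**9))
```

On a gigabit link the maximum pause time is only about 33.5 ms, so a
port only stays blocked if the sender keeps refreshing the pause,
which is what a continuous flood does.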
After discovering that pause frames seemed to be at least one trigger
for TX timeout, we added support to the driver for enabling/disabling
flow control on each of the ports, plus an init script that does the
disabling. Since we deployed this change on our routers, we have not
had a single report about switches that stop working. We do sometimes
still see the "TX timeout" error, but it is no longer critical.
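To illustrate what "disabling flow control" on a port amounts to at
the register level, here is a rough Python sketch. It is not our
driver patch. The register offset and the bit positions are
assumptions based on my reading of the mainline Linux mt7530 driver
header (per-port PMCR at 0x3000 + 0x100 * port, TX flow control in
bit 5, RX flow control in bit 4) and should be checked against the
datasheet. The function names and the example value are only for
illustration.

```python
# Assumed layout of the per-port MAC control register (PMCR).
# Verify against the MT7530/MT7621 documentation before relying on it.
PMCR_BASE = 0x3000
PMCR_STRIDE = 0x100
PMCR_TX_FC_EN = 1 << 5  # assumed: forced TX flow control enable
PMCR_RX_FC_EN = 1 << 4  # assumed: forced RX flow control enable


def pmcr_offset(port: int) -> int:
    """Register offset of the PMCR for the given switch port."""
    return PMCR_BASE + port * PMCR_STRIDE


def disable_flow_control(pmcr_value: int) -> int:
    """Return the PMCR value with both flow control bits cleared."""
    return pmcr_value & ~(PMCR_TX_FC_EN | PMCR_RX_FC_EN)


# Example with an arbitrary register value that has both bits set.
old = 0x5E33B
new = disable_flow_control(old)
print(hex(pmcr_offset(5)), hex(old), "->", hex(new))
```

A real implementation would do this read-modify-write through the
switch driver's register accessors for every port that should ignore
pause frames.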
We never tried to disable flow control on the CPU-port only, which
seems like a more elegant approach than disabling each port
individually. I do agree that disabling pause frames is more of a
workaround than a solution, but it has at least eradicated the
problem for us. I never got around to submitting our patch, but if
anyone would find it useful, I can do so quite soon.
BR,
Kristian