Glen Pitt-Pladdy :: BlogHunting latency on OpenWrt | |||
With the combination of a OpenWrt router/firewall and my current Cacti ping (latency) graphs it was becoming obvious that the latency of the OpenWrt device had some unusually large peak latency for a usually idle router. LoggingAs with most thing, I tend to start by seeing if there are any clues in the logs. While there was nothing obvious, at the exact times that there was high latency I found bursts of log activity. Logging is done to a RAM filesystem so while that is busy, everything has to wait. I have fairly comprehensive logging on the device and didn't want to turn that down so the other option is to offload it to another server. I am running rsyslog on the server so it needed a little extra config. Ideally I also want to keep the OpenWrt device logs in their own file. To keep things neat I created a new config for the router on the server that is to keep the logs - say /etc/rsyslog.d/local-openwrt.conf with the following config:
# log via udp This essentially enables log input via UDP on port 514 (standard syslog port) and puts all the logs from the hostname "OpenWrt" in the file /var/log/openwrt-messages, and then excludes those lines from further logging else they would also end up in the other logs on the server. This config is almost exactly as discussed on the rsyslog examples. To make rsyslog listen on the port requires a full restart. Also remember to open any further packet filtering (eg. on the server) to allow connection from the OpenWrt device. Next we need to configure the OpenWrt device to log remotely. I am running syslogd-ng on it so that requires a few changes to /etc/syslog-ng.conf to enable remote logging. I added a new destination:
destination remote { And then change the destination line in the log section to use the remote log: destination(remote); Restart syslog-ng on the OpenWrt device and all things being correct you should see logs going into the logfile on the server instead of the device. The impact on latency is quite obvious. I made the change shortly after 18:30 and the change in the latency graph clearly reflects this:
The improvement is obvious and especially considering that from about 19:00 the router was shifting a lot of traffic (backups) for a couple hours without significant impact, but there is still more that can be gained, so time to dig further into the latency.... Log bufferingIt turns out that the logs are still showing bursts of logging (eg. port scanning activity) exactly matching the times of the poor latency. One thing I have experimented with is greatly increasing the log FIFO size, but this has made no significant difference. No loggingThe remaining question is how much impact logging is having on the latency. As an experiment I tried disabling logging all together.
There is definitely a slight change in latency, from about 18:00 when the change was made with the high peaks gone, but certainly not the medium (about 4ms) peaks. Those are clearly caused by something else. It does seem that even with a large (64k) FIFO and remote logging, firewall logging is bad when it comes to latency. The best approach I can think of to solve this is to be more selective about logging and removing a lot of of regular stuff like duplicate ACKs (symptom of packet loss) without logging them. Possibly also FIN packets (connection closes) which are also prone to duplicates and worse those often come in bursts when connections to a site are closed when browsing onto another site. The thing that remains is to find the cause of the remaining mid-height (around 4ms) spikes. Digging deeperExamining the ping logs manually, the latency is not periodic (ie. it isn't some cron job or Cacti polling). There is nothing in the logs that seems to match. Without logging, it's difficult to match anything firewall related, and there is no obvious peaks in traffic - in fact many of the peaks are happening at quiet times. Stopping dnsmasq and ntpclient has made no noticable difference. There is very little other than kernel running on the router: watchdog, httpd, dropbear (ssh), hotplug2, cron, and syslog-ng. None of those things are likely to be doing much at the time of the latency - all of them are going to be almost always idle or operating periodically which would show up. What's left?I suspect what remains is just normal kernel related latency. As this is a single CPU (and a small one at that) any atomic operation has to complete before the CPU is available to do other stuff and threads don't actually run concurrently with the single CPU - they all have to wait their turn. That probably also means that many routine things like servicing hardware interrupts (eg. timers), memory management, syncing filesystems, and more could cause some waits for the network stack. I've come across other similar platforms from big router brands delivering far higher average latency with their OS' so really it's probably a bit unfair given that it's averaging only about double what big quad-core desktop system manages in the way of latency. |
|||
Disclaimer: This is a load of random thoughts, ideas and other nonsense and is not intended to be taken seriously. I have no idea what I am doing with most of this so if you are stupid and naive enough to believe any of it, it is your own fault and you can live with the consequences. More importantly this blog may contain substances such as humor which have not yet been approved for human (or machine) consumption and could seriously damage your health if taken seriously. If you still feel the need to litigate (or whatever other legal nonsense people have dreamed up now), then please address all complaints and other stupidity to yourself as you clearly "don't get it".
Copyright Glen Pitt-Pladdy 2008-2023
|
Comments: