Linux Netfilter and Window Scaling
I've recently been handling deployment of new web servers to data centres, and in testing ran into some problems with one data centre. The support staff at the data centre NOC were helpful, but were not getting to the root of the problem. I had thoroughly researched and reviewed our options for data centres and knew that this data centre was among the best in the UK, so what was going on?

Random Connections Stalling

The symptom I was continually seeing was that some connections were stalling, apparently at random. It was quite common to download nearly 1MB of data and then have the connection suddenly stall and not recover (if I had been patient enough to wait, I guess it would eventually have timed out or dropped the connection). We have secondary equipment (in case of a major failure of the primary servers) with identical hardware and software (in fact I automated the server install and configuration with a script that does all the relevant configuration on each server) which was not showing the same problems. The other interesting thing was that after trying to find clues with some statistical analysis, it came out that about half of all connections stalled, irrespective of the download size, which led me to suspect the problem was with some upstream device processing data at the TCP level. I had checked the problem from several different ISPs, connection types, client platforms and network layouts, and it was persistent.

MTU Problems

The first route we went down, on the suggestion of the NOC staff, was to check for problems with MTU. This is something that has bitten me in the past on heavily secured networks where the firewall configs were inadvertently blocking ICMP type 3 code 4, which is sent back when packets are too big for the network and can't be fragmented (eg. fragmentation is disabled on a router or DF is set on an IP packet). This was definitely a likely cause. It was thoroughly investigated and dumps of network traffic were taken (tcpdump / tshark / wireshark are your friends!) which showed no ICMP traffic. Another test is to try maximum size pings both ways through the link:

    ping -c 1 -M do -s 1472 targethostname

This sends an ICMP echo request (ping) containing 1472 bytes (making the IP packet 1500 bytes with all the headers) to the target, but requests that the packet is not fragmented (split into multiple packets). If a response comes back then you know it reached its destination safely. Then also test in the reverse direction to ensure that the path is clean both ways (the reply packets don't necessarily have the DF / Don't Fragment flag set). Also test with 1473 bytes and make sure that it does fail. If your network MTU is different (eg. where there is a tunnel in use) then adjust the sizes accordingly. Obviously, if you have a network that supports larger frames then the ping size can be increased accordingly. If you go over the limit then the node where the limit is hit should reply saying that the packet is too large, with a message like "Frag needed and DF set" (see the example commands below).

One thing that I have seen several times with this is problems between a Gigabit network with Jumbo Frames (typically MTU 9000) going into a 100Mbit network (MTU 1500). They both work within their own network segments, but run into trouble across segments if the ICMP error responses are not being generated or are being blocked. In my case everything I did to test for MTU problems checked out fine.
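For reference, the pair of path-MTU checks described above would look something like this on a standard 1500-byte Ethernet path (targethostname is a placeholder, and the same tests should be repeated from the far end):

    # Should succeed: 1472 bytes of ICMP payload + 28 bytes of headers = 1500
    ping -c 1 -M do -s 1472 targethostname
    # Should fail, either with a local "message too long" error or a
    # "Frag needed and DF set" response from the hop where the limit is hit
    ping -c 1 -M do -s 1473 targethostname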
NIC Offloading

One thing that can make diagnosis of these problems more difficult is that many network chipsets support IP offloading. This means that the network chipset will do at least part of the work to construct the outgoing IP packet and queue it for transmission (hardware assistance). This reduces the load on the CPU for assembling packets. Where this causes difficulties is that when capturing packets for diagnostics, the packets captured are not necessarily those transmitted on the wire. In some cases the network chipset may be accepting whole segments of data and dicing them up into packets on its own, which can make captures look like insanely long packets are being transmitted. The most common thing is simply outgoing packets being marked as having bad checksums, as offloading checksumming is almost universally done in the NIC chipset. To disable these you can use the ethtool utility - see the man page for the options that can be used to view and disable all the offloading (example commands at the end of this section). By disabling any offloading you get a clear picture of what is going down the wire when capturing traffic.

Still no luck...

After trying all the diagnostics that we could think of we were still no wiser. The curious thing was that it was about 50% of connections that were running into trouble. Larger transfers failed with the same regularity as small ones. To me this suggested it was not a straightforward packet loss problem - larger transfers would have had a higher chance of running into trouble if this was simply lost or corrupted packets. My suspicion was the upstream firewall at the data centre. I have previously come across problems with firewalls getting it wrong when sanitising traffic, which resulted in dropped connections and connection stalls. Previously I have had to disable many sanity checks on security devices which were blocking far more than the specific things they were supposed to. The NOC guys bounced it back - they have thousands of servers in the data centre which are not having problems: we should try disabling the firewall (Netfilter) we were running on the server. Well, thousands of other people without problems is no longer a surprise to me. Few people will be doing the level of filtering on traffic that we are, and thousands of people previously were also not thorough enough to spot that there were gaping errors in their and other data centre contracts we had reviewed (eg. references missing, calculations that didn't stack up and much more). .... and disabling the server firewall worked. But why?

Layers of security

I have learned to run multiple layers of security on anything I deploy. The fundamental reason is that when I do, I often find gaping holes in one layer. It was no different with the data centre's firewall - when I first tested it, it was obvious that it wasn't doing anything. That turned out to be a misconfiguration which the NOC team had to correct. The other thing that few people consider is that although there may be a firewall between your server and the outside world, you will still, at the very least, be sharing a network segment with potentially thousands of other servers which have direct access to your server on the same network segment. Compromised servers are common and attacks could easily be mounted against unprotected servers from peers on the network segment. Simply turning off the kernel firewall and relying only on the data centre firewall is not an option for me given past experience.
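The ethtool commands referred to in the NIC Offloading section above would be along these lines (eth0 is a placeholder, and the exact feature flags available depend on the driver and the ethtool version):

    # Show the current offload settings for the interface
    ethtool -k eth0
    # Disable checksum and segmentation offloads so that captures
    # reflect what actually goes down the wire
    ethtool -K eth0 rx off tx off sg off tso off gso off gro off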
With the number of compromised systems around, it is almost certain that there will be a few on the same network segment, and many more which are vulnerable to being compromised. Logging any packets that hit the server firewall shows how effective the data centre firewalls are, and can give valuable warning if an attack is under way from a peer on its network segment. This is also how I knew that the data centre's firewall was not offering us any protection - every packet fired at our server got logged and reported to me via monitoring systems.

Digging deeper

Determined to figure out what was causing the problems (otherwise how do I know that this isn't going to cause more problems later when we go live and have thousands of people pounding on the server), I started digging into this. Dumps of network traffic I captured showed long bursts of lost packets and/or duplicated ACKs on both client and server. A regular pattern of packet loss in TCP is nothing unusual when various devices along the way are traffic shaping, but long sustained periods of packet loss are not. I also verified the problem on several different networks and platforms. Experimenting led me to the /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal option. It was difficult to find detailed information on exactly what this does, and indications are that it has had additional functionality added over time. The best I can work out is that it primarily allows through packets which are outside the allowed TCP window, and possibly relaxes some other packet sanitisation. This suggests problems upstream with TCP window handling. The TCP window is essentially the buffer size on the receiving end. If data is arriving faster than the receiver can process it then the buffer will fill up and the receiver will return packets with a smaller TCP window size, which the sender uses to regulate the rate it is sending at. If the buffer is completely full then a window size of zero is returned and the sender has to stop until a packet with a non-zero window size is returned.

The dreaded TCP Window Scaling

The next obvious thing to try was /proc/sys/net/ipv4/tcp_window_scaling - an old favourite when connections are running slow or stalling. To improve performance on today's higher speed connections, TCP window scaling was introduced in RFC1323. This essentially allows a multiplier to be specified for the TCP window so that far larger buffers can be used, which benefits today's higher speed connections. Disabling this cleared the problem, but only when it was disabled on both client and server. It also sped up transfers by around 15% over the "be liberal" option alone, presumably due to less packet loss from packets erroneously being marked as bad by misbehaving network equipment. RFC1323 is dated 1992: 17 years ago. By now the vast majority of TCP/IP stacks should handle this, and as of Windows Vista, Microsoft enabled TCP window scaling by default (some people suggest that it also can't be disabled in Vista, or at least not easily). TCP window scaling was not the first thing on my mind with this fault - by now any network equipment vendor should have got their devices compliant. The problem is that many network equipment manufacturers have either been slow to support TCP Window Scaling or have had bugs in their software / firmware which meant that they discarded packets as invalid, or did other strange things, when TCP Window Scaling is in use.
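To confirm what a particular box is actually doing, the current settings can be read straight out of /proc, and the scale factor negotiated for a connection shows up as the "wscale" TCP option on the SYN / SYN-ACK packets in a capture (the interface and port below are placeholders):

    # Current kernel settings (1 = enabled / liberal, 0 = disabled / strict)
    cat /proc/sys/net/ipv4/tcp_window_scaling
    cat /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal
    # Watch the TCP handshake for the wscale option
    tcpdump -n -i eth0 'tcp[tcpflags] & tcp-syn != 0 and port 80'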
Some leading network equipment manufacturers issued advisories about problems in their products relating to TCP window scaling as recently as a few months ago. This obviously hasn't filtered down to the data centres and all ISPs yet. As a result I have had to implement workarounds on the servers we are deploying:
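Based on the two settings discussed above, the runtime workaround would look something like the following on a 2.6-era kernel (paths as given earlier in the article):

    # Relax Netfilter's TCP window sanity checking
    echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal
    # Disable TCP window scaling entirely
    echo 0 > /proc/sys/net/ipv4/tcp_window_scaling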
These options may also be added to sysctl so they are set at boot time:

    net.ipv4.tcp_window_scaling = 0

Boot Gotcha!

One thing to watch out for if you set these at boot is that ip_conntrack_tcp_be_liberal will often not be set during boot, as the connection tracking modules are not loaded yet. Linux loads these on demand, so when the firewall rules start getting set the modules load automatically. If you are having trouble with ip_conntrack_tcp_be_liberal not being set after boot then this is likely the cause. Personally, I have added this into the firewall scripts after the main rules have been set up to avoid the problem; otherwise manually loading the modules before setting this option should also work (see the sketch below).
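As a sketch of how this might be wired up (the sysctl key for the conntrack option is assumed from the /proc path above, and the firewall script layout is only an illustration):

    # /etc/sysctl.conf - applied at boot
    net.ipv4.tcp_window_scaling = 0
    # only takes effect if the conntrack modules are already loaded:
    net.ipv4.netfilter.ip_conntrack_tcp_be_liberal = 1

    # ... or at the end of the firewall script, once the rules (and
    # therefore the conntrack modules) are in place:
    echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal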