I recently spent an unhealthy number of days troubleshooting performance issues between remote data centers. Good thing I did, too, as I got a friendly reminder about TCP and how latency drives throughput.
We were seeing seemingly inconsistent network issues: some applications and file transfers were slow, some were fast, and some appeared to be slow in only one direction. Jobs that used to run in minutes now took hours. Packet captures showed possible signs of packet loss, such as duplicate ACKs. Needless to say, we had some troubleshooting ahead of us to either locate the source of packet loss or rule out the network.
First, we verified basic health of the network:
- Validate no congestion on links between endpoints
- Validate interface counters – look for errors, CRCs, drops, etc.
- Validate interface configurations – speed, duplex, MTU, etc.
- Validate QoS stats – Are you seeing an unusually high number of drops in a particular queue?
- Validate network device system resources – CPU, Memory, etc.
- Validate the control plane – Are packets getting punted to the router CPU?
All good so far, so why are we seeing slow file transfers? Throughput on some transfers is as low as 900 KBps. We have a 1 Gbps link between sites with only 18 ms of latency (round-trip time, RTT), so we should have no issue with throughput!
Looking at the packet captures, we learned that TCP windows were being advertised at a very small size of 17,520 bytes, and were not scaling.
This is a problem because of this very simple equation:
Receive buffer size (bits) / RTT (seconds) = Max TCP throughput (bps)
Let’s put this to work based on the observed Receive buffer size (TCP Window) and known RTT between sites:
(17,520 bytes * 8 bits) / 0.018 seconds = ~7,786,667 bps
Convert this to KBps (kilobytes per second, at 1,024 bytes per KB) and we have our observed throughput: ~950 KBps
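The same arithmetic as a minimal Python sketch, in case you want to plug in your own window size and RTT. The function name is my own for illustration, and the result is only the theoretical ceiling for a single TCP stream with no loss.

```python
def max_tcp_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on single-stream TCP throughput, in bits per second."""
    # The sender can have at most one window of data in flight per round trip.
    return window_bytes * 8 / rtt_seconds


# Values observed in the capture: 17,520-byte window, 18 ms RTT.
bps = max_tcp_throughput_bps(17_520, 0.018)
kib_per_second = bps / 8 / 1024

print(f"{bps:,.0f} bps")                 # 7,786,667 bps
print(f"{kib_per_second:,.1f} KiB/s")    # 950.5 KiB/s
```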
To visualize this in Wireshark, click on a packet for the destination traffic, then go to Statistics > TCP StreamGraph > Window Scaling Graph
Notice the small Window Size:
Now, click on the Throughput Graph:
So, how do you correct this? First thing, you want to make sure your TCP Windows are large enough to utilize bandwidth available. Let’s perform another calculation to determine the TCP Window size we need in order to fully utilize the 1Gbps link with 18ms RTT.
Bandwidth (bps) * RTT (seconds) = TCP window size (bits)
Bandwidth = 1 Gbps = 1,000,000,000 bps
RTT = 18 ms = 0.018 seconds
1,000,000,000 * 0.018 = 18,000,000 bits / 8 = TCP window size of 2,250,000 bytes
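Here is the bandwidth-delay product as a short Python sketch. It assumes 1 Gbps means 1,000,000,000 bps (link rates are normally quoted in decimal units), and the function name is hypothetical. Either way the answer comes out a little over 2 MB.

```python
def required_window_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return bandwidth_bps * rtt_seconds / 8


# Assumed inputs: a 1 Gbps link (decimal) and the 18 ms RTT between sites.
window = required_window_bytes(1_000_000_000, 0.018)

print(f"{window:,.0f} bytes")  # 2,250,000 bytes
```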
Wow, looks like a TCP window size of over 2 MB would be needed to achieve 1 Gbps throughput. The first knee-jerk reaction is to just go around increasing the TCP window size on all of your servers. This may work, but you need to be careful, as this could consume more system memory and cause even worse performance issues if Selective ACKs are not enabled.
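If you only want to experiment with one application rather than change every server, a receive buffer can also be requested per socket. This is a rough sketch and not what we did in this case: the 2,250,000-byte target is a hypothetical value roughly matching the window computed above, and the operating system is free to cap or adjust whatever you ask for, so always read the value back.

```python
import socket

# Hypothetical target, roughly the window needed for 1 Gbps at 18 ms RTT.
REQUESTED_RCVBUF = 2_250_000  # bytes

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    # Request a larger receive buffer for this socket only.
    # Set it before connecting or listening, so it can influence the
    # window scale negotiated during the handshake.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, REQUESTED_RCVBUF)

    # The OS may grant a different size than requested.
    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    print(f"requested {REQUESTED_RCVBUF:,} bytes, OS reports {granted:,} bytes")
```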
The good news is that RFC 1323 (http://tools.ietf.org/html/rfc1323) defines TCP Window Scaling, which lets the advertised window grow beyond the 65,535-byte limit of the 16-bit window field. It has been enabled by default since Vista / Server 2008, where the receive window is also tuned dynamically, and is one of the reasons why we witnessed optimal performance for some applications, but not others. If you’re running something older, visit Microsoft for some how-to’s to enable it: http://support.microsoft.com/kb/224829
With TCP Window Scaling enabled, we see a drastic change in our StreamGraph. Notice how it starts small, skyrockets and then hums around 1MB. This is completely dynamic based on observed latency thanks to RFC 1323’s bit shifting.
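To make the bit shifting concrete: the TCP header's window field is 16 bits, so without scaling it tops out at 65,535 bytes. The window scale option has both sides agree on a shift count (0 to 14) during the handshake, and the advertised value is shifted left by that count. The sketch below, with names of my own choosing, finds the smallest shift that can represent a given window.

```python
MAX_UNSCALED_WINDOW = 65_535  # largest value of the 16-bit TCP window field
MAX_SHIFT = 14                # largest shift count the window scale option allows


def min_window_scale_shift(target_window_bytes: int) -> int:
    """Smallest shift count whose maximum scaled window covers the target."""
    for shift in range(MAX_SHIFT + 1):
        if (MAX_UNSCALED_WINDOW << shift) >= target_window_bytes:
            return shift
    raise ValueError("target exceeds the largest window that scaling allows")


print(min_window_scale_shift(17_520))     # 0 (fits without scaling)
print(min_window_scale_shift(2_250_000))  # 6 (65,535 << 6 = 4,194,240 bytes)
```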
If you’re running primarily Microsoft, you can enable another feature called the Add-On Congestion Control Provider. This is also known as Compound TCP (CTCP) and essentially scales TCP windows more aggressively.
Check to see if it is enabled by opening a command prompt and typing:
netsh interface tcp show global
Add-On Congestion Control Provider will be set to “none” if it is disabled, and “ctcp” if enabled. To enable:
netsh interface tcp set global congestionprovider=ctcp
Now take a look at the aggressive TCP window scaling:
Helpful troubleshooting tools:
iperf / jperf – http://iperf.fr/ -also- http://openmaniak.com/iperf.php
Test throughput / packet loss
mturoute – http://www.elifulkerson.com/projects/mturoute.php
Validate MTU
robocopy – Built into Windows.
Use this instead of SMB when testing file transfers.
tcping – http://www.elifulkerson.com/projects/tcping.php
Probe TCP ports
winmtr – http://winmtr.net/download-winmtr/
Infinite traceroute/ping tool with record keeping
Thank you for the troubleshooting guideline. Even though it is not directly related to my searches I found the information very interesting and could hopefully one day put it to good use as well.
This is really great! And a great reminder that not all packet loss is the fault of the network.