No, really, it’s not the network.

I recently spent an unhealthy number of days troubleshooting performance issues between remote data centers.  Good thing I did, too, as I got a friendly reminder about TCP and how latency drives throughput.

We were seeing seemingly inconsistent network issues: some applications and file transfers were slow, some were fast, and some appeared to be slow in only one direction.  Jobs that used to run in minutes now took hours.  Packet captures showed possible signs of packet loss, such as duplicate ACKs.  Needless to say, we had some troubleshooting ahead of us to either locate the source of packet loss or rule out the network.

First, we verified basic health of the network:

  • Validate there is no congestion on links between endpoints
  • Validate interface counters – look for errors, CRCs, drops, etc.
  • Validate interface configurations – speed, duplex, MTU, etc.
  • Validate QoS stats – Are you seeing an unusually high number of drops in a particular queue?
  • Validate network device system resources – CPU, Memory, etc.
  • Validate the control plane – Are packets getting punted to the router CPU?

All good so far, so why are we seeing slow file transfers?  Throughput on some transfers is as low as 900 KBps.  We have a 1 Gbps link between sites with only 18 ms of latency (round-trip time, RTT), so we should have no issue with throughput!

Looking at the packet captures, we learned that the hosts were advertising a very small TCP window of 17,520 bytes, and it was not scaling.

small-tcp-window

This is a problem because of this very simple equation:

Receive buffer size (bits) / RTT (seconds) = Max TCP throughput (bps)

Let’s put this to work based on the observed Receive buffer size (TCP Window) and known RTT between sites:

(17,520 bytes * 8 bits) / 0.018 seconds = ~7,786,667 bps

Convert this to KBps (kilobytes per second: divide by 8, then by 1024) and we have roughly our observed throughput:  ~950 KBps
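The same arithmetic as a small Python sketch, in case you want to plug in your own window size and RTT. The function name is made up for illustration, and the KBps conversion assumes 1 KB = 1024 bytes, which is what gives the ~950 figure above.

```python
def max_tcp_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on single-connection throughput: window (bits) / RTT (seconds)."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes * 8 / rtt_seconds


# Values observed in the capture: 17,520-byte window, 18 ms RTT.
bps = max_tcp_throughput_bps(17_520, 0.018)
print(f"{bps:,.0f} bps")              # 7,786,667 bps
print(f"{bps / 8 / 1024:,.1f} KBps")  # 950.5 KBps
```

Note this is a ceiling for one TCP connection; it says nothing about what the link itself can carry.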

To visualize this in Wireshark, click on a packet for the destination traffic, then go to Statistics > TCP StreamGraph > Window Scaling Graph

Screenshot_4_19_14__5_45_PM

Notice the small Window Size:

tcp-window-smb_PNG

Now, click on the Throughput Graph:

slow-throughput

So, how do you correct this?  First thing, you want to make sure your TCP windows are large enough to use the available bandwidth.  Let’s perform another calculation to determine the TCP window size we need in order to fully utilize the 1 Gbps link with 18 ms RTT.

Bandwidth (bps) * RTT (seconds) = TCP window size (bits)

Bandwidth = 1 Gbps = 1,000,000,000 bps
RTT = 18 ms = 0.018 seconds

1,000,000,000 * 0.018 = 18,000,000 bits / 8 = TCP Window Size of 2,250,000 bytes
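Here is the bandwidth * RTT calculation as a minimal Python sketch. It assumes link rates are decimal (1 Gbps = 1,000,000,000 bps), which is how network interface speeds are normally quoted; the function name is hypothetical.

```python
def required_window_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Window needed to keep the link full: bandwidth (bps) * RTT (s), in bytes."""
    if bandwidth_bps <= 0 or rtt_seconds <= 0:
        raise ValueError("bandwidth_bps and rtt_seconds must be positive")
    return bandwidth_bps * rtt_seconds / 8


# 1 Gbps link (assumed to mean 10**9 bps) with an 18 ms round trip.
window = required_window_bytes(1_000_000_000, 0.018)
print(f"{window:,.0f} bytes")  # 2,250,000 bytes
```

Try it with your own RTT: the required window grows linearly with latency, which is why the same servers can look fine on the LAN and crawl across the WAN.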

Wow, looks like a TCP window size of over 2MB would be needed to achieve 1 Gbps throughput.  The first knee-jerk reaction is to just go around increasing the TCP window size on all of your servers.  This may work, but you need to be careful, as this could consume more system memory and cause even worse performance issues if Selective ACKs are not enabled.

Good thing is, RFC 1323 (http://tools.ietf.org/html/rfc1323) defines TCP Window Scaling, which lets the window grow past the 65,535-byte limit of the 16-bit window field.  This is enabled by default since Vista / Server 2008, where the receive window is also tuned dynamically, and is one of the reasons why we witnessed optimal performance for some applications but not others.  If you’re running something older, visit Microsoft for some how-tos to enable it:  http://support.microsoft.com/kb/224829

With TCP Window Scaling enabled, we see a drastic change in our StreamGraph.  Notice how it starts small, skyrockets and then hums around 1MB.  The window is now sized dynamically as the transfer runs, and RFC 1323’s bit shifting is what lets the advertised value exceed 64KB.

no-compound_png
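The bit shifting works roughly like this: the 16-bit window field in each segment is shifted left by a scale factor (at most 14) that each side announces in its SYN. The Python sketch below is only an illustration of that arithmetic, with made-up function names; it is not how any particular TCP stack picks its scale factor.

```python
MAX_WINDOW_FIELD = 0xFFFF  # largest value the 16-bit TCP window field can hold
MAX_SHIFT = 14             # largest shift count RFC 1323 allows


def scaled_window(window_field: int, shift: int) -> int:
    """Effective window in bytes: the 16-bit field shifted left by the scale factor."""
    if not 0 <= window_field <= MAX_WINDOW_FIELD:
        raise ValueError("window_field must fit in 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift must be between 0 and 14")
    return window_field << shift


def min_shift_for(target_bytes: int) -> int:
    """Smallest shift count whose maximum window covers target_bytes."""
    for shift in range(MAX_SHIFT + 1):
        if scaled_window(MAX_WINDOW_FIELD, shift) >= target_bytes:
            return shift
    raise ValueError("target exceeds the largest window scaling can express")


print(min_shift_for(17_520))                # 0 (fits without scaling)
print(min_shift_for(2_250_000))             # 6 (10**9 bps * 0.018 s / 8)
print(scaled_window(MAX_WINDOW_FIELD, 6))   # 4194240
```

So without scaling the window tops out just under 64KB, and a little over 2MB needs a shift of at least 6.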

If you’re running primarily Microsoft, you can enable another feature, the Add-On Congestion Control Provider.  This is also known as Compound TCP (CTCP), and it essentially ramps up the sender’s window more aggressively on high-latency links.

Check to see if it is enabled by opening a command prompt and typing:

netsh interface tcp show global

Add-On Congestion Control Provider will be set to “none” if it is disabled, and “ctcp” if enabled.  To enable:

netsh interface tcp set global congestionprovider=ctcp

netsh
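If you need to check this across a lot of servers, the lookup can be scripted. This is a rough Python sketch, not a tested tool: it assumes the `show global` output is laid out as "label : value" lines and that the label contains "Congestion Control Provider" (English-language Windows). Both function names are hypothetical.

```python
import subprocess
from typing import Optional

LABEL = "congestion control provider"  # assumed label text, matched case-insensitively


def parse_congestion_provider(netsh_output: str) -> Optional[str]:
    """Return the provider value (e.g. 'none' or 'ctcp'), or None if the line is absent."""
    for line in netsh_output.splitlines():
        label, sep, value = line.partition(":")
        if sep and LABEL in label.lower():
            return value.strip().lower()
    return None


def current_congestion_provider() -> Optional[str]:
    """Run the command from the post on a Windows host and parse its output."""
    result = subprocess.run(
        ["netsh", "interface", "tcp", "show", "global"],
        capture_output=True, text=True, check=True,
    )
    return parse_congestion_provider(result.stdout)
```

Call current_congestion_provider() on the server itself; anything other than "ctcp" means Compound TCP is not in use.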

Now take a look at the aggressive TCP window scaling:

compound_png

Helpful troubleshooting tools:

iperf / jperf – http://iperf.fr/  -also- http://openmaniak.com/iperf.php
Test throughput / packet loss

mturoute – http://www.elifulkerson.com/projects/mturoute.php
Validate MTU

robocopy – Built into Windows.
Use this instead of SMB when testing file transfers.

tcping – http://www.elifulkerson.com/projects/tcping.php
Probe TCP ports

winmtr – http://winmtr.net/download-winmtr/
Infinite traceroute/ping tool with record keeping
