5. Performance Tuning and Troubleshooting

5.1. Tuning

OK, now we are up and running, and we want to be running at warp factor nine. No such thing as too fast, right?

Linux networking is pretty robust, even a default installation with no "tuning". You may well not need to do anything else. But if your connection is not performing up to what you think it should be, then possibly there is a problem somewhere. This may be a more worthwhile approach than the pursuit of any magical "tweak".

A very rough guideline on what you might reasonably expect as a maximum sync rate, based on distance from DSLAM/CO:

There are many conceivable factors that could affect this one way or the other. Newer generations of DSL will surely improve this, as will related technologies like repeaters.

You will lose 10-20% of the modem's attainable sync rate to networking overheads (TCP, ATM, ethernet). So a 1500 Kbps connection is only going to realize about 1200-1350 Kbps or so of real world throughput. No tweaking is going to change the built-in protocol overheads. Also, if your service is capped at a lesser speed by your provider, then you can't get above that speed no matter what. And remember that there are numerous variables that can affect your loop/signal quality, and subsequently your speed (aka sync rate). Some of these may be beyond your control.

But there are a few things that you might want to look at.

5.1.1. TCP Receive Window

For many of us, a default Linux installation is going to give something close to optimum performance. Windows 9x users often get a big boost by increasing their TCP Receive Window (RWIN). But this is because it is too small to start with. This is just not the case with Linux where the default value is 32KB.

The exception here is if you routinely have to deal with a high latency connection. For instance, if your provider has a satellite uplink that is consistently adding unusual latency (250ms or greater?), then a larger TCP Window will likely help. For more on TCP Receive Window and related issues, look at http://www.psc.edu/networking/perf_tune.html.

The Receive Window is a buffer that helps control the flow of data. If set too low, it can be a bottleneck and restrict throughput. The optimum value depends completely on your bandwidth and latency, where latency is the average roundtrip time (RTT) to your typical destinations under typical conditions. This can be determined with ping. For example, the Linux default of 32KB is acceptable up to speeds of 2 Mbps with a typical latency of 125ms or so, or 1.0 Mbps with a latency of 250ms. Setting this value too high can also adversely affect throughput, so don't overdo it.
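To see where those numbers come from: a sender can have at most one window of data in flight per round trip, so the window divided by the RTT is a rough ceiling on throughput. A minimal sketch (the function name max_kbps is made up for this example, and it assumes 1 KB = 1024 bytes and 1 Kbps = 1000 bits/s):

```shell
# max_kbps WINDOW_KB RTT_MS
# Rough throughput ceiling for a given receive window and round-trip time.
# Bits per window divided by milliseconds per round trip gives Kbps.
# The result is truncated to a whole number.
max_kbps() {
    window_kb=$1
    rtt_ms=$2
    echo $(( window_kb * 1024 * 8 / rtt_ms ))
}

max_kbps 32 125   # 32KB window, 125ms RTT
max_kbps 32 250   # 32KB window, 250ms RTT
```

With a 32KB window this works out to roughly 2.1 Mbps at 125ms and about 1.0 Mbps at 250ms. If your sync rate is below the ceiling for your typical RTT, the window is not your bottleneck.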

An example courtesy of Juha Saarinen of New Zealand:

The commonly used formula for working out the tcp buffer is the "bandwidth delay product" one:

      Buffer size = Bandwidth (bits/s) * RTT (seconds)

In my case, I have roughly 8Mbps downstream, but the ATM network can only support ~3.5Mbps sustained. I'm far away from the rest of the world, so to squeeze in a sufficient amount of 1,500 byte packets, with average RTTs of 250ms, I should probably have a buffer of (3,500,000/8)*.25 = 106KB. (I've got 128KB at the moment, which works fine.)
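The same arithmetic as a small shell sketch, in case you want to plug in your own numbers. The function names are invented for this example; the input is the sustained bandwidth in bits/s and the RTT in milliseconds, and the division by 8 converts bits to bytes as in the example above:

```shell
# bdp_bytes BANDWIDTH_BPS RTT_MS
# Bandwidth delay product in bytes: (bits/s / 8) * (ms / 1000).
bdp_bytes() {
    echo $(( $1 * $2 / 8000 ))
}

# bdp_kb BANDWIDTH_BPS RTT_MS
# Same result in KB (1024 bytes), truncated.
bdp_kb() {
    echo $(( $(bdp_bytes "$1" "$2") / 1024 ))
}

bdp_bytes 3500000 250   # 3.5 Mbps sustained, 250ms RTT
bdp_kb 3500000 250
```

For 3.5 Mbps and 250ms this gives 109375 bytes, or 106KB, which agrees with the figure quoted above.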

The Receive Window can be dynamically set in the /proc filesystem. This requires entering a value that is twice the desired buffer size:


 # echo 262144 > /proc/sys/net/core/rmem_default
 # echo 262144 > /proc/sys/net/core/rmem_max

 

The above example actually sets the value to 128K. The Send Window can also be set, but is not as likely to be a limiting factor on DSL connections as the Receive Window:

 
 # echo 262144 > /proc/sys/net/core/wmem_default
 # echo 262144 > /proc/sys/net/core/wmem_max

 

These values can also be set using the sysctl command. See the man page.
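A rough sketch of the sysctl equivalent follows. The function name set_tcp_buffers and the APPLY variable are made up for this example. By default it only prints the commands so you can look them over; with APPLY=1 (and root privileges) it runs them. It follows the rule above of entering twice the desired buffer size:

```shell
# set_tcp_buffers SIZE_KB
# Print, or with APPLY=1 actually run, the sysctl commands that
# correspond to the /proc writes above. The value written is twice
# the desired buffer size, as described in the text.
set_tcp_buffers() {
    size_kb=$1
    value=$(( size_kb * 1024 * 2 ))
    for key in net.core.rmem_default net.core.rmem_max \
               net.core.wmem_default net.core.wmem_max; do
        if [ "${APPLY:-0}" = "1" ]; then
            sysctl -w "$key=$value"
        else
            echo "sysctl -w $key=$value"
        fi
    done
}

set_tcp_buffers 128   # dry run for a 128K buffer
```

Settings made this way, like the /proc writes, do not survive a reboot.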

Other suggested kernel options for those who want to squeeze every last bit out of that copper (selected entries only):


 # sysctl -a 
 net.ipv4.tcp_rfc1337 = 1
 net.ipv4.ip_no_pmtu_disc = 0
 net.ipv4.tcp_sack = 1
 net.ipv4.tcp_fack = 1
 net.ipv4.tcp_window_scaling = 1
 net.ipv4.tcp_timestamps = 1
 net.ipv4.tcp_ecn = 0

 

A brief description of these, and other, options may be found in /usr/src/linux/Documentation/networking/ip-sysctl.txt, in the kernel source directory.
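If you want to see how your own system compares with the list above, something like the following sketch may help. The function name check_suggested is invented here; it reads "key = value" lines (the format sysctl -a prints) on standard input and reports only the entries that differ from the values listed above. Keys your kernel does not have are simply not reported.

```shell
# check_suggested
# Read "key = value" lines on stdin and print every listed key whose
# value differs from the suggestion in the text.
check_suggested() {
    awk -F' *= *' '
        BEGIN {
            want["net.ipv4.tcp_rfc1337"] = 1
            want["net.ipv4.ip_no_pmtu_disc"] = 0
            want["net.ipv4.tcp_sack"] = 1
            want["net.ipv4.tcp_fack"] = 1
            want["net.ipv4.tcp_window_scaling"] = 1
            want["net.ipv4.tcp_timestamps"] = 1
            want["net.ipv4.tcp_ecn"] = 0
        }
        # Strip any leading whitespace before looking at the fields.
        { sub(/^[ \t]+/, "") }
        ($1 in want) && ($2 != want[$1]) {
            printf "%s is %s, suggested %s\n", $1, $2, want[$1]
        }
    '
}
```

Typical usage would be "sysctl -a 2>/dev/null | check_suggested". No output means everything on the list already matches.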

5.1.3. TCP Bottlenecks

DSL connections may suffer performance degradations under certain circumstances. Thankfully, Linux has very robust and flexible networking tools to help us deal with these.

One such common situation is where traffic bottlenecks are created whenever data from a fast network segment hits a slower one, such as ethernet hitting a DSL modem/router. This can cause short term traffic backlogs, known as "queues", in the device. Queuing can result in degraded performance, particularly for interactive protocols (like telnet or ssh) and streaming protocols (like RealAudio), and increased latency for ICMP and other network protocols. This is most evident when the upstream link is saturated (since downstream data is queued at the ISP's end and we can't do as much about that). Since there isn't any special prioritizing in default usage, lower volume traffic (like ssh) often gets drowned out, so to speak, by the higher volume, bulk traffic (like http or ftp).

And if the upstream queuing, or other factors, causes enough of a delay, it can even decrease downstream bandwidth utilization by slowing the ACKnowledgements (which are heading upstream) that are required to keep a download moving at optimal rates. So it is possible that an upload can hurt a simultaneous download.

Such effects can be largely mitigated with Linux's built-in traffic shaping abilities. The user space tool for manipulating the kernel's advanced traffic routing features is iproute, sometimes packaged as iproute2. This includes various tools that can classify and prioritize traffic with a considerable degree of flexibility. It also requires various kernel config options to be turned on, and it is fairly close to Black Magic ;-) The definitive document on this is the Advanced Routing and Traffic Control HOWTO (http://tldp.org/HOWTO/Adv-Routing-HOWTO.html). Pay particular attention to the "Cookbook" Section #15, and in particular #15.8, "The Ultimate Traffic Conditioner: Low Latency, Fast Up & Downloads". A great read!
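Just to give a taste of what this looks like, here is about the simplest possible case: a Token Bucket Filter that holds outbound traffic slightly below the uplink rate, so that the queue builds up in Linux rather than in the modem. This is only a sketch and not the cookbook script mentioned above (which also prioritizes interactive traffic). The function name shape_cmd, the ppp0 interface, the 256 kbit uplink and the 90% figure are all assumptions made for the example. The function only prints the command; run the printed line as root if it suits your setup.

```shell
# shape_cmd DEVICE UPLINK_KBIT
# Print a tc command that limits outbound traffic on DEVICE to about
# 90% of the uplink rate (the 90% margin is an assumption, adjust it).
shape_cmd() {
    dev=$1
    rate=$(( $2 * 90 / 100 ))
    echo "tc qdisc add dev $dev root tbf rate ${rate}kbit latency 50ms burst 1540"
}

shape_cmd ppp0 256
```

The rule can be removed again with "tc qdisc del dev ppp0 root". The kernel must have the relevant packet scheduler options enabled for any of this to work.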

5.2. Installation Problems

Read this section if you have no sync at all or are completely unable to connect. See your modem's owner's manual for interpreting the modem's LEDs. (Many will show a solid red (or orange) light if not in sync.)

5.2.1. No sync

The modem sync LED has never been green.

5.2.2. Network Card (NIC) Problems

Symptoms here are: NIC is not recognized, modules won't load, or ifconfig shows the interface is not up, or is generating lots of errors, etc.

5.2.3. IP Connection Problems

Read this section if you are sure the modem is syncing, the NIC is recognized and seems to be working properly, client software is installed and running without error, but the connection to the ISP fails. Verify the modem is indeed syncing by the LED(s). An IP connection failure may be evidenced by ifconfig not showing an active eth0 interface (or ppp0 for PPPoX), or pinging gateway and other destinations generates 'network unreachable' or similar errors.

5.3. Sync Problems

Read this section if you have had a working connection, but now have lost sync, are intermittently losing sync, your sync rate has dropped significantly, or are getting a "sync/no surf" condition. (Better quality modems will have a way to report sync rate, usually via telnet or a web browser interface. See the owner's manual.)

A loss of sync indicates a problem with the DSLAM, your line (inside or outside) or your modem. DSLAMs typically have "shelves" with "cards". Alcatel DSLAM cards, just for instance, have a capacity of four connections each. If the card goes bad, at most four customers are affected. The point being that sync loss outages can be very isolated, unlike network outages that tend to affect large numbers of users. Sync outages are a telco problem, not an ISP problem. If your service agreement is with the ISP, you will need to contact them, and they will in turn contact the telco.

Degraded sync rates, and disruption of the DSL signal, can cause various problems. Obviously, you will never get your maximum throughput under these conditions. But, the symptoms are not always obvious as to whether the problem is on your end or the provider's.

For instance, a poor inside wire connection may result in retransmissions of packets that have been dropped. This can really reduce throughput and slow a connection down. It is tempting to think of packet loss as a traditional networking problem, but with DSL it can also be the result of a bad line, an impaired signal, or even the modem itself.

Some things to try:

Another possibility is a nearby AM radio station, or a bandit ham radio operator, disrupting the DSL signal, since both operate in a similar frequency range. These may only cause problems at certain times of day, like when the station boosts its signal at night. A good telco DSL tech may be able to help minimize the impact of this. YMMV.

5.4. Network and Throughput Problems

Read this section if your connection is up, but you are having throughput problems. In other words, your speed isn't what it should be based on your bit rate plan, and your distance from the CO. "Network" here is the WAN -- the ISP's gateway and local subnet/backbone, etc. Remember that a marginal line can cause a reduced sync rate, and this will impact throughput. See above.

The two factors we will be looking for are "latency" and "packet loss". Both are pretty easy to track down with the standard networking tools ping and traceroute. If either of these occurs in our path, it will impact performance. Latency means "responsiveness" or "lag time". Actually what we are interested in is abnormally high latency, since there is always some latency. Packet loss is when a packet of data gets dropped somewhere along the way. TCP/IP will know it's been "lost", and there will be a retransmission of the lost data. Enough of this can really slow things down. Ideally packet loss should be 0%.

What we really need to be concerned about is that part of the WAN route that we routinely traverse. If you do a traceroute to several different sites, you will probably see that the first few "hops" tend to be the same. These are your ISP's local backbone, and your ISP's upstream provider's gateway. Any problem with any of this will affect everywhere you go and everything you do.
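One way to find those shared hops without squinting at the screen is to save a couple of traceroutes to files and compare them. The sketch below is only an illustration: the function name common_hops is made up, and it assumes the usual traceroute layout where each hop line starts with the hop number and shows the address in parentheses. Hops that did not answer (the "* * *" lines) are skipped rather than compared.

```shell
# common_hops FILE1 FILE2
# Print the leading hops (hop number and IP address) that two saved
# traceroute outputs have in common, stopping at the first difference.
common_hops() {
    awk '
        # Hop lines start with a number and show the address in parentheses.
        $1 ~ /^[0-9]+$/ && match($0, /\([0-9.]+\)/) {
            ip = substr($0, RSTART + 1, RLENGTH - 2)
            if (NR == FNR) {          # still reading the first file
                hop[$1] = ip
                next
            }
            if (!done && hop[$1] == ip)
                print $1, ip
            else
                done = 1
        }
    ' "$1" "$2"
}
```

For example, run "traceroute site-one > t1.txt" and "traceroute site-two > t2.txt" (any two sites will do), then "common_hops t1.txt t2.txt". Whatever it prints are the routers worth pinging regularly.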

We can start looking for packet loss and latency by pinging two or three different sites, hopefully in at least a couple of different directions. We will be looking for packet loss and/or unusually high latency.


 $ ping -c 12 -n www.tldp.org
 PING www.tldp.org (152.19.254.81) : 56(84) bytes of data.
 64 bytes from 152.19.254.81: icmp_seq=0 ttl=242 time=62.1 ms
 64 bytes from 152.19.254.81: icmp_seq=1 ttl=242 time=60.8 ms
 64 bytes from 152.19.254.81: icmp_seq=2 ttl=242 time=59.9 ms
 64 bytes from 152.19.254.81: icmp_seq=3 ttl=242 time=61.8 ms
 64 bytes from 152.19.254.81: icmp_seq=4 ttl=242 time=64.1 ms
 64 bytes from 152.19.254.81: icmp_seq=5 ttl=242 time=62.8 ms
 64 bytes from 152.19.254.81: icmp_seq=6 ttl=242 time=62.6 ms
 64 bytes from 152.19.254.81: icmp_seq=7 ttl=242 time=60.3 ms
 64 bytes from 152.19.254.81: icmp_seq=8 ttl=242 time=61.1 ms
 64 bytes from 152.19.254.81: icmp_seq=9 ttl=242 time=60.9 ms
 64 bytes from 152.19.254.81: icmp_seq=10 ttl=242 time=62.4 ms
 64 bytes from 152.19.254.81: icmp_seq=11 ttl=242 time=63.0 ms
 
 --- www.tldp.org ping statistics ---
 12 packets transmitted, 12 packets received, 0% packet loss
 round-trip min/avg/max = 59.9/61.8/64.1 ms

 

The above example is pretty normal from here. (You probably have a very different route to this site, and your results may thus be quite different.) Apparently there are no serious underlying problems that would slow me down. The example below reveals a problem:


 $ ping -c 20 -n www.debian.org
 
 PING www.debian.org (198.186.203.20) : 56(84) bytes of data.
 64 bytes from 198.186.203.20: icmp_seq=0 ttl=241 time=404.9 ms
 64 bytes from 198.186.203.20: icmp_seq=1 ttl=241 time=394.9 ms
 64 bytes from 198.186.203.20: icmp_seq=2 ttl=241 time=402.1 ms
 64 bytes from 198.186.203.20: icmp_seq=4 ttl=241 time=2870.3 ms
 64 bytes from 198.186.203.20: icmp_seq=7 ttl=241 time=126.9 ms
 64 bytes from 198.186.203.20: icmp_seq=12 ttl=241 time=88.3 ms
 64 bytes from 198.186.203.20: icmp_seq=13 ttl=241 time=87.9 ms
 64 bytes from 198.186.203.20: icmp_seq=14 ttl=241 time=87.7 ms
 64 bytes from 198.186.203.20: icmp_seq=15 ttl=241 time=85.0 ms
 64 bytes from 198.186.203.20: icmp_seq=16 ttl=241 time=84.5 ms
 64 bytes from 198.186.203.20: icmp_seq=17 ttl=241 time=90.7 ms
 64 bytes from 198.186.203.20: icmp_seq=18 ttl=241 time=87.3 ms
 64 bytes from 198.186.203.20: icmp_seq=19 ttl=241 time=87.6 ms
 
 --- www.debian.org ping statistics ---
 20 packets transmitted, 13 packets received, 35% packet loss
 round-trip min/avg/max = 84.5/376.7/2870.3 ms

 

High packet loss at 35%, and some really slow roundtrip times in there as well. A little digging on this showed that it was a backbone router 13 hops into the traceroute that was the problem. While making this site really slow from here, it would only affect those routes that happen to hit that same router. Now what would really hurt us is if something similar happened with a router that we tend to go through consistently, like our gateway, or maybe the second hop router too. Find these with traceroute, by just picking a random site:


 $ traceroute www.bellsouth.net
 
 traceroute to bellsouth.net (192.223.22.134), 30 hops max, 38 byte packets
  1  adsl-78-196-1.sdf.bellsouth.net (216.78.196.1)  14.86ms  7.96ms 12.59ms
  2  205.152.133.65 (205.152.133.65)                  7.90ms  8.12ms  7.73ms
  3  205.152.133.248 (205.152.133.248)                8.99ms  8.52ms  8.17ms
  4  Hssi4-1-0.GW1.IND1.ALTER.NET (157.130.100.153)  11.36ms 11.48ms 11.72ms
  5  125.ATM3-0.XR2.CHI4.ALTER.NET (146.188.208.106) 14.46ms 14.23ms 14.40ms
  6  194.at-1-0-0.TR2.CHI2.ALTER.NET (152.63.65.66)  16.48ms 15.69ms 16.37ms
  7  126.at-5-1-0.TR2.ATL5.ALTER.NET (152.63.0.213)  65.66ms 66.18ms 66.39ms
  8  296.ATM6-0.XR2.ATL1.ALTER.NET (152.63.81.37)    66.86ms 66.42ms 66.40ms
  9  194.ATM8-0.GW1.ATL3.ALTER.NET (146.188.233.53)  67.87ms 68.69ms 69.63ms
 10  IMVI-gw.customer.ALTER.NET (157.130.69.202)     69.88ms 69.25ms 69.35ms
 11  www.bellsouth.net (192.223.22.134)              68.74ms 69.06ms 68.05ms

 

The first hop is the gateway. In fact, for me the first two hops are always the same, and the first three or four are often the same. So a problem with any of these may cause a problem anywhere I go. (The specifics of your own situation may be a little different than this example.) A "normal" gateway ping (normal for me!):

 
 $ ping -c 12 -n 216.78.196.1
 
 PING 216.78.196.1 (216.78.196.1) : 56(84) bytes of data.
 64 bytes from 216.78.196.1: icmp_seq=0 ttl=64 time=14.6 ms
 64 bytes from 216.78.196.1: icmp_seq=1 ttl=64 time=15.4 ms
 64 bytes from 216.78.196.1: icmp_seq=2 ttl=64 time=15.0 ms
 64 bytes from 216.78.196.1: icmp_seq=3 ttl=64 time=15.2 ms
 64 bytes from 216.78.196.1: icmp_seq=4 ttl=64 time=14.9 ms
 64 bytes from 216.78.196.1: icmp_seq=5 ttl=64 time=15.3 ms
 64 bytes from 216.78.196.1: icmp_seq=6 ttl=64 time=15.4 ms
 64 bytes from 216.78.196.1: icmp_seq=7 ttl=64 time=15.0 ms
 64 bytes from 216.78.196.1: icmp_seq=8 ttl=64 time=14.7 ms
 64 bytes from 216.78.196.1: icmp_seq=9 ttl=64 time=14.9 ms
 64 bytes from 216.78.196.1: icmp_seq=10 ttl=64 time=16.2 ms
 64 bytes from 216.78.196.1: icmp_seq=11 ttl=64 time=14.8 ms

 --- 216.78.196.1 ping statistics ---
 12 packets transmitted, 12 packets received, 0% packet loss
 round-trip min/avg/max = 14.6/15.1/16.2 ms

 

And a problem with the same gateway on a different day:


 $ ping  -c 12 -n 216.78.196.1
 
 PING 216.78.196.1 (216.78.196.1) : 56(84) bytes of data.
 64 bytes from 216.78.196.1: icmp_seq=0 ttl=64 time=20.5 ms
 64 bytes from 216.78.196.1: icmp_seq=3 ttl=64 time=22.0 ms
 64 bytes from 216.78.196.1: icmp_seq=4 ttl=64 time=21.8 ms
 64 bytes from 216.78.196.1: icmp_seq=6 ttl=64 time=32.0 ms
 64 bytes from 216.78.196.1: icmp_seq=8 ttl=64 time=21.7 ms
 64 bytes from 216.78.196.1: icmp_seq=9 ttl=64 time=42.0 ms
 64 bytes from 216.78.196.1: icmp_seq=10 ttl=64 time=26.8 ms
 
 --- adsl-78-196-1.sdf.bellsouth.net ping statistics ---
 12 packets transmitted, 7 packets received, 41% packet loss
 round-trip min/avg/max = 20.5/25.6/42.0 ms

 

41% packet loss is very high, to the point where many services, like HTTP, come to a screeching halt. Those services that were working, were working very, very slowly.
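If you want to keep an eye on this over time (from a cron job, say), the loss figure can be pulled out of ping's summary line. A minimal sketch follows; the function name loss_percent is invented for this example. It recomputes the percentage from the transmitted and received counts, since the exact wording of the summary line varies a little between ping versions.

```shell
# loss_percent
# Read ping output on stdin and print the packet loss percentage,
# computed from the "N packets transmitted, M ... received" line
# and truncated to a whole number.
loss_percent() {
    awk '/packets transmitted/ {
        split($0, part, ", ")
        sent = part[1] + 0      # leading number of "12 packets transmitted"
        recv = part[2] + 0      # leading number of "7 packets received"
        if (sent > 0)
            print int((sent - recv) * 100 / sent)
    }'
}
```

Usage would be along the lines of "ping -c 12 -n 216.78.196.1 | loss_percent". Fed the summary from the bad gateway ping above (12 transmitted, 7 received), it prints 41.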

It's a little tempting on this last real-life example to think this gateway router is acting up. But, as it turned out, this was the result of a problem in the DSLAM/ATM segment of the telco's network. So any first hop problem with packet loss or high latency may actually be the result of something occurring before the first hop. We just don't have the tools to isolate well enough where it is starting. Packet loss can be a telco problem, just as much as an ISP/NSP problem.

It is also quite possible for the modem itself to cause packet loss. The fix here is to power cycle the modem, and resync by unplugging the DSL cable (from the wall jack) for 30 seconds or so. In fact, any part of the connection can be a source of packet loss -- modem, DSLAM, ATM network, etc.

If you do find a problem within your ISP's network, it's time to report the problem to tech support.

5.4.1. Miscellaneous Network Problems

Some odds and ends:

5.5. Measuring Throughput

One of the first things most of us do is check our speeds to make sure we aren't getting short changed, and that our system is up to snuff. Doing this accurately is easier said than done however. First, remember you are losing 10-20% right off the top due to networking protocol overhead. Just how much is "lost" here depends on your provider's network architecture, where and how you are measuring this and other considerations. Most of us may wind up being closer to 20% than 10%.

Then, any time you hit the Internet, there is some slight degradation of performance with each hop you take. Now this may not amount to much, as long as you are not taking too many hops and all the components -- your system, your ISP's network, your ISP's upstream provider, and the destination itself -- are all working like well oiled machines. But there's the rub -- how do you really know with so many variables in the mix? One flaky interface, on one router, on one hop along the path, may cause misleading results.

Your absolute max speed is going to be at your point of connection to your ISP -- the ISP's gateway. It can only go downhill from there, not up! So the ideal test is as close to home as possible. This eliminates as many unknown variables as possible. If your ISP has a local ftp server, this is an excellent place to run your own tests. (Run a traceroute though just to see how local it really is.)

If your ISP does not have this, look for an ftp site that is close -- the fewer the hops, the better. And look for one that isn't too busy, or you will get misleading results. Find a large file -- like 10 Megs -- and time the download. Try this over several days, and at different times of day. The server, and the backbone, are going to be busier at certain times of day, which can skew results and you want to eliminate these variables as much as possible. Your provider cannot compensate for heavy backbone traffic, backbone bottlenecks, slow or busy servers, etc.
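Turning a timed download into a speed figure is simple arithmetic: bytes times 8, divided by the elapsed time. A small sketch, with an invented function name and a made-up timing of 64 seconds for a 10,000,000 byte file:

```shell
# kbps BYTES MILLISECONDS
# Throughput in Kbps (1000 bits/s): bits transferred divided by
# elapsed milliseconds, truncated to a whole number.
kbps() {
    echo $(( $1 * 8 / $2 ))
}

kbps 10000000 64000   # hypothetical: 10,000,000 bytes in 64 seconds
```

That hypothetical download comes out at 1250 Kbps. Note that this measures the file data only, so the protocol overhead mentioned above is already "missing" from the result.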

There are many test sites scattered around the web. Some are better than others, but take these with a grain of salt. There are just too many variables for these tests to reliably give you an accurate snapshot of your connection and throughput. They may give you a general picture of whether you are in the ballpark of where you think you should be or not. One good speed test is http://www.dslreports.com/stest/0. Another test is http://speedtest.mybc.com/ (both are Java). I find these to be better than some of the others out there.

Now keeping in mind that we are limited by the ~10-20% networking overhead rule, here is an example. My speed is capped at a 1472 Kbps sync rate. Minus ~15%, that leaves about 1250 Kbps. My sync rate is known to be good and my distance to the CO is about 11,000 ft, which is close enough that I should be able to hit my real world maximum throughput of about 1250 Kbps, or roughly 1.2-1.3 Mbps -- all other things being equal. From the dslreports.com speed test:


 Test running..Downloaded 60900bytes in 5918ms
 Downloaded 696000bytes in 4914ms
 First guess is 1133kbps
 fairly fast line - now test 2mb
 Downloaded 1679100bytes in 11090ms
 Upload got ok 1 bytes uploaded
 Uploaded 1bytes in 211ms
 Upload got ok 1 bytes uploaded
 Uploaded 1bytes in 205ms
 Upload got ok 1 bytes uploaded
 Uploaded 1bytes in 207ms
 Upload got ok 50000 bytes uploaded
 Uploaded 50000bytes in 2065ms
 Upload got ok 100000 bytes uploaded
 Uploaded 100000bytes in 3911ms
 
 ** Speed 1211(down)/215(up) kbps **
 (At least 24 times faster than a 56k modem)
 Finish.

 

1.211 Mbps is probably about as good as I can realistically expect based on my service. There is no reason for me to go troubleshooting or looking for tweaks.
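To check your own result the same way, work out what 10% and 20% overhead leave you with and see whether the measured speed lands in between. A quick sketch (the function name expected_range is made up for this example):

```shell
# expected_range SYNC_KBPS
# Print the low and high real world throughput estimates in Kbps,
# assuming 20% and 10% protocol overhead respectively (truncated).
expected_range() {
    echo "$(( $1 * 80 / 100 )) $(( $1 * 90 / 100 ))"
}

expected_range 1472
```

For a 1472 Kbps sync rate this gives 1177 to 1324 Kbps, and the measured 1211 Kbps sits inside that range.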

Big Caution: my ISP uses a caching proxy server for web pages. This is a big equalizer for these kinds of web based tests. Without that, I surely would have been significantly slower on this test. The effect of the proxy is that you are actually testing throughput from the proxy -- NOT the test site. Just FYI. Another note: at the same time I tried another test site and was consistently getting 600-700 Kbps. So YMMV with these tests. (Usually I get the same on each, more or less.) Timing a large ftp download from two different sites, I calculated about 1.25 Mbps.