Network Diagnostics

From Pulsed Media Wiki
Revision as of 13:38, 29 December 2023 by Nucode

Network Diagnostics Comprehensive Guide

Network performance is crucial for the smooth operation of dedicated servers, but diagnosing network issues can be very challenging. This guide walks through effective network testing methods for identifying potential issues. Understanding the nature of the internet, a vast network with numerous interconnected routes and nodes, is key to recognizing why network issues are often outside the immediate control of your hosting provider.

Why In-Depth Network Testing is Essential

Internet's Complex Nature

  1. The internet is a complex web of interconnected networks. Routes commonly experience disruptions due to maintenance, outages, or heavy traffic (congestion).
  2. Always broken: the internet is always broken somewhere, all the time. Fixes can sometimes take months or even years to implement, at extreme cost. Patience is required.
  3. Different connections follow different paths across the network, leading to varied experiences for different users.
  4. Every connection can involve tens of thousands of components; an issue with even one of them can cause disruptions.

Identifying the True Source of the Issue

  1. In most cases (over 90%), reported network issues are not directly related to the server or the hosting provider.
  2. Problems often lie in the route your data takes through the internet, which involves third-party networks. Third-party networks are not under the control of your hosting provider.
  3. Network issues are notoriously difficult to diagnose, so proper testing is a must. Do not waste a network engineer's time with "no work" or "bad speed" messages.

No Data; No Solution is Possible

Without data, no solution is possible. The onus is always on the user to do the initial basic testing; every large network operator will ignore requests that lack exact, well-collated data and a synopsis of the issue. Reports along the lines of "xyz is slow!" typically go to /dev/null.

Do not waste a networking professional's time without first doing the basic testing. If more information is requested, never ignore the request; just complete the testing. Otherwise your report will again go to /dev/null.

Even when the issue is finally identified, it is most likely still going to be ignored if the report comes from a third party. Do not ask or hold your hosting provider responsible for fixing other operators' networks. If your testing locates the issue somewhere other than your hosting provider, such as at your ISP, contact that operator directly.

Understanding Key Network Metrics: Jitter, Ping, Packet Loss, and Routes

For comprehensive network diagnostics, it’s essential to understand the significance of various network metrics. Each of these metrics offers valuable insights into the quality and reliability of a network connection.

Ping (Latency)

  1. Ping, or latency, measures the time it takes for data to travel from the source to the destination and back.
  2. It's crucial to have low latency for activities requiring real-time response, such as video conferencing and gaming.
  3. Latency directly limits achievable network throughput because of TCP window sizes and the delay of TCP ACK packets. The higher the latency, the more data needs to be in flight to keep the link full.
  4. The ping tool measures latency using ICMP Echo packets.
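
The relationship between latency and throughput can be made concrete with a little arithmetic. The sketch below assumes the usual upper bound for a single TCP connection, window size divided by round-trip time; the function name and the example window sizes are illustrative assumptions, not values from any particular server.

```shell
# Rough upper bound for one TCP connection: window size / round-trip time.
# Arguments: window size in bytes, RTT in milliseconds. Prints Mbps.
max_tcp_mbps() {
    awk -v win="$1" -v rtt="$2" \
        'BEGIN { printf "%.2f\n", (win * 8) / (rtt / 1000) / 1000000 }'
}

max_tcp_mbps 65536 100      # 64 KiB window at 100 ms RTT
max_tcp_mbps 4194304 100    # 4 MiB window at 100 ms RTT
```

With the same window, doubling the round-trip time halves the achievable rate, which is why long-distance transfers need far more data in flight.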

Jitter

  1. Jitter refers to the variation in delay (latency, "ping") of received packets.
  2. High jitter can cause issues in real-time applications like VoIP or online gaming, where consistent timing is crucial.
  3. Jitter is measured by observing the time difference between successive packets. Consistent packet timing leads to low jitter, which is desirable for a stable connection.
  4. High jitter on a high-latency link may lead to TCP retransmission events and TCP window issues, resulting in slow overall throughput.
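
As a rough illustration of how jitter can be derived from ping results, the sketch below averages the absolute difference between successive round-trip times. This is one simple definition among several; the function name and the sample values are made up for the example.

```shell
# Mean absolute difference between successive RTT samples.
# Input: one RTT value (ms) per line on stdin.
jitter_ms() {
    awk '
        NR > 1 { d = $1 - prev; if (d < 0) d = -d; sum += d; n++ }
        { prev = $1 }
        END { if (n > 0) printf "%.2f\n", sum / n; else print "n/a" }
    '
}

# Made-up samples in milliseconds.
printf '%s\n' 20 22 21 25 | jitter_ms
```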

Packet Loss

  1. Packet loss occurs when packets of data being transmitted across a network fail to reach their destination completely.
  2. It can be caused by network congestion, hardware failures, or signal degradation.
  3. High packet loss leads to interruptions and degradation in service quality, particularly affecting streaming, downloading, and online gaming.
  4. Packet loss causes TCP to shrink its window, resulting in low throughput. The higher the latency, the greater the effect.
  5. Typically tested with ping. Intermediate internet routers may show packet loss of their own, since ICMP Echo packets are handled at the lowest priority by the router's CPU.
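
To get a feel for how loss and latency combine, the sketch below uses the well-known Mathis et al. approximation for loss-limited TCP throughput, rate <= (MSS / RTT) * (1.22 / sqrt(p)). This formula is not from this guide; it models classic Reno-style congestion control, so treat the numbers as orders of magnitude only. The function name and inputs are illustrative.

```shell
# Mathis approximation (an assumption here, valid for Reno-style TCP only):
#   rate <= (MSS / RTT) * (1.22 / sqrt(loss))
# Arguments: MSS in bytes, RTT in milliseconds, loss in percent. Prints Mbps.
mathis_mbps() {
    awk -v mss="$1" -v rtt="$2" -v loss="$3" 'BEGIN {
        printf "%.2f\n", (mss * 8) / (rtt / 1000) * 1.22 / sqrt(loss / 100) / 1000000
    }'
}

mathis_mbps 1460 20 1     # 1% loss at 20 ms RTT
mathis_mbps 1460 200 1    # 1% loss at 200 ms RTT
```

The same 1% loss costs ten times more throughput at ten times the latency, which matches the point that higher latency amplifies the effect of loss.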

Routes

  1. The path or route taken by data packets across the network significantly affects overall network performance.
  2. Data can traverse multiple routers and networks, each potentially impacting speed and reliability.
  3. Understanding the routes helps in pinpointing where delays or packet losses are occurring, especially when troubleshooting network issues.
  4. Routes commonly differ in each direction, so testing both ways is important.
  5. A network provider cannot really affect which way a third party sends its packets (the route); this affects downstream latency and throughput.
  6. A network provider can control the route by which packets leave towards a third party; this affects upstream latency and throughput.
  7. Routes are typically dynamic and can change often; they are rarely optimized manually per target.

By monitoring and analyzing these metrics, users can better understand the health and performance of their network connections. This knowledge is essential for diagnosing issues and optimizing network performance.


Testing methodologies

Speed Tests aka Throughput; Inherently Unreliable Testing

Speed tests, while popular, can be an unreliable measure of network health. Network conditions constantly fluctuate, and third-party testing servers are often very busy and have limits of their own. Speeds over 1Gbps are especially difficult to measure, since most third-party testing servers are 10Gbps at most.

Here are the key reasons why relying on speed tests alone is not advisable:

  1. Third-Party Networks: Speed tests often involve data traveling through networks outside the control of your hosting provider. These third-party networks can have varying performance due to their own traffic management policies and network health.
  2. Transits and Peerings: The path data takes typically goes through several links and networks, each potentially affecting speed and performance. The complexity of these routes means that a speed test to one location will yield completely different results compared to another, even if both are equidistant.
  3. Inconsistent Results Across Different Tests: Due to the complexities of internet routing, different speed tests can yield varying results. Each test may involve data traveling through distinct paths, encountering unique network conditions along the way.
  4. Network Variability: The internet's network conditions are in constant flux. This variability can result from traffic congestion, maintenance activities, and outages, all of which can temporarily impact speed test results.
  5. Third-Party Server Limitations: Most speed test results depend on the performance of third-party servers. These servers can be busy or limited in capacity, especially for high-speed connections. The majority of third-party testing servers have a maximum capacity of 10Gbps, shared between everyone testing at the same time, which can be insufficient for accurately measuring speeds over 1Gbps.
  6. Indication of Server Performance: Despite their limitations, if at least one or a few speed tests show good speeds, it's a strong indication that the server itself is functioning properly. Consistently high speeds in multiple tests, especially from different testing platforms, further reinforce this.
  7. Client-Side Factors: The accuracy of speed tests can also be influenced by factors on the user's end, such as local network issues, the performance of the testing device, and the browser or application used for the test. The most typical culprit is Wi-Fi. If you are using Wi-Fi, start diagnosing from there. Experience shows that "to home" speed issues are almost always due to Wi-Fi.
  8. Limited Scope of Testing: Speed tests primarily measure bandwidth and latency but do not provide comprehensive insights into other critical aspects of network performance, such as packet loss, jitter, and the stability of the connection over time.
  9. User's Server Configuration: The configuration of your server plays a crucial role in network performance. Non-standard kernel configurations, especially those related to TCP window sizes and MTU, can significantly skew test results. The TCP window size is vital because it determines how much data can be sent before an acknowledgment is required. At higher latency the impact of an incorrectly set window size becomes more pronounced, potentially leading to reduced throughput and performance issues.
  10. LACP/LAG, Link Aggregation: Links are typically built by aggregating multiple links, for example a bunch of 10G links to build a 100G link. This is done for cost reasons, but it limits the maximum performance per target or per connection, because a single connection travels over only one member link. For example, a 100G link built from 10x10G might be at 50G utilization, leaving each 10G member half full, so a single connection can only reach about 5G. This is normal operation.
  11. Guaranteed Does Not Exist In Reality: There is no such thing as guaranteed throughput. It is impossible to provide to any and all network targets; it would require full control of all devices and every piece of software, plus individual fiber to and from every target.
  12. Traffic Shaping: Many ISPs are known to do traffic shaping in various ways. Sometimes it is refusing to upgrade a peering or transit link; sometimes it is protocol based, sometimes user based, and so on.
  13. Correct Expectations: You cannot expect a 56k dialup line in rural India to send or receive at speeds in excess of 56k. Throughput beyond 1Gbps on a single connection is difficult to achieve, and beyond 10Gbps nigh impossible. See LACP/LAG, Link Aggregation above.
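
The link aggregation example from the list above can be worked through as simple arithmetic. The sketch assumes existing traffic is spread evenly across the member links and that one connection is hashed onto exactly one member; real hashing is rarely that even, and the function name is made up for the example.

```shell
# Headroom left for a single connection on an aggregated link (LAG).
# Assumes an even spread of existing traffic and one flow per member link.
# Arguments: number of members, speed per member (Gbps), total utilization (Gbps).
lag_flow_headroom_gbps() {
    awk -v n="$1" -v speed="$2" -v used="$3" \
        'BEGIN { printf "%.1f\n", speed - used / n }'
}

# 10 x 10G members carrying 50G in total.
lag_flow_headroom_gbps 10 10 50
```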

Speed tests can offer some insights into network performance, but they should only be used as part of a broader diagnostic strategy. For a more accurate assessment of network health, combining speed tests with other tools like MTR analysis is recommended. This approach helps in identifying whether network issues are indeed related to the server or if they lie elsewhere in the complex web of internet connectivity.

Speed Testing Tools

There are a lot of tools for testing; these are the most common.

Yabs.sh

A popular tool for general server performance testing. While limited, this is what a lot of people run by default. Since the tests are the same everywhere, it gives a decent indication of relative server performance, for the most part. It is well known that the limited number of network speed test servers used by yabs.sh are often congested at this time.

Run yabs.sh:

wget -qO- yabs.sh | bash

network-speed.xyz

This is like yabs.sh but dedicated to network tests only. It runs a larger number of tests and allows choosing a region, among other options.

Run network-speed.xyz test:

wget -qO- network-speed.xyz | bash

Speedtest by Ookla

The Speedtest CLI by Ookla, or the speedtest.net website, is another very common tool. It tests against a single server only, the closest one it can find. Thanks to this geolocation awareness it actually gives one of the more reliable results, since the server is nearby. These servers are sometimes congested as well, so if the first one gives you bad results, test against several; it could be that the particular test server is congested.

Do not use the Python version (speedtest-cli); it is known to have measurement issues.

Iperf3

The best tool for measuring point-to-point throughput. It is heavily optimized and has many options for various testing methods and parallelism.

Installing it on Debian / Ubuntu and starting a server is simple:

apt install -y iperf3
iperf3 -s

Performing a test from the other end:

iperf3 -c [Server IP Address]
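
A few iperf3 client variations are worth running together, since a single stream, parallel streams, and the reverse direction can behave very differently. The sketch below only prints the commands by default; the `run` helper, the `DRY_RUN` variable, and the address (a documentation-range placeholder) are assumptions for the example, while `-t`, `-P`, and `-R` are standard iperf3 client options.

```shell
# Placeholder address from the documentation range; replace with your server.
SERVER="${SERVER:-192.0.2.10}"

# Hypothetical helper: print commands by default, run them when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run iperf3 -c "$SERVER" -t 30          # single stream, client to server, 30 s
run iperf3 -c "$SERVER" -t 30 -P 8     # 8 parallel streams
run iperf3 -c "$SERVER" -t 30 -R       # reverse direction, server to client
```

Comparing the single-stream and parallel results shows whether a limit is per connection or applies to the whole link.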

Network Diagnostics; Route, Ping, Jitter

Tools

The tools for basic network diagnostics are commonly the same everywhere. Others exist, but these are the ones most typically used.

MTR, WinMTR

This is the most important and essential tool; see Network Troubleshooting with MTR for a comprehensive guide. If you suspect an issue, always run an MTR test both ways, with a minimum of 1000 packets.

MTR gives you all typical information; Latency, Packet Loss, Jitter, Route -- all in one.
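
When reading an MTR report, the loss figure that matters is the one at the final hop, because intermediate routers often deprioritize replies to probes. The sketch below pulls that figure out of a text report such as one produced by `mtr -rwc 1000 [server ip or hostname]`. The parsing assumes the plain report layout without the show-IPs option, and the sample report is entirely made up, using documentation IP ranges.

```shell
# Summarize an mtr text report read from stdin: loss at the final hop
# versus the worst loss reported by any single hop.
mtr_loss_summary() {
    awk '
        $1 ~ /^[0-9]+\.[|]--$/ {
            loss = $3; sub(/%/, "", loss)
            if (worst == "" || loss + 0 > worst + 0) worst = loss
            last = loss
        }
        END { printf "end-to-end loss: %s%% (worst single hop: %s%%)\n", last, worst }
    '
}

# Made-up sample report, for illustration only.
mtr_loss_summary <<'EOF'
HOST: example-server              Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.0.2.1                  0.0%  1000    0.4   0.5   0.3   2.1   0.2
  2.|-- 198.51.100.7              12.5%  1000    1.9   2.2   1.7   9.8   0.9
  3.|-- 203.0.113.20               0.0%  1000   11.3  11.6  11.1  15.0   0.5
EOF
```

Here hop 2 shows loss but the final hop does not, so nothing is actually being lost end to end.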

Ping

Tests latency between two endpoints. A simple, basic test to quickly check whether you get a response. This should be run for 1000+ packets, or preferably for hours and hours, to gather long-term average data. You may lower the time between pings or run multiple pings concurrently.

To test, Linux or Windows:

ping [server ip or hostname, e.g. google.com]

On Linux, ping runs until interrupted with Ctrl+C. On Windows it stops after four packets unless given -t (run until stopped) or -n [count].

Traceroute

Traces the route to the other server: which network hops it has and how the packets are being routed _to_ the target. This only tests the direction towards the target; to get the complete picture you need to run it from the target as well, just like MTR.

To test, Linux;

traceroute [server ip or hostname, ie. google.com]

To test, Windows;

tracert [server ip or hostname, ie. google.com]

Effective Testing; How to get actual results

Determine if this is an actual issue and set expectations

The following, at least, are not actual issues:

  1. I have 56k dialup, I'm not getting 10Gbps download! 😡 - Reality check: Time travel isn't a feature yet! (Give us a few more moments to work on that!)
  2. Shared service, entry-level service, shared/fairshare connections: getting only, say, 300Mbps per connection max. It's not a bug; it's a budget!
  3. One Slow Target: 1 target out of 1000 is slower than others, it's likely an issue with that particular target, not your connection.
  4. Seedbox swarm has no leechers
  5. Seedbox swarm has leechers but only getting low speeds (you cannot force data to anyone!)
  6. Intergalactic Connectivity Woes: My friend in Antarctica/North Korea/space station/Moon/Mars/Venus is only getting 5Kbps. Joke, right? Well, at least he's getting a connection. How did he get to Mars?
  7. Mobile and Wireless Limitations: You are on 4G, 5G or Wi-Fi.
  8. Speed Overload: Getting over 1Gbps on a single connection / thread is more than acceptable (as of 12/2023).
  9. Cannot do 100% Link 24/7/365, Did not buy dedicated link. Remember, it's a shared road, not a private racetrack.
  10. I paid 5€ for my service, not getting 10Gbps 24/7/365, YOU SCAMMER! The only scam here is expecting caviar at fast-food prices!
  11. From this single random provider in Uzbekistan/Mongolia/Zimbabwe/Papua New Guinea/etc. I can only pull X speed! (but everything else is fast)
  12. Patience is Key: Issue has persisted less than 48hrs and is not complete connection loss (low throughput, jitter, dropouts). The internet is a living, evolving entity.
  13. Misunderstanding Capacity: Can Only Get 100MiB/s on my 1Gbps Link, Not even Close To 1000MiB/s! You Fraud! - A gentle reminder: Bits and bytes are different; your math might need recalibrating! A 1Gbps link tops out at roughly 119MiB/s.
  14. Selective Slowdowns: If only one random provider in a remote location is slow but everything else is fast, it's likely an issue specific to that route or provider.
  15. MTR Shows packet loss at Hop 5 out of 9: That's not an issue; remember that routers regularly drop ICMP Echo packets. To count, packet loss has to persist from one hop onwards all the way to the final target.
  16. Speed of Light Isn't Fast Enough: I clicked and it didn't happen instantly! - Unless you've discovered faster-than-light travel, a tiny delay is normal.
  17. I Want My Video Stream... NOW!: My streaming video buffered for a whole two seconds! - Patience, young Jedi. Even the Force needs a moment to load.
  18. It Works on My Friend's Cousin's Neighbor's Network (or other network XYZ): But this one random person I know gets better speeds! - And I have a friend who claims they saw a unicorn... Every network is different, just because network A is fast from network B does not mean network C behaves the same.
  19. Peak Hours? What's That?: Why is my transfer slow at 20:00?? Probably for the same reason the highway is slow at 16:30. It's rush hour.
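
The bits-versus-bytes confusion from item 13 is easy to check with a quick conversion. This sketch converts a line rate in Gbps (decimal, bits) into MiB/s (binary, bytes) and ignores protocol overhead, so real transfers land somewhat lower; the function name is made up for the example.

```shell
# Convert link speed in Gbps (10^9 bits/s) to MiB/s (2^20 bytes/s).
gbps_to_mibs() {
    awk -v g="$1" 'BEGIN { printf "%.1f\n", g * 1000000000 / 8 / 1048576 }'
}

gbps_to_mibs 1     # theoretical ceiling of a 1Gbps link
gbps_to_mibs 10    # theoretical ceiling of a 10Gbps link
```

Seeing 100MiB/s on a 1Gbps link therefore means the link is running at well over 80% of its theoretical ceiling.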

Bandwidth at provider level (IP transit) is actually extremely expensive, sometimes hundreds of times more expensive than you think. Providers pay much more for bandwidth because it is meant to be used at a high level 24/7, with SLA and support contracts, and because the networking hardware costs vastly more than home or small-office hardware. So set your expectations based on the level of service you actually bought. If you paid 40€ for a year of shared, entry-level 10Gbps, you really shouldn't expect to push petabytes of data each month, just to be able to burst at that level most of the time. This is why a dedicated 10Gbps link may cost you 750€ a month, and even then it is dedicated only up to the provider's network edge at most.

Contention ratios exist everywhere in networking. Everything is "oversold", especially at ISPs, where contention ratios may go even above 1000:1, e.g. 10 000 x 1Gbps links sold against only 10Gbps of bandwidth (home users use very, very little bandwidth in reality).
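
The contention ratio in the example above is simply the total capacity sold divided by the capacity actually available. A minimal sketch of that arithmetic, with a made-up function name:

```shell
# Contention ratio: total sold capacity / available uplink capacity.
# Arguments: number of links sold, speed per link (Gbps), uplink (Gbps).
contention_ratio() {
    awk -v n="$1" -v s="$2" -v up="$3" 'BEGIN { printf "%d:1\n", n * s / up }'
}

# 10 000 x 1Gbps links sold against 10Gbps of bandwidth.
contention_ratio 10000 1 10
```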

While actual issues are important to solve, we should also recognize what is not actually an issue. Most often your hosting provider cannot do anything about it anyway; the issue almost always resides somewhere else.

First Steps

Before you call in the cavalry, try these simple steps. Often, they're all you need to solve the issue.


Connection issue to Home/Office

  1. Make sure you are on wired connection. Wireless is like weather – rather unpredictable.
  2. Reset your modem/gateway/router. It's cliché, but it works more often than you'd think.
  3. Check your network cables are actually attached (and intact).
  4. Check if this is persistent 24/7. Is the issue there all the time, or does it come and go like a mysterious ghost?
  5. Try with different ISP/Connection as well, does it persist? If the problem persists, it's likely not your primary connection's fault.

Connection issue to Mobile (4G, 5G, etc)

  1. See Connection issue to Home/Office
  2. Change location or wait until past midnight. Issue solved? Then the mobile network is the bottleneck.
  3. Get a proper wired connection or contact your carrier.

Dedicated Server / VPS

  1. Make sure you don't have "optimizations" enabled (sysctl/kernel tuning). Revert them to defaults.
  2. Make sure you are not using a lot of bandwidth while testing (check with e.g. vnstat or bwm-ng).
  3. Make sure you don't have heavy tasks running.
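
One quick way to spot leftover "optimizations" is to list every active setting in the sysctl configuration files. The sketch below prints all lines that are neither blank nor comments; the function name is made up, and the two paths shown are the common locations on Debian / Ubuntu (an assumption, since other directories such as /usr/lib/sysctl.d may also apply on your system).

```shell
# Print active (non-blank, non-comment) lines from the given sysctl files.
# Lines starting with '#' or ';' are comments in sysctl.conf syntax.
list_sysctl_overrides() {
    cat "$@" 2>/dev/null | grep -Ev '^[[:space:]]*([#;]|$)' || true
}

# Assumed common locations; an empty result means nothing is overridden there.
list_sysctl_overrides /etc/sysctl.conf /etc/sysctl.d/*.conf
```

Anything this prints that touches TCP buffers, congestion control, or MTU is worth reverting before testing.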

Write it down; Take notes; Proper Reports

Collect your data, yes, that means opening notepad! (or kate, or gedit! your favorite text editor)

After all tests are done, you may format the data and make the support request. Make certain the formatting is easy to read and that the highlights of the testing are at the top. A synopsis of 1-3 short sentences is important, so whoever reads your report knows immediately what to look for.

No one will read an incomprehensible, messy text and try to decipher what is what. No one wants to open random images without a reason, and especially no one will open random links to random sites.

Sometimes screenshots can be easier (e.g. WinMTR), so feel free to save them, but do not expect anyone to look at them instead of the text data.

Testing Processes

The process changes depending on what you are experiencing; here are some common ones.

Generic throughput issues consistently; No specific target / Multiple Targets

  1. Make sure your server is not already at high network utilization.
  2. Repeat the tests multiple times, at multiple times of the day, to determine whether the time of day affects the results.
  3. Do at least several types of tests (yabs.sh alone is not sufficient, and will be ignored).
  4. Do network-speed.xyz and iperf3 tests.
  5. Determine if this is an actual issue. Getting "only" 900Mbps on a 1Gbps link is not an issue. One specific network being slow for 30 minutes is not an issue. A high-latency link not reaching 75%+ of link speed on a single thread is not an issue.
  6. Supply the IP of your server, with an iperf3 server running on it, when opening a support request with your provider.

Throughput speed jitter; Specific target

This is when speeds are fine, but then all of a sudden plummet and slowly ramp back up. This happens after a TCP retransmit event, or even packet loss. The goal is to determine whether there is jitter and/or packet loss, and whether it is widespread or limited to a single target. MTR can also show route flapping.

  1. Do MTR both ways while the issue is happening.
  2. If the issue is intermittent, do MTR both ways while it is not happening as well. Mark down the times of day.
  3. Do throughput testing from the server via network-speed.xyz _while_ it is happening and _while it is not happening_.
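
Marking down the time of each test is easier when every line of output is stamped automatically. The sketch below is one way to do that; the `stamp` helper and the notes file name are assumptions for the example, and UTC is used so that reports from both ends line up.

```shell
# Prefix every line read from stdin with the current UTC time.
stamp() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$line"
    done
}

echo "issue happening now, starting MTR both ways" | stamp
# Example with a real test (not run here):
#   mtr -rwc 1000 [target] | stamp >> notes.txt
```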

Connections drop randomly

We are looking for periods of heavy packet loss here, or total loss of connectivity.

  1. Run MTR and/or ping against the target continuously, preferably both ways.
  2. Ping the gateway (gateways regularly show packet loss, but if yours doesn't, it is an indicator that the connection _to_ the gateway works).
  3. Try to measure how long the connection drops for.
  4. If this is a server with Pulsed Media, request support to check your server ip latency & packet loss graphs (smokeping).
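
To estimate how long the connection drops for, the gaps in the sequence numbers of a long-running ping can be counted. The sketch below assumes Linux (iputils) ping output, where each reply line contains `icmp_seq=N`; with the default one-second interval, the longest run of missing replies is roughly the outage length in seconds. The function name and sample lines are made up, and the address is a documentation-range placeholder.

```shell
# Longest run of consecutive missing replies in Linux ping output (stdin).
longest_loss_run() {
    awk '
        /bytes from/ && match($0, /icmp_seq=[0-9]+/) {
            seq = substr($0, RSTART + 9, RLENGTH - 9) + 0
            if (seen && seq - prev - 1 > max) max = seq - prev - 1
            prev = seq; seen = 1
        }
        END { print max + 0 }
    '
}

# Made-up sample: replies 4, 5 and 6 are missing.
longest_loss_run <<'EOF'
64 bytes from 192.0.2.10: icmp_seq=1 ttl=55 time=20.1 ms
64 bytes from 192.0.2.10: icmp_seq=2 ttl=55 time=20.3 ms
64 bytes from 192.0.2.10: icmp_seq=3 ttl=55 time=19.9 ms
64 bytes from 192.0.2.10: icmp_seq=7 ttl=55 time=21.0 ms
64 bytes from 192.0.2.10: icmp_seq=8 ttl=55 time=20.2 ms
EOF
```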

No Connection At All (100% Packet Loss)

The server has probably crashed. 100% packet loss for 15+ minutes is alone more than sufficient for opening a ticket; no further data is needed.

Submit a Report

So this is an actual issue, which keeps persisting? Time to write it up and contact support!

At the start of the ticket, write a summary of the issue and the supporting evidence. Use concise and clear text, nO MeSsYWriT1ng here. Include why you think there's an actual issue your provider can actually do something about, not just some ephemeral random internet shenanigans.

If the issue is not clearly at your hosting provider, check where you should actually submit the information. Your local ISP? Another hosting provider? In which network does the issue actually reside?

Finally, double check you did all the testing and made it as easy as possible for the network engineer to analyze and check if it warrants further inspection.

Be patient; these issues sometimes take years to resolve (!!). Networking gear is expensive, and sometimes it's politics (Comcast, Cogent, Netflix, Net Neutrality ...). A single 100Gbps link may cost hundreds of thousands of euros to install; it all depends on whether the equipment is already there, how long the fiber runs are, etc.