Difference between revisions of "Network Diagnostics"

From Pulsed Media Wiki

Revision as of 14:10, 29 December 2023

Network Diagnostics Comprehensive Guide

Network performance is crucial for the smooth operation of dedicated servers. However, diagnosing network issues can be very challenging. This document aims to guide you through effective network testing methods to identify potential issues. Understanding the nature of the internet - a vast network with numerous interconnected routes and nodes - is key to recognizing why network issues are often not within the immediate control of your hosting provider.

Why In-Depth Network Testing is Essential

Internet's Complex Nature

  1. The internet is a complex web of interconnected networks. It's common for routes to experience disruptions due to various factors like maintenance, outages, or heavy traffic (congestion).
  2. Always Broken: The Internet is always broken somewhere, all the time. Fixes can sometimes take months or even years to implement, at extreme cost. Be patient.
  3. Different connections follow different paths across the network, leading to varied experiences for different users.
  4. Every connection can potentially involve tens of thousands of components; an issue with even one of them can cause disruptions.

Identifying the True Source of the Issue

  1. In most cases (over 90%), reported network issues are not directly related to the server or hosting provider.
  2. Problems often lie in the route your data takes through the internet, which involves third-party networks. Third-party networks are not under control of your hosting provider.
  3. Network issues are notoriously difficult to diagnose, so proper testing is a must. Do not waste a network engineer's time with "no work" or "bad speed" messages.

No Data; No Solution is Possible

Without data there can be no solution. The onus is always on the user to do the initial basic testing. Every large network operator will ignore requests that lack exact, well-collated data and a synopsis of the issue. Reports along the lines of "xyz is slow!" typically go to /dev/null.

Do not waste a networking professional's time without first doing the basic testing. If more information is requested, do not ignore the request; just complete the testing. Otherwise your report will again go to /dev/null.

Even when the issue is finally identified, it is most likely still going to be ignored if the report comes from a 3rd party. Do not ask your hosting provider to fix other operators' networks, and do not hold your provider responsible for them. If your testing identifies the issue somewhere other than your hosting provider, such as your ISP, contact your ISP directly.

Understanding Key Network Metrics: Jitter, Ping, Packet Loss, and Routes

For comprehensive network diagnostics, it’s essential to understand the significance of various network metrics. Each of these metrics offers valuable insights into the quality and reliability of a network connection.

Ping (Latency)

  1. Ping, or latency, measures the time it takes for data to travel from the source to the destination and back.
  2. It's crucial to have low latency for activities requiring real-time response, such as video conferencing and gaming.
  3. Latency directly affects potential network throughput due to TCP window sizes and TCP ACK packet delays. High latency means more packets need to be in-flight.
  4. The ping tool uses ICMP Echo packets to measure latency.
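
The relationship between latency and throughput in point 3 can be illustrated with the bandwidth-delay product. The following is a minimal sketch, not a tuning recommendation; the function names `max_tcp_mbps` and `bdp_bytes` and the example figures (a 65535-byte window, which is the largest TCP window without window scaling, a 100 ms round trip and a 1 Gbps link) are assumptions chosen for illustration.

```shell
#!/bin/sh
# Upper bound on single-connection TCP throughput for a given window:
# at most one window of data can be in flight per round trip.
# Usage: max_tcp_mbps WINDOW_BYTES RTT_MS
max_tcp_mbps() {
    awk -v w="$1" -v rtt="$2" 'BEGIN { printf "%.2f\n", (w * 8) / (rtt / 1000) / 1000000 }'
}

# Window needed to keep a link full (the bandwidth-delay product).
# Usage: bdp_bytes LINK_MBPS RTT_MS
bdp_bytes() {
    awk -v mbps="$1" -v rtt="$2" 'BEGIN { printf "%.0f\n", mbps * 1000000 / 8 * rtt / 1000 }'
}

max_tcp_mbps 65535 100   # 65535-byte window at 100 ms RTT
bdp_bytes 1000 100       # window needed for 1 Gbps at 100 ms RTT
```

With these example inputs the first call prints 5.24 (Mbps) and the second prints 12500000 (bytes, about 12.5 MB in flight), which is why a single connection over a high-latency path needs a large window to approach link speed.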

Jitter

  1. Jitter refers to the variation in delay (latency, "ping") of received packets.
  2. High jitter can cause issues in real-time applications like VoIP or online gaming, where consistent timing is crucial.
  3. Jitter is measured by observing the time difference between successive packets. Consistent packet timing leads to low jitter, which is desirable for a stable connection.
  4. High jitter on a high-latency link may trigger TCP retransmission timeouts and TCP window reductions, leading to slow total throughput.
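
As a rough illustration of point 3, jitter can be estimated from a list of round-trip times. This sketch uses one simple definition, the mean absolute difference between successive samples; measurement tools may compute jitter differently. The function name `jitter_ms` and the four sample values are made up for the example.

```shell
#!/bin/sh
# Mean absolute difference between successive RTT samples.
# Input: one RTT value in milliseconds per line on stdin.
jitter_ms() {
    awk '
        NR > 1 { d = $1 - prev; if (d < 0) d = -d; sum += d; n++ }
        { prev = $1 }
        END { if (n > 0) printf "%.2f\n", sum / n; else print "n/a" }
    '
}

# Four made-up RTT samples in milliseconds, for illustration only.
printf '%s\n' 10 12 11 15 | jitter_ms
```

The successive differences here are 2, 1 and 4 ms, so the call prints 2.33. A perfectly steady series of samples would print 0.00.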

Packet Loss

  1. Packet loss occurs when packets of data being transmitted across a network fail to reach their destination completely.
  2. It can be caused by network congestion, hardware failures, or signal degradation.
  3. High packet loss leads to interruptions and degradation in service quality, particularly affecting streaming, downloading, and online gaming.
  4. Packet loss causes TCP to shrink its window, resulting in low throughput. The higher the latency, the larger the effect.
  5. Typically tested with ping. Internet routers may themselves show packet loss in such tests, since ICMP Echo packets are handled by the router's CPU at the lowest priority.
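
To give a feel for point 4, the widely cited Mathis et al. approximation puts a rough ceiling on steady-state TCP throughput under random loss: rate ≈ (MSS / RTT) × (1.22 / √p). It models classic loss-based congestion control, so treat the numbers as order-of-magnitude only; other congestion control algorithms behave differently. The function name `mathis_mbps` and the inputs (1460-byte MSS, 1% loss) are assumptions for illustration.

```shell
#!/bin/sh
# Rough ceiling on single-connection TCP throughput under random loss
# (Mathis et al. approximation): rate ~ (MSS / RTT) * (1.22 / sqrt(p)).
# Usage: mathis_mbps MSS_BYTES RTT_MS LOSS_FRACTION
mathis_mbps() {
    awk -v mss="$1" -v rtt="$2" -v p="$3" 'BEGIN {
        printf "%.2f\n", (mss * 8) / (rtt / 1000) * (1.22 / sqrt(p)) / 1000000
    }'
}

mathis_mbps 1460 10 0.01    # 1% loss at 10 ms RTT
mathis_mbps 1460 100 0.01   # same loss at 100 ms RTT
```

With 1% loss this prints 14.25 Mbps at 10 ms and 1.42 Mbps at 100 ms: the same loss rate costs ten times more throughput on the path with ten times the latency.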

Routes

  1. The path or route taken by data packets across the network significantly affects overall network performance.
  2. Data can traverse multiple routers and networks, each potentially impacting speed and reliability.
  3. Understanding the routes helps in pinpointing where delays or packet losses are occurring, especially when troubleshooting network issues.
  4. Routes commonly differ in each direction, so testing both ways is important.
  5. A network provider cannot really affect which way a 3rd party sends its packets (route). This affects downstream latency and throughput.
  6. A network provider can control the route by which packets leave towards a 3rd party. This affects upstream latency and throughput.
  7. Routes are typically dynamic and can change often; they are rarely manually optimized per target.

By monitoring and analyzing these metrics, users can better understand the health and performance of their network connections. This knowledge is essential for diagnosing issues and optimizing network performance.


Testing methodologies

Speed Tests aka Throughput; Inherently Unreliable Testing

Speed tests can be an unreliable measure of network health. Network conditions constantly fluctuate, and third-party testing services have their limitations. Third-party testing servers are also often very busy. Speeds over 1Gbps are especially difficult to measure, and most 3rd party testing servers are 10Gbps max.

Key reasons why relying on speed tests alone is not advisable:

  1. Third-Party Networks: Speed tests often involve data traveling through networks outside the control of your hosting provider. These third-party networks can have varying performance due to their own traffic management policies and network health.
  2. Transits and Peerings: The path data takes typically goes through several links and networks, each potentially affecting speed and performance. The complexity of these routes means that a speed test to one location will yield completely different results compared to another, even if both are equidistant.
  3. Inconsistent Results Across Different Tests: Due to the complexities of internet routing, different speed tests can yield varying results. Each test may involve data traveling through distinct paths, encountering unique network conditions along the way.
  4. Network Variability: The internet's network conditions are in constant flux. This variability can result from traffic congestion, maintenance activities, and outages, all of which can temporarily impact speed test results.
  5. Third-Party Server Limitations: Most speed test results are dependent on the performance of third-party servers. These servers can be busy or have limitations in their capacity, especially for high-speed connections. The majority of third-party testing servers have a maximum capacity of 10Gbps, which can be insufficient for accurately measuring speeds over 1Gbps.
  6. Indication of Server Performance: Despite their limitations, if at least one or a few speed tests show good speeds, it's a strong indication that the server itself is functioning properly. Consistently high speeds in multiple tests, especially from different testing platforms, further reinforce this.
  7. Client-Side Factors: The accuracy of speed tests can also be influenced by factors on the user's end, such as local network issues, the performance of the testing device, and the browser or application used for the test. The most typical factor is Wi-Fi. If you are using Wi-Fi, start diagnosing from there. Experience shows that "to home" speed issues are almost always due to Wi-Fi.
  8. Limited Scope of Testing: Speed tests primarily measure bandwidth and latency but do not provide comprehensive insights into other critical aspects of network performance, such as packet loss, jitter, and the stability of the connection over time.
  9. User's Server Configuration: The configuration of your server plays a crucial role in network performance. Non-standard kernel configurations, especially those related to TCP window sizes and MTU, can significantly skew test results. TCP window sizes are vital as they determine how much data can be sent before requiring an acknowledgment. In scenarios with higher latency, the impact of incorrectly set window sizes becomes more pronounced, potentially leading to reduced throughput and performance issues.
  10. LACP/LAG, Link Aggregation: Links are typically built by aggregating multiple links, for example a bunch of 10G links to build a 100G link. This is done for cost reasons, but it limits the maximum performance per target or per connection, because a single connection is carried on one member link. For example, a 100G link built from 10x10G might have 50G utilization, which means a single connection can only reach 5G at most. This is normal operation.
  11. Dedicated or Guaranteed Does Not Exist: There is no such thing as dedicated or guaranteed throughput. It is impossible to provide this to any and all network targets; it would require full control of all devices, every piece of software, and individual fiber to and from every target.
  12. Traffic Shaping: Many ISPs are known to do traffic shaping in various ways. Sometimes it is refusing to upgrade a peering or transit link, sometimes it is protocol based, sometimes user based, and so on.
  13. Correct Expectations: You cannot expect a 56k dialup in rural India to be able to send or receive at speeds in excess of 56k. Getting throughput beyond 1Gbps on a single connection is difficult, and beyond 10Gbps nigh impossible. See LACP/LAG, Link Aggregation above.
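
The arithmetic behind the LACP/LAG example in point 10 can be sketched as follows. This assumes existing traffic is spread evenly across the member links and that a single connection is carried on exactly one member; real load distribution is rarely this even. The function name `lag_flow_headroom_gbps` is hypothetical.

```shell
#!/bin/sh
# Headroom left for one new connection on a link aggregate, assuming
# existing traffic is spread evenly over the member links and a single
# connection is carried on exactly one member link.
# Usage: lag_flow_headroom_gbps MEMBERS MEMBER_GBPS USED_GBPS
lag_flow_headroom_gbps() {
    awk -v n="$1" -v m="$2" -v used="$3" 'BEGIN {
        h = m - used / n
        if (h < 0) h = 0
        printf "%.1f\n", h
    }'
}

lag_flow_headroom_gbps 10 10 50   # the 10x10G example with 50G in use
```

For the 10x10G link at 50G utilization this prints 5.0, matching the 5G single-connection ceiling described in the list, even though the aggregate still has 50G free in total.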

Speed tests can offer some insights into network performance, but they should only be used as part of a broader diagnostic strategy. For a more accurate assessment of network health, combining speed tests with other tools like MTR analysis is recommended. This approach helps in identifying whether network issues are indeed related to the server or if they lie elsewhere in the complex web of internet connectivity.

Speed Testing Tools

There are many tools for testing; these are the most common.

Yabs.sh

A popular tool for general server performance testing. While limited, this is what a lot of people run by default. Since all tests are the same, it gives a decent indication of relative server performance, for the most part. It is well known that the limited number of network speed test servers used by yabs.sh are currently often congested.

Run yabs.sh:

wget -qO- yabs.sh | bash

network-speed.xyz

This is like yabs.sh but dedicated to network tests only. It runs a larger number of tests and allows choosing the region, among other options.

Run network-speed.xyz test:

wget -qO- network-speed.xyz | bash

Speedtest by Ookla

speedtest-cli (or speedtest.net) is another very common tool. It only tests against a single server, the closest it can find. Because it is geolocation-aware and picks a nearby server, it actually gives one of the more reliable results. These servers are sometimes congested as well, so if the first one gives you bad results, test against several; it could be that the particular test server is congested.

Do not use the Python version; it is known to have measurement issues.

Iperf3

The best tool for measuring point to point. It is heavily optimized and has many options for various testing methods and parallelism.

Installing on Debian / Ubuntu and starting a server is simple:

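# Install iperf3 (Debian / Ubuntu; run as root)
apt install -y iperf3
# Start a server (listens on TCP port 5201 by default)
iperf3 -s
# Perform a test from the other machine
iperf3 -c [Server IP Address]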
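
Since iperf3 offers options for direction and parallelism, a short test sequence might look like the sketch below. The wrapper name `run_iperf3_tests`, the 30-second duration and the choice of four streams are assumptions for illustration; `-t` (duration), `-R` (reverse, server sends) and `-P` (parallel streams) are standard iperf3 client options.

```shell
#!/bin/sh
# Run three iperf3 tests against a server that is already running "iperf3 -s".
# Usage: run_iperf3_tests SERVER_IP
run_iperf3_tests() {
    server=$1
    iperf3 -c "$server" -t 30          # single stream, client sends to server
    iperf3 -c "$server" -t 30 -R       # single stream, server sends to client
    iperf3 -c "$server" -t 30 -P 4 -R  # four parallel streams, server sends
}

# Example (192.0.2.10 is a documentation placeholder address):
# run_iperf3_tests 192.0.2.10
```

Comparing the single-stream and parallel results helps separate a per-connection limit (latency, window size, link aggregation) from a limit on the path as a whole.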

Network Diagnostics; Route, Ping, Jitter

Tools

The tools for basic network diagnostics are commonly the same. Others exist, but these are the ones most typically used.

MTR, WinMTR

This is the most important and essential tool; see Network Troubleshooting with MTR for a comprehensive guide. If you suspect an issue, always do an MTR test both ways, with a minimum of 1000 packets.

MTR gives you all the typical information in one tool: latency, packet loss, jitter and route.
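
On Linux, a text report that can be pasted into a support request can be produced with mtr's report mode. This is a minimal sketch; the wrapper name `mtr_report` is hypothetical, and the default of 1000 cycles simply follows the recommendation above. At mtr's default one-second interval, 1000 cycles take roughly 17 minutes.

```shell
#!/bin/sh
# Text-mode MTR report suitable for pasting into a support request.
# --report        print a summary after the run instead of the live view
# --report-wide   do not truncate long hostnames
# --report-cycles number of probes sent to each hop
# Usage: mtr_report TARGET [CYCLES]
mtr_report() {
    target=$1
    cycles=${2:-1000}
    mtr --report --report-wide --report-cycles "$cycles" "$target"
}

# Example (target.example is a placeholder hostname):
# mtr_report target.example > mtr-to-target.txt
```

Run the same report from the other end as well, so both directions of the route are captured.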

Ping

Tests latency between 2 end points. A simple basic test to quickly check whether you get a response. This should be run for 1000+ packets, or preferably for hours and hours, to gather long-term average data. You may lower the time between pings or run multiple pings concurrently.

To test, Linux or Windows;

ping [server ip or hostname, ie. google.com]

Traceroute

Traces the route to the other server: which network hops it has and how the packets are being routed _to_ the target. This only tests the direction towards the target; like with MTR, you need to run it from the target as well to get the complete picture.

To test, Linux;

traceroute [server ip or hostname, ie. google.com]

To test, Windows;

tracert [server ip or hostname, ie. google.com]

Effective Testing; How to get actual results

Determine if this is an actual issue

The following, at least, are not actual issues:

  1. I have 56k dialup, I'm not getting 10Gbps download! 😡
  2. Shared service, entry-level service, shared/fair-share connections: getting only, say, 300Mbps per connection max
  3. 1 target out of 1000 is slower than the others
  4. Seedbox swarm has no leechers
  5. Seedbox swarm has leechers but is only getting low speeds (you cannot force data on anyone!)
  6. My friend in Antarctica/North Korea/space station/Moon/Mars/Venus is only getting 5Kbps. Joke, right? Well, at least he is getting a connection. How did he get to Mars?
  7. You are on 4G, 5G or Wi-Fi
  8. Over 1Gbps on a single connection / thread, which is more than acceptable (12/2023)
  9. Cannot do 100% link utilization 24/7/365, and did not buy a dedicated link
  10. I paid 5€ for my service, not getting 10Gbps 24/7/365, YOU SCAMMER!
  11. From this single random provider in Uzbekistan/Mongolia/Zimbabwe/Papua New Guinea/etc. I can only pull X speed! (but everything else is fast)
  12. Issue has persisted for less than 48 hours and is not a complete connection loss (low throughput, jitter, dropouts)
  13. Can only get 100MiB/s on my 1Gbps link, not even close to 1000MiB/s! You fraud! (1Gbps is bits per second: roughly 119MiB/s at best)
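
The last item in the list comes down to confusing bits with bytes. A quick conversion sketch, ignoring protocol overhead (so real transfers come in slightly lower); the function name `mbps_to_mibs` is hypothetical:

```shell
#!/bin/sh
# Convert a link speed in Mbps (10^6 bits per second) to MiB/s
# (2^20 bytes per second). Protocol overhead is ignored.
# Usage: mbps_to_mibs LINK_MBPS
mbps_to_mibs() {
    awk -v mbps="$1" 'BEGIN { printf "%.1f\n", mbps * 1000000 / 8 / 1048576 }'
}

mbps_to_mibs 1000    # 1 Gbps link
mbps_to_mibs 300     # 300 Mbps per-connection speed
```

This prints 119.2 and 35.8, so about 100MiB/s on a 1Gbps link is already close to the theoretical maximum.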


First Steps

Do these first. Most of the time this alone solves the issue.


Connection issue to Home/Office

  1. Make sure you are on a wired connection
  2. Reset your modem/gateway/router
  3. Check if the issue is persistent 24/7
  4. Try a different ISP/connection as well. Does the issue persist?

Connection issue to Mobile (4G, 5G, etc)

  1. See Connection issue to Home/Office
  2. Change location or wait until past midnight. If that solves the issue, the mobile network was the cause.
  3. Get a proper wired connection or contact your carrier.

Dedicated Server / VPS

  1. Make sure you don't have "optimizations" such as sysctl/kernel tuning enabled. Bring these back to defaults.
  2. Make sure you are not using a lot of bandwidth while testing (check with e.g. vnstat or bwm-ng)
  3. Make sure you don't have heavy tasks running

Write it down; Take notes; Proper Reports

Collect your data. Yes, that means opening notepad! (or kate, or gedit, or your favorite text editor)

After all tests are done, you may format the data and make the support request. Make certain the formatting is easy to read and that the highlights of the testing are at the top. A synopsis of 1-3 short sentences is important, so whoever reads your report knows immediately what to look for.

No one will read incomprehensible, messy text and try to decipher what is what. No one wants to open random images without a reason, and especially no one will open random links to random sites.

Sometimes screenshots can be easier (e.g. WinMTR), so feel free to save them, but do not expect anyone to look at them instead of the text data.
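
One way to keep notes consistent is to append every test to a single text file with a label and a timestamp. This is only a sketch of that idea; the helper name `log_test`, the file name and the header layout are assumptions, not a required report format.

```shell
#!/bin/sh
# Append a labelled, timestamped command and its output to a notes file.
# Usage: log_test NOTES_FILE "label" command [args...]
log_test() {
    notes=$1
    label=$2
    shift 2
    {
        printf '=== %s (%s UTC) ===\n' "$label" "$(date -u '+%Y-%m-%d %H:%M:%S')"
        printf '$ %s\n' "$*"
        "$@" 2>&1
        echo
    } >> "$notes"
}

# Example (target.example is a placeholder hostname):
# log_test notes.txt "MTR to target" mtr --report --report-cycles 1000 target.example
```

Because each entry records when the test ran and the exact command used, time-of-day patterns are easy to spot later when writing the synopsis.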

Testing Processes

The process changes depending on what you are experiencing. Here are some common approaches.

Generic throughput issues consistently; No specific target / Multiple Targets

  1. Make sure your server is not already at high network utilization
  2. Repeat tests multiple times, at multiple times of the day, to determine whether the time of day affects the results.
  3. Do at least several types of tests (Yabs.sh alone is not sufficient, and will be ignored)
  4. Do network-speed.xyz and iperf3 tests
  5. Determine if this is an actual issue. Getting "only" 900Mbps on a 1Gbps link is not an issue. One specific network being slow for 30 minutes is not an issue. High-latency links not reaching 75%+ of link speed on a single thread are not an issue.
  6. Supply your IP, with an iperf3 server running, when opening a support request with your provider

Throughput speed jitter; Specific target

This is when speeds are fine, but then suddenly plummet and slowly ramp back up. This happens after a TCP retransmit event or packet loss. The goal is to determine whether there is jitter and/or packet loss, and whether it is widespread or limited to a single target. MTR can also show route flapping.

  1. Do MTR both ways while the issue is happening
  2. If the issue is intermittent, do MTR both ways also while it is not happening. Mark down the times of day
  3. Do throughput testing from the server via network-speed.xyz _while_ it is happening and _while_ it is not happening.

Connections drop randomly

We are looking for periods of heavy packet loss here, or total loss of connectivity.

  1. Run MTR and/or ping against the target continuously, preferably both ways.
  2. Ping the gateway (gateways regularly show packet loss; if yours does not, that indicates the connection _to_ the gateway works)
  3. Try to measure how long the connection drops for.
  4. If this is a server with Pulsed Media, request support to check your server IP's latency & packet loss graphs (smokeping).
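
To measure how long the connection drops for (step 3), a simple loop can log a timestamp for every missed ping. This is a minimal sketch assuming Linux iputils ping, where `-c` is the packet count and `-W` the reply timeout in seconds; the function name `watch_drops` and the default of 3600 probes are hypothetical.

```shell
#!/bin/sh
# Ping TARGET about once per second and print a timestamp for every
# missed reply, so the start and length of each dropout can be read
# from the output. Assumes Linux iputils ping (-c count, -W timeout).
# Usage: watch_drops TARGET [PROBES]
watch_drops() {
    target=$1
    remaining=${2:-3600}
    while [ "$remaining" -gt 0 ]; do
        if ! ping -c 1 -W 1 "$target" > /dev/null 2>&1; then
            echo "$(date '+%Y-%m-%d %H:%M:%S') no reply from $target"
        fi
        remaining=$((remaining - 1))
        sleep 1
    done
}

# Example (192.0.2.1 is a documentation placeholder address):
# watch_drops 192.0.2.1 >> drops.txt
```

A run of consecutive timestamps marks one dropout; its first and last lines give the start time and approximate duration to include in the report.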

No Connection At All (100% Packet Loss)

The server has probably crashed. 100% packet loss for 15+ minutes is alone more than sufficient for opening a ticket; no further data is needed.