Difference between revisions of "Network Diagnostics"
Revision as of 14:52, 29 December 2023
Contents
- 1 Network Diagnostics Comprehensive Guide
- 1.1 Why In-Depth Network Testing is Essential
- 1.2 Understanding Key Network Metrics: Jitter, Ping, Packet Loss, and Routes
- 1.3 Testing methodologies
- 1.4 Network Testing, Diagnosis and Reporting; How To Get Actual Results
- 1.4.1 Determine if this is actual issue and set expectations
- 1.4.2 First Steps
- 1.4.3 Write it down; Take notes; Proper Reports
- 1.4.4 Testing Processes
- 1.4.5 Submit a Report
- 1.4.6 Common Misconceptions
- 1.4.6.1 Higher Bandwidth Equals Faster Internet
- 1.4.6.2 Wired Connection Is Always Stable
- 1.4.6.3 Speed Test Results Are Absolute
- 1.4.6.4 Low Latency is Guaranteed on High-Speed Networks
- 1.4.6.5 Packet Loss is Always a Sign of a Bad Connection
- 1.4.6.6 All Network Issues Can Be Resolved by the ISP or Hosting Provider
- 1.4.6.7 Any Network Problem Can Be Quickly Fixed
- 1.4.6.8 Network Performance is Consistent Across All Applications
- 1.5 Real Life Examples of Issues
Network Diagnostics Comprehensive Guide
Network performance is crucial for the smooth operation of dedicated servers. However, diagnosing network issues can be very challenging. This document aims to guide you through effective network testing methods to identify potential issues. Understanding the nature of the internet, a vast network with numerous interconnected routes and nodes, is key to recognizing why network issues are often not within the immediate control of your hosting provider.
Network diagnosis can be extremely challenging at times, so cooperation from the reporting party (the end user) is a must. Sometimes the issues seem like dark magic.
Why In-Depth Network Testing is Essential
Internet's Complex Nature
- The internet is a complex web of interconnected networks. It's common for routes to experience disruptions due to various factors like maintenance, outages, or heavy traffic (congestion).
- Always Broken: The internet is always broken somewhere, all the time. Fixes can sometimes take months or even years to implement, at extreme cost. Be patient.
- Different connections follow different paths across the network, leading to varied experiences for different users.
- Every connection can potentially involve tens of thousands of components; an issue with even one of them can cause disruptions.
Identifying the True Source of the Issue
- In most cases (over 90%), reported network issues are not directly related to the server or hosting provider.
- Problems often lie in the route your data takes through the internet, which involves third-party networks. Third-party networks are not under the control of your hosting provider.
- Network issues are notoriously difficult to diagnose, so proper testing is a must. Do not waste a network engineer's time with "no work" or "bad speed" messages.
No Data; No Solution is Possible
Without data there can be no solution. The onus is always on the user to do the initial basic testing; every large network operator will ignore requests that lack exact, well-collated data and a synopsis of the issue. Reports along the lines of "xyz is slow!" typically go to /dev/null.
Do not waste a networking professional's time without first doing the basic testing, and if more information is requested, never ignore the request; just complete the testing. Otherwise your report will again go to /dev/null.
Sometimes, even when the issue is finally identified, it is most likely still going to be ignored if the report comes from a 3rd party. Do not ask your hosting provider to fix other operators' networks, or hold it responsible for doing so. If your testing places the issue somewhere other than your hosting provider, such as at your ISP, contact your ISP directly.
Understanding Key Network Metrics: Jitter, Ping, Packet Loss, and Routes
For comprehensive network diagnostics, it’s essential to understand the significance of various network metrics. Each of these metrics offers valuable insights into the quality and reliability of a network connection.
Ping (Latency)
- Ping, or latency, measures the time it takes for data to travel from the source to the destination and back.
- It's crucial to have low latency for activities requiring real-time response, such as video conferencing and gaming.
- Latency directly affects potential network throughput due to TCP window sizes and TCP ACK packet delays. High latency means more packets need to be in flight to sustain the same throughput.
- Ping uses ICMP Echo packets to measure latency.
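The link between latency and throughput described above can be made concrete with the bandwidth-delay product: the amount of data that has to be in flight to keep a link full. The following is a minimal sketch, assuming `awk` is available; the function names `bdp_bytes` and `window_limited_mbps` are made up for illustration and are not part of any tool mentioned in this guide.

```shell
# Bandwidth-delay product: bytes that must be in flight to fill the link.
# Usage: bdp_bytes <link_mbps> <rtt_ms>
bdp_bytes() {
  awk -v mbps="$1" -v rtt_ms="$2" \
    'BEGIN { printf "%d\n", mbps * 1e6 / 8 * rtt_ms / 1000 }'
}

# Upper bound on single-connection throughput for a given TCP window.
# Usage: window_limited_mbps <window_bytes> <rtt_ms>
window_limited_mbps() {
  awk -v win="$1" -v rtt_ms="$2" \
    'BEGIN { printf "%.2f\n", win * 8 / (rtt_ms / 1000) / 1e6 }'
}
```

For example, `bdp_bytes 1000 100` prints 12500000: a 1Gbps path with 100ms round-trip time needs about 12.5MB in flight. Conversely, `window_limited_mbps 65535 100` prints 5.24: a connection stuck with a 64KiB window cannot exceed roughly 5Mbps at that latency, no matter how fast the link is.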
Jitter
- Jitter refers to the variation in delay (latency, "ping") of received packets.
- High jitter can cause issues in real-time applications like VoIP or online gaming, where consistent timing is crucial.
- Jitter is measured by observing the time difference between successive packets. Consistent packet timing leads to low jitter, which is desirable for a stable connection.
- High jitter on a high-latency link may lead to TCP retransmit events and TCP window issues, resulting in low overall throughput.
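One simple way to put a number on jitter is the average absolute difference between successive round-trip times. The sketch below assumes you already have RTT samples in milliseconds, one per line (for example copied from ping output); `jitter_ms` is a hypothetical helper name, and this is a plain average, not the smoothed estimator that some tools report.

```shell
# Mean absolute difference between successive RTT samples (ms).
# Reads one sample per line from stdin.
jitter_ms() {
  awk 'NR > 1 { d = $1 - prev; if (d < 0) d = -d; sum += d; n++ }
       { prev = $1 }
       END { if (n > 0) printf "%.2f\n", sum / n; else print "0.00" }'
}
```

Feeding it the samples 10, 12, 11 and 15 (`printf '10\n12\n11\n15\n' | jitter_ms`) prints 2.33, while perfectly steady samples give 0.00.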
Packet Loss
- Packet loss occurs when packets of data being transmitted across a network fail to reach their destination.
- It can be caused by network congestion, hardware failures, or signal degradation.
- High packet loss leads to interruptions and degradation in service quality, particularly affecting streaming, downloading, and online gaming.
- Packet loss makes TCP shrink its window, which lowers throughput. The higher the latency, the larger the effect.
- Typically tested with ping. Internet routers may show packet loss on their own hops, since ICMP Echo packets are handled by the router's CPU at the lowest priority.
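To get a feel for why loss hurts more at higher latency, a rough rule of thumb is the widely cited Mathis approximation for classic loss-based TCP: throughput is at most about (MSS / RTT) x (C / sqrt(loss)), with C = sqrt(3/2). This is only a sketch under that assumption; modern congestion control algorithms can behave quite differently, and `loss_limited_mbps` is an illustrative name, not an existing tool.

```shell
# Rough ceiling on one TCP connection: (MSS / RTT) * (C / sqrt(loss)), C = sqrt(3/2).
# Usage: loss_limited_mbps <mss_bytes> <rtt_ms> <loss_fraction>
loss_limited_mbps() {
  awk -v mss="$1" -v rtt_ms="$2" -v p="$3" \
    'BEGIN { printf "%.2f\n", (mss * 8 / (rtt_ms / 1000)) * sqrt(1.5) / sqrt(p) / 1e6 }'
}
```

With a 1460 byte MSS, 100ms round-trip time and 1% loss, `loss_limited_mbps 1460 100 0.01` prints 1.43, i.e. about 1.4Mbps per connection. The same loss rate at 10ms allows roughly ten times more.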
Routes
- The path or route taken by data packets across the network significantly affects overall network performance.
- Data can traverse multiple routers and networks, each potentially impacting speed and reliability.
- Understanding the routes helps in pinpointing where delays or packet losses are occurring, especially when troubleshooting network issues.
- Routes commonly differ in each direction, so testing both ways is important.
- A network provider cannot really affect which way a 3rd party sends its packets (the return route); this affects downstream latency and throughput.
- A network provider can control the route by which packets leave towards a 3rd party; this affects upstream latency and throughput.
- Routes are typically dynamic and can change often; they are rarely manually optimized per target.
By monitoring and analyzing these metrics, users can better understand the health and performance of their network connections. This knowledge is essential for diagnosing issues and optimizing network performance.
Testing methodologies
Speed Tests aka Throughput; Inherently Unreliable Testing
Speed tests, while popular, can be an unreliable measure of network health. Network conditions constantly fluctuate, and third-party testing services have their limitations: the test servers are often very busy, and most of them top out at 10Gbps, so speeds over 1Gbps are especially difficult to measure. Here are key reasons why reliance on speed tests alone is not advisable:
- Third-Party Networks: Speed tests often involve data traveling through networks outside the control of your hosting provider. These third-party networks can have varying performance due to their own traffic management policies and network health.
- Transits and Peerings: The path data takes typically goes through several links and networks, each potentially affecting speed and performance. The complexity of these routes means that a speed test to one location will yield completely different results compared to another, even if both are equidistant.
- Inconsistent Results Across Different Tests: Due to the complexities of internet routing, different speed tests can yield varying results. Each test may involve data traveling through distinct paths, encountering unique network conditions along the way.
- Network Variability: The internet's network conditions are in constant flux. This variability can result from traffic congestion, maintenance activities, and outages, all of which can temporarily impact speed test results.
- Third-Party Server Limitations: Most speed test results are dependent on the performance of third-party servers. These servers can be busy or have limitations in their capacity, especially for high-speed connections. The majority of third-party testing servers have a maximum capacity of 10Gbps, which can be insufficient for accurately measuring speeds over 1Gbps.
- Indication of Server Performance: Despite their limitations, if at least one or a few speed tests show good speeds, it's a strong indication that the server itself is functioning properly. Consistently high speeds in multiple tests, especially from different testing platforms, further reinforce this.
- Client-Side Factors: The accuracy of speed tests can also be influenced by factors on the user's end, such as local network issues, the performance of the testing device, and the browser or application used for the test. The most typical one is Wi-Fi. If you are using Wi-Fi, start diagnosing from there. Experience shows "to home" speed issues are almost always due to Wi-Fi.
- Limited Scope of Testing: Speed tests primarily measure bandwidth and latency but do not provide comprehensive insights into other critical aspects of network performance, such as packet loss, jitter, and the stability of the connection over time.
- User's Server Configuration: The configuration of your server plays a crucial role in network performance. Non-standard kernel configurations, especially those related to TCP window sizes and MTU, can significantly skew test results. The TCP window size is vital as it determines how much data can be sent before requiring an acknowledgment; in scenarios with higher latency, the impact of incorrectly set window sizes becomes more pronounced, potentially leading to reduced throughput and performance issues.
- LACP/LAG, Link Aggregation: Links are typically built by aggregating multiple links, for example a bunch of 10G links to build a 100G link. This is done for cost reasons, but it limits the maximum per-target or per-connection performance. For example, a 100G link made from 10x10G might be at 50G utilization, which means a single connection can only reach about 5G at most. This is normal operation.
- Guaranteed Does Not Exist In Reality: There is no such thing as guaranteed throughput. It is impossible to provide to any and all network targets; it would require full control of all devices, every piece of software, and an individual fiber to every target, from every target.
- Traffic Shaping: Many ISPs are known to do traffic shaping in various ways. Sometimes it is refusing to upgrade a peering or transit link, sometimes it is protocol based, sometimes user based, and so on.
- Correct Expectations: You cannot expect a 56k dialup line in rural India to send or receive faster than 56k. Throughput beyond 1Gbps on a single connection is difficult, beyond 10Gbps nigh impossible. See LACP/LAG, Link Aggregation above.
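The link aggregation example in the list above is simple arithmetic, sketched here as a small helper. It assumes that a single connection is pinned to one member link and that existing traffic is spread evenly across all members; real hashing is rarely that even, and `lag_flow_headroom_gbps` is a name invented for this illustration.

```shell
# Headroom left for one connection on a link aggregate.
# Usage: lag_flow_headroom_gbps <member_count> <member_gbps> <total_used_gbps>
lag_flow_headroom_gbps() {
  awk -v n="$1" -v member="$2" -v used="$3" \
    'BEGIN { printf "%.1f\n", member - used / n }'
}
```

`lag_flow_headroom_gbps 10 10 50` prints 5.0: on a 10x10G bundle carrying 50G in total, each member already carries about 5G, so a new single connection has about 5G left.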
Speed tests can offer some insights into network performance, but they should only be used as part of a broader diagnostic strategy. For a more accurate assessment of network health, combining speed tests with other tools like MTR analysis is recommended. This approach helps in identifying whether network issues are indeed related to the server or if they lie elsewhere in the complex web of internet connectivity.
Speed Testing Tools
There are a lot of tools for testing; these are the most common.
Yabs.sh
A popular tool for general server performance testing. While limited, this is what a lot of people run by default. Since all the tests are the same everywhere, it gives a decent indication of relative server performance, for the most part. It is well known that the limited number of network speed test servers yabs.sh uses are often congested at this time.
Run yabs.sh:
wget -qO- yabs.sh | bash
network-speed.xyz
This is like yabs.sh but dedicated to network tests only. It runs a larger number of tests and allows choosing the region, etc.
Run network-speed.xyz test:
wget -qO- network-speed.xyz | bash
Speedtest by Ookla
speedtest-cli is another very common tool, as is speedtest.net. This tool only tests against a single server, the closest it can find. Because it is geolocation aware, it actually gives one of the more reliable results: a nearby server. Sometimes these servers are congested as well, so if the first one gives you bad results, test against several; it could be that the particular test server is congested.
Do not Use The Python Version -- This is known to have measurement issues.
Iperf3
The best tool for measuring point to point. It is heavily optimized and has many options for various testing methods and parallelism.
Installing on Debian / Ubuntu and starting a server is simple;
apt install -y iperf3
iperf3 -s
# Performing a test:
iperf3 -c [Server IP Address]
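Going one step beyond the basic client command, the sketch below runs a test in both directions with several parallel streams. It assumes an iperf3 server is already listening on the other end; `iperf3_both_ways` and its defaults (4 streams, 30 seconds) are arbitrary choices for illustration, while `-P`, `-t` and `-R` are standard iperf3 client options.

```shell
# Usage: iperf3_both_ways <server> [parallel_streams] [seconds]
iperf3_both_ways() {
  # Upload direction: this machine sends to the iperf3 server.
  iperf3 -c "$1" -P "${2:-4}" -t "${3:-30}"
  # Download direction: -R reverses the test so the server sends to this machine.
  iperf3 -c "$1" -P "${2:-4}" -t "${3:-30}" -R
}
```

Comparing one stream (`iperf3_both_ways [Server IP Address] 1`) against several helps tell a per-connection limit apart from a limit on the whole path.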
Network Diagnostics; Route, Ping, Jitter
Tools
The tools for basic network diagnostics are commonly the same everywhere. Others exist, but these are the ones most typically used.
MTR, WinMTR
This is the most important and essential tool; see Network Troubleshooting with MTR for a comprehensive guide. Always do an MTR test both ways, with a minimum of 1000 packets, if you suspect an issue.
MTR gives you all typical information; Latency, Packet Loss, Jitter, Route -- all in one.
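A 1000 packet MTR run is easiest to share in report mode, which prints a plain text table when it finishes. This is a minimal sketch for the Linux `mtr` command; the wrapper name `mtr_report` and the header line it prints are illustrative additions.

```shell
# Usage: mtr_report <target> [cycles]
# -r report mode, -w wide report (full hostnames), -b show both IPs and hostnames,
# -c number of packets (cycles) to send to each hop.
mtr_report() {
  echo "# mtr to $1 started $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  mtr -r -w -b -c "${2:-1000}" "$1"
}
```

Run it once from your end towards the server and once from the server towards your end, appending both to your notes, e.g. `mtr_report [server ip or hostname] >> notes.txt`.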
Ping
Tests latency between 2 endpoints. A simple, basic test to quickly check if you get a response. This should be run for 1000+ packets, or preferably for hours and hours, to gather long-term average data. You may lower the time between pings or run multiple concurrently.
To test, Linux or Windows;
ping [server ip or hostname, ie. google.com]
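For the longer runs recommended above, the count and interval can be set explicitly. This sketch assumes the Linux (iputils) `ping`, where `-c` is the packet count and `-i` the interval in seconds; on Windows the count option is `-n` instead. `ping_long` is just an illustrative wrapper name.

```shell
# Usage: ping_long <target> [count] [interval_seconds]
ping_long() {
  ping -c "${2:-1000}" -i "${3:-0.2}" "$1"
}
```

With the defaults, 1000 packets at 0.2 second intervals take a little over three minutes, and the summary at the end gives packet loss plus min/avg/max round-trip times for your notes.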
Traceroute
Traces the route to the other server: which network hops it has and how the packets are being routed _to_ the target. This only tests the direction from you to the target; you need to run it from the target as well to get the complete picture, like with MTR.
To test, Linux;
traceroute [server ip or hostname, ie. google.com]
To test, Windows;
tracert [server ip or hostname, ie. google.com]
Network Testing, Diagnosis and Reporting; How To Get Actual Results
Now, we get to the difficult bit.
- Isolate the issue, remove all other variables, test only single thing at a time.
- No application performance testing here. Use only dedicated network testing tools (no SFTP, FTP, Email, Web, VoIP server etc.)
- Isolate the location of the issue (ie. Your Server, Other Server, Your Home/Office)
- Isolate the type of the issue
Determine if this is actual issue and set expectations
These at least are not actual issues;
- I have 56k dialup, I'm not getting 10Gbps download! 😡 - Reality check: Time travel isn't a feature yet! (Give us a few more moments to work on that!)
- Shared service, entry level service, shared/fairshare connections; Getting only say 300Mbps per connection max. It's not a bug; it's a budget!
- One Slow Target: 1 target out of 1000 is slower than others, it's likely an issue with that particular target, not your connection.
- Seedbox swarm has no leechers
- Seedbox swarm has leechers but only getting low speeds (you cannot force data to anyone!)
- Intergalactic Connectivity Woes: My friend in Antarctica/North Korea/space station/Moon/Mars/Venus is only getting 5Kbps. Joke, right? Well, at least they are getting a connection. How did he get to Mars?
- Mobile and Wireless Limitations: You are on 4G, 5G or WIFI
- SFTP Downloads are slow. Isolate first, that might be testing CPU or I/O Performance, not network performance.
- Speed Overload: Over 1Gbps on a single connection / thread is more than acceptable (as of 12/2023)
- Cannot do 100% Link 24/7/365, Did not buy dedicated link. Remember, it's a shared road, not a private racetrack.
- I paid 5€ for my service, not getting 10Gbps 24/7/365, YOU SCAMMER! The only scam here is expecting caviar at fast-food prices!
- This single random provider in Uzbekistan/Mongolia/Zimbabwe/Papua New Guinea/etc. I can only pull X speed from! (but everything else is fast)
- Patience is Key: Issue has persisted less than 48hrs and is not complete connection loss (low throughput, jitter, dropouts). The internet is a living, evolving entity.
- Misunderstanding Capacity: Can Only Get 100MiB/s on my 1Gbps Link, Not even Close To 1000MiB/s! You Fraud! - A gentle reminder: Bits and bytes are different; your math might need recalibrating!
- Selective Slowdowns: If only one random provider in a remote location is slow but everything else is fast, it's likely an issue specific to that route or provider.
- MTR Shows packet loss at Hop 5 out of 9: That's not an issue; remember that routers regularly drop ICMP Echo packets. To count, packet loss has to persist from one hop onwards all the way to the final target.
- Speed of Light Isn't Fast Enough: I clicked and it didn't happen instantly! - Unless you've discovered faster-than-light travel, a tiny delay is normal.
- I Want My Video Stream... NOW!: My streaming video buffered for a whole two seconds! - Patience, young Jedi. Even the Force needs a moment to load.
- It Works on My Friend's Cousin's Neighbor's Network (or other network XYZ): But this one random person I know gets better speeds! - And I have a friend who claims they saw a unicorn... Every network is different, just because network A is fast from network B does not mean network C behaves the same.
- Peak Hours? What's That?: Why is my transfer slow at 20:00?? Probably for the same reason the highway is slow at 16:30. It's rush hour.
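The bits-versus-bytes item in the list above comes down to one line of arithmetic: divide by 8 to get bytes, then by 1048576 to get MiB. A minimal sketch, with `mbps_to_mib_per_s` as an illustrative helper name:

```shell
# Convert a link speed in megabits per second to MiB per second.
# Usage: mbps_to_mib_per_s <megabits_per_second>
mbps_to_mib_per_s() {
  awk -v mbps="$1" 'BEGIN { printf "%.2f\n", mbps * 1e6 / 8 / 1048576 }'
}
```

`mbps_to_mib_per_s 1000` prints 119.21, so a 1Gbps link tops out at roughly 119MiB/s before any protocol overhead. Seeing 100MiB/s or a bit more on such a link means it is essentially full.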
Bandwidth at provider level (IP Transit) is actually extremely expensive, sometimes 100s of times more expensive than you think. Providers pay much more for bandwidth because it is meant to be used at a high level 24/7, comes with SLA and support contracts, and all the networking hardware costs far more than home/small office hardware. So set your expectations based on the level of service you actually bought: if you paid 40€ for a year of shared entry-level 10Gbps, you really shouldn't expect to push petabytes of data each month, just to be able to burst at that level most of the time. This is why a dedicated 10Gbps link may cost you 750€ a month, and even then it is dedicated only as far as the provider's network edge. Contention ratios exist everywhere in networking; everything is "oversold", especially at ISPs. These contention ratios may go even above 1000:1, i.e. 10 000x 1Gbps links sold with only 10Gbps of bandwidth behind them (home users use very, very little bandwidth in reality).
While actual issues are important to solve, we should also recognize what is not actually an issue. Most often your hosting provider cannot do anything about it either; the issue almost always resides somewhere else.
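The contention ratio in the example above is simply the total capacity sold divided by the capacity actually available. As a quick sketch (the helper name `contention_ratio` is made up for this illustration):

```shell
# Usage: contention_ratio <links_sold> <link_gbps> <uplink_gbps>
contention_ratio() {
  awk -v n="$1" -v link="$2" -v uplink="$3" \
    'BEGIN { printf "%d:1\n", n * link / uplink }'
}
```

`contention_ratio 10000 1 10` prints 1000:1, matching the ISP example: ten thousand 1Gbps subscriptions sharing 10Gbps of upstream bandwidth.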
First Steps
Before you call in the cavalry, try these simple steps. Often, they're all you need to solve the issue.
Connection issue to Home/Office
- Make sure you are on wired connection. Wireless is like weather – rather unpredictable.
- Reset your modem/gateway/router. It's cliché, but it works more often than you'd think.
- Check your network cables are actually attached (and intact).
- Check if this is persistent 24/7. Is the issue there all the time, or does it come and go like a mysterious ghost?
- Try with different ISP/Connection as well, does it persist? If the problem persists, it's likely not your primary connection's fault.
Connection issue to Mobile (4G, 5G, etc)
- See Connection issue to Home/Office
- Change location or wait until past midnight. If that solves the issue, the mobile network itself is the bottleneck.
- Get a proper wired connection or contact your carrier.
Dedicated Server / VPS
- Make sure you don't have "optimizations" enabled (sysctl/kernel tuning). Revert to defaults.
- Make sure you are not using a lot of bandwidth while testing (check with e.g. vnstat, bwm-ng)
- Make sure you don't have heavy tasks running
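To check for leftover kernel "optimizations", it helps to list every setting that is actually active in the sysctl configuration files. The sketch below assumes the usual Linux layout where tuning lives in `/etc/sysctl.conf` and `/etc/sysctl.d/`; `list_sysctl_overrides` is an illustrative name, and it only filters out comments and blank lines.

```shell
# Print every active (non-comment, non-blank) line from the given sysctl config files.
# Usage: list_sysctl_overrides /etc/sysctl.conf /etc/sysctl.d/*.conf
list_sysctl_overrides() {
  grep -hEv '^[[:space:]]*(#|;|$)' "$@" 2>/dev/null
}
```

Anything network related in the output (keys starting with `net.`) is a candidate for removal before testing. The values currently in effect can be read with, for example, `sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.ipv4.tcp_congestion_control`.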
Write it down; Take notes; Proper Reports
Collect your data, yes, that means opening notepad! (or kate, or gedit! your favorite text editor)
After all tests are done, you may now format your notes and make the support request. Make certain the formatting is easy to read and that the highlights of the testing are on top. A synopsis of 1-3 short sentences is important, so whoever reads your report knows immediately what to look for.
No one will read incomprehensible, messy text and try to decipher what is what. No one wants to open random images without a reason, and especially no one will open random links to random sites.
Sometimes screenshots can be easier (ie. WinMTR), so feel free to save them, but do not expect anyone to look at them over the text data.
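One way to keep notes in a readable shape from the start is to begin with a skeleton that puts the synopsis on top. The layout below is only a suggestion; the section names and the helper `new_report` are assumptions for illustration, not a format any provider requires.

```shell
# Write an empty report skeleton to the given file.
# Usage: new_report <output_file>
new_report() {
  cat > "$1" <<'EOF'
SYNOPSIS (1-3 short sentences):

AFFECTED (server IP, remote IP, time window with timezone):

WHAT WAS TESTED (tool, direction, time of day):

KEY FINDINGS (highlights first):

RAW DATA (MTR both ways, iperf3, ping), pasted as text:
EOF
}
```

Create it with `new_report notes.txt` before testing and fill it in as you go, pasting tool output as text under the last heading.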
Testing Processes
The process changes depending upon what you are experiencing; here are some common cases.
Generic throughput issues consistently; No specific target / Multiple Targets
- Make sure your server is not already at high network utilization
- Repeat tests multiple times, at multiple times of the day; Determine if time of the day affects this.
- Do at least several types of tests (Yabs.sh alone is not sufficient, and will be ignored)
- Do network-speed.xyz and iperf3 tests
- Determine if this is an actual issue. Getting "only" 900Mbps on a 1Gbps link is not an issue. One specific network being slow for 30 minutes is not an issue. High-latency links not reaching 75%+ of link speed on a single thread are not an issue.
- Supply your IP with iperf3 server running when opening a support request with your provider
Throughput speed jitter; Specific target
This is when speeds are fine, but all of a sudden they plummet and then slowly ramp back up. This happens when there has been a TCP retransmit event, or even packet loss. The goal is to determine if there is jitter and/or packet loss, and whether it is widespread or limited to a single target. MTR can also show route flapping.
- Do MTR both ways while the issue is happening.
- If the issue is intermittent, do MTR both ways also while it's not happening. Mark down the times of day.
- Do throughput testing from the server via network-speed.xyz _while_ it is happening and _while_ it is not happening.
Connections drop randomly
We are looking for periods of heavy packet loss here, or total loss of connectivity.
- Run MTR and/or ping against the target consistently, preferably both ways.
- Ping the gateway (gateways regularly show packet loss, but if there is none, it is an indicator that the connection _to_ the gateway works)
- Try to measure how long the connection drops last.
- If this is a server with Pulsed Media, request support to check your server ip latency & packet loss graphs (smokeping).
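Measuring how long the drops last can be done with a simple once-per-second reachability log and a summary over it. This is a rough sketch assuming Linux `ping` (`-c 1` sends one packet, `-W 1` waits at most one second for the reply); `watch_target`, `outage_summary` and the log format are invented for this example, and the durations are only accurate to about a second or two.

```shell
# Log one line per second: "<epoch> ok" or "<epoch> fail". Stop with Ctrl-C.
# Usage: watch_target <target> >> reach.log
watch_target() {
  while true; do
    if ping -c 1 -W 1 "$1" > /dev/null 2>&1; then state=ok; else state=fail; fi
    echo "$(date +%s) $state"
    sleep 1
  done
}

# Summarize outages from that log: start time and duration of each run of "fail".
# Usage: outage_summary < reach.log
outage_summary() {
  awk '$2 == "fail" && !down { down = 1; start = $1 }
       $2 == "ok"   &&  down { down = 0; printf "outage at %d lasted about %d s\n", start, $1 - start }
       END { if (down) printf "outage at %d still ongoing at end of log\n", start }'
}
```

The start times are Unix timestamps, which makes it easy to line them up with graphs or logs on the provider's side.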
No Connection At All (100% Packet Loss)
The server has probably crashed. 100% packet loss for 15+ minutes is alone more than sufficient for opening a ticket; no further data is needed.
Submit a Report
So this is an actual issue, which keeps persisting? Time to write it up and contact support!
At the start of the ticket, write a summary of the issue and the supporting evidence. Use concise and clear text, nO MeSsYWriT1ng here. Include why you think there's an actual issue your provider can actually do something about, not just some ephemeral random internet shenanigans.
If the issue is not clearly at your hosting provider, check where you should actually submit the information. Your local ISP? Another hosting provider? In which network does the issue actually reside?
Finally, double check you did all the testing and made it as easy as possible for the network engineer to analyze and check if it warrants further inspection.
Be patient; these issues sometimes take years to resolve (!!). Networking gear is expensive and sometimes it's politics (Comcast, Cogent, Netflix, Net Neutrality ...). A single 100Gbps link may cost 100s of thousands of euros to install; it all depends on whether the equipment is already there, how long the fiber runs are, etc.
Common Misconceptions
Higher Bandwidth Equals Faster Internet
Bandwidth is just the maximum possible data transfer rate; it is only one component of it all. Think about how fast a Ferrari goes during rush hour in Manhattan, New York City. Yeah, not very fast.
Latency is a big factor. A slower connection may regularly outperform a "faster" connection over long distances if its latency is lower. Throughput is a function of latency, and latency is how fast a request can theoretically be returned to you with an answer.
Wired Connection Is Always Stable
Cable issues still exist. Problems can still happen due to bad cables, network jacks, hardware issues.
Speed Test Results Are Absolute
Speed tests offer a snapshot of network performance at a specific time, with a specific target and under specific conditions. These results can vary due to server load, the path taken by data, network congestion, and a myriad of other factors.
Low Latency is Guaranteed on High-Speed Networks
High-speed networks can still experience latency issues, especially if the data travels long distances or through congested networks. Latency depends on more than just the speed of your local connection. Light can only travel so fast, and on top of that there is buffering, bad routes etc.
Packet Loss is Always a Sign of a Bad Connection
Occasional packet loss can happen for a variety of reasons. It is not a marker of connection or networking issues 100% of the time. Testing needs to be done over time, and it depends on the other end. For example, routers, switches etc. regularly drop ICMP Echo packets, as these are handled by the CPU and have the lowest possible priority. However, if a switch/router shows 0% packet loss, that does indicate there are no actual connection issues between the 2 points.
Packet loss becomes an issue when it's consistent and performance is obviously impacted (but not when tested against a router/switch network hop, by itself).
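The rule for reading per-hop loss can be written down as a small check: loss only matters if it is still there at the final hop, and then the interesting point is the hop where the unbroken run of loss begins. The sketch below takes the loss percentage of each hop, in order, one per line (typed in from an MTR report); `classify_hop_loss` and its messages are illustrative, not output from MTR itself.

```shell
# Read per-hop loss percentages (hop order, one per line) and classify them.
# Usage: printf '%s\n' 0 0 0 0 40 0 0 0 0 | classify_hop_loss
classify_hop_loss() {
  awk '{ loss[NR] = $1 + 0 }
       END {
         if (NR == 0) { print "no data"; exit }
         if (loss[NR] == 0) { print "no loss at final hop"; exit }
         first = NR
         while (first > 1 && loss[first - 1] > 0) first--
         printf "loss from hop %d through final hop %d\n", first, NR
       }'
}
```

Nine hops with 40% loss only at hop 5 give "no loss at final hop", which is the harmless router behaviour described above. Five hops with losses of 0, 0, 5, 7 and 6 percent give "loss from hop 3 through final hop 5", which is worth reporting.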
All Network Issues Can Be Resolved by the ISP or Hosting Provider
Not even remotely true. A large portion of issues reside beyond the control of the provider: in 3rd party networks, locally on the customer side, etc. It is actually more likely that the provider cannot do anything about an issue than that it can. Probably around 95% is beyond the control of an ISP or hosting provider.
Any Network Problem Can Be Quickly Fixed
No. Actual network issues can be extremely convoluted and difficult to fix, if it's not just a simple "oh cable failed" level thing. See below for real life examples.
A "simple and easy fix" network issue might actually take a lot of effort to diagnose, find and fix. Sometimes it needs coordination between multiple network operators, following the correct chain of command. Politics might happen as well.
Network Performance is Consistent Across All Applications
Requirements change on a per-application basis, and when measuring application performance a lot more variables than "just" the network enter the picture. Make sure things are isolated for testing. Some applications may open a vast number of connections, others are heavily jitter dependent (VoIP), while others depend on throughput and latency (large data transfers). Some ISPs also shape by the protocol being used.
Real Life Examples of Issues
The issues can be varied, from typical to atypical, to outright dark magic.
Cable Issues
These are the most common: broken clips, broken or loose network port jacks, dirty fiber ends. The vast majority of issues are related to cables. Sometimes all it takes is a speck of dust. With copper, crosstalk is an issue too, but not for short cable runs.
Fiber is very fragile
After swapping the fiber modules on both sides and cleaning and washing the fiber ends, it turned out that at some point that long run of fiber had probably been bent too much. The error rate actually increased after swapping in new optics. Someone gets to crawl through a tight space and replace a long run of fiber ...
ASICS; Application Specific Integrated Circuits. HW Routing and Switching
Designer forgot something; VLANS!
If a hardware vendor's designer forgot some case, even an edge case, it can be a nightmare to debug. Likely no one else in your circle of colleagues knows of the issue, or it has not been spoken about publicly.
We had one peculiar case where speeds between 2 machines were excruciatingly slow, despite all links being non-congested, and speeds were stellar to everywhere else. It turns out one specific condition was forgotten in a really high-end ASIC: what if the target for a packet is actually behind the same physical link it arrived from? These were 2 servers, in different VLANs, under the same distribution switch / top of the rack switch. So the packets travelled upstream to the next aggregation layer, and then had to be sent back down the same physical link, with different VLAN tagging.
This was not accelerated in hardware, and not in the FIB (Forwarding Information Base) of that router, so it dropped down to the CPU. Throughput, at a latency level of 0.1ms, dropped down to just a few Mbps.
Incomplete error reporting and software; Firmware bugs for caching L1/L2/L3
ARP requests would sometimes drop in a single VLAN, but not for all targets and switches, just some. This VLAN had ~200 devices, with half a dozen switches. Randomly, ARP requests would fail or be very significantly delayed.
As a result, 2 nodes in the same network could not reach each other, or had intermittent packet loss / connection drops etc.
There were no error messages or anything of that sort, but by sheer luck it was found that rebooting the aggregation layer restored full performance and reachability. For a random length of time.
After many, many hours, curses, and probably 200hrs of work on this single task by a seasoned veteran (think of the guys charging hundreds per hour!), a hunch arrived: something might be wonky with the FIB (Forwarding Information Base) caches of this particular system. Its partitioning was changed, L3 routes were limited, IPv6 eliminated completely. The issue has never come back.
This is called FIB on some systems and CAM on others. It is the hardware, L1-performance-level caching within the ASIC. It stores things like routes, ARP tables, MAC addresses etc. for quick fetching; these have to be fetched on a nanosecond scale to maintain networking performance, so it has to be ultra low latency.
It turns out the firmware was either miscounting utilization of this cache _or_ the cache was damaged on-chip, most likely something along those lines. Perhaps garbage collection was broken? Regardless, limiting cache use and re-partitioning it to eliminate IPv6 and give more room to ARP and L3 routes fixed this issue. But it took months, and thousands upon thousands of euros, to debug.
Router Power Issues
A small segment of the network was down for an unknown reason: 100% packet loss to a few servers. No alerts of any kind. But remote management was up, and the console showed the servers running normally without issue?
It turns out no alert was generated when the router's linecard did not have sufficient power. This router has ultra-conservative power management: it shuts down linecards if it suspects peak power capacity is not sufficient to power all of them.
One of the 16A 230V feeds to the router had blown a fuse for completely unrelated reasons during the day's maintenance routines. The router was still receiving 4kW worth of PSU capacity and at that time "only" utilizing in the 1800W ballpark (if memory serves correct), so there was free power to be utilized, but the router emphasizes N+1 and peak consumption. So 1 linecard was shut down, fortunately the one with the fewest links on it.