DIY Network Troubleshooting Analyzer — Step‑by‑Step Walkthrough for IT Pros
Goal
Build a lightweight, repeatable analyzer to find root causes of connectivity, performance, and service issues using common tools (ping, traceroute, DNS checks, SNMP/flows, packet captures, logs).
Required tools (assume Linux/BSD/macOS)
- ping, traceroute (or tracepath)
- nslookup / dig
- ip / ifconfig, ss / netstat
- tcpdump, tshark, Wireshark (for PCAP analysis)
- nmap (port/service checks)
- mtr (combined ping/traceroute)
- top/htop, iostat, free (host resource checks)
- SNMP client (snmpwalk) and NetFlow/sFlow collector (optional)
- Log access (syslog, device logs) and SSH
One‑pass workflow (run these in order; record outputs)
1. Define scope & timeline
- Who/what is affected (single host, subnet, app, all users).
- Note start time and recent changes.
2. Quick reachability checks (2–5 min)
- Ping host(s) by IP and by name: check latency, packet loss.
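- traceroute to the target to locate the hop where latency or loss begins. Loss at an intermediate hop that does not persist to the destination is usually ICMP rate limiting on that router, not a fault.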
- mtr for continuous path/loss patterns.
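The loss check above can be scripted so the result is easy to record. The following is a minimal sketch, not a definitive tool: `loss_pct` and `check_loss` are hypothetical helper names, and the default 1% threshold is an assumption to adjust per environment.

```shell
# loss_pct: read ping output on stdin and print the packet-loss percentage.
# Matches both the Linux ("0% packet loss") and BSD/macOS ("0.0% packet loss") wording.
loss_pct() {
  sed -n 's/.*[ ,]\([0-9][0-9.]*\)% packet loss.*/\1/p'
}

# check_loss TARGET [THRESHOLD]: ping TARGET five times and print OK or LOSSY.
check_loss() {
  cl_target="$1"
  cl_threshold="${2:-1}"
  cl_loss="$(ping -c 5 "$cl_target" 2>/dev/null | loss_pct)"
  # No summary line at all (for example, the name did not resolve) counts as total loss.
  cl_loss="${cl_loss:-100}"
  if awk -v l="$cl_loss" -v t="$cl_threshold" 'BEGIN { exit !(l + 0 > t + 0) }'; then
    echo "LOSSY $cl_target ${cl_loss}%"
  else
    echo "OK $cl_target ${cl_loss}%"
  fi
}

# Usage (run once by IP and once by name, as in the step above):
#   check_loss 192.0.2.10
#   check_loss www.example.com 2
```

Keeping the parser separate from the ping call makes it possible to re-run it later against saved ping output from the incident record.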
3. Name and service resolution
- dig + nslookup for A/AAAA/CNAME and MX; compare resolver responses.
- Check DNS TTLs and authoritative responses.
- nmap to verify service ports are open and responding (-sT for an unprivileged TCP connect scan, -sS for a SYN scan, which needs root; choose per environment).
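Comparing resolver responses is easier when the answers are normalized first. This sketch assumes `dig` is installed; `a_records` and `compare_answers` are hypothetical helper names, and the resolver addresses in the usage comment are documentation placeholders.

```shell
# a_records NAME RESOLVER: sorted A-record answers for NAME as seen by RESOLVER.
# With +short, dig also prints any CNAME targets in the chain.
a_records() {
  dig +short +time=2 +tries=1 A "$1" "@$2" | sort
}

# compare_answers LIST1 LIST2: report whether two answer sets are identical.
compare_answers() {
  if [ "$1" = "$2" ]; then
    echo "consistent"
  else
    echo "MISMATCH"
    echo "first resolver:";  printf '%s\n' "$1"
    echo "second resolver:"; printf '%s\n' "$2"
  fi
}

# Usage:
#   compare_answers "$(a_records www.example.com 192.0.2.53)" \
#                   "$(a_records www.example.com 198.51.100.53)"
```

A mismatch is not always a fault: load-balanced or geo-aware names can legitimately return different sets, so treat the output as a prompt to check TTLs and the authoritative servers.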
4. Local host health
- Check IP/config: ip addr / ifconfig.
- TCP state: ss -tunap / netstat -an.
- CPU/memory/disk: top/htop, iostat, free.
- Check ARP table and MAC learning on switches if L2 issues are suspected.
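For the TCP state check, a per-state count is often more telling than the raw socket list (a pile of SYN-SENT or CLOSE-WAIT sockets stands out immediately). A minimal sketch, assuming Linux `ss` output where the first column is the state; `tcp_state_counts` is a hypothetical helper name.

```shell
# tcp_state_counts: read "ss -tan" output on stdin and print "STATE COUNT" lines.
tcp_state_counts() {
  # Skip the header row, count the first column, and sort for stable output.
  awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }' | sort
}

# Usage:
#   ss -tan | tcp_state_counts
```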
5. Interface & link diagnostics
- On switches/routers: check interface counters (errors, CRC, collisions), duplex/speed mismatches.
- Verify VLAN membership and trunk states.
- Review PoE status if relevant.
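Switch and router counters are read through each vendor's CLI, but the endpoint side of the link can be spot-checked the same way. This sketch assumes a Linux host, where the kernel exposes per-interface counters under /sys/class/net/IFACE/statistics; `iface_errors` is a hypothetical helper name and eth0 is an assumed interface name.

```shell
# iface_errors STATS_DIR: print selected error counters from a sysfs statistics directory.
iface_errors() {
  ie_dir="$1"
  for ie_counter in rx_errors tx_errors rx_crc_errors collisions; do
    # Print only the counters that exist and are readable.
    [ -r "$ie_dir/$ie_counter" ] && printf '%s=%s\n' "$ie_counter" "$(cat "$ie_dir/$ie_counter")"
  done
  return 0
}

# Usage:
#   iface_errors /sys/class/net/eth0/statistics
```

Counters are cumulative since boot, so take two readings a few minutes apart and compare them; only a rising value points to a current problem.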
6. Traffic analysis
- Capture short tcpdump on affected host/interface: rotate with size/time limits. Example:
  sudo tcpdump -i eth0 -w /tmp/cap.pcap -c 10000
- Use capture filters to limit noise (host, port, proto).
- Open in Wireshark/tshark to inspect retransmissions, RSTs, TCP window, TLS failures, or malformed packets.
- Use tshark statistics or Wireshark IO graphs for patterns.
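To turn a capture into a quick answer about where retransmissions concentrate, tshark's field output can be ranked per address pair. A minimal sketch, assuming tshark is installed and the capture is IPv4; `rank_pairs` and `retrans_by_pair` are hypothetical helper names.

```shell
# rank_pairs: read "src dst" lines on stdin and print "COUNT src -> dst", busiest first.
rank_pairs() {
  awk 'NF >= 2 { c[$1 " -> " $2]++ } END { for (k in c) print c[k], k }' | sort -rn
}

# retrans_by_pair PCAP: count TCP retransmissions per source/destination pair.
retrans_by_pair() {
  tshark -r "$1" -Y tcp.analysis.retransmission -T fields -e ip.src -e ip.dst | rank_pairs
}

# Usage:
#   retrans_by_pair /tmp/cap.pcap | head
```

A single pair dominating the list suggests a path or host problem; retransmissions spread evenly across many pairs point more toward a shared link or device.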
7. Flow / aggregate visibility
- Query NetFlow/sFlow/IPFIX data to find top talkers, unusual protocols, or traffic spikes.
- Correlate flow peaks with symptom times.
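How flow data is queried depends entirely on the collector, so the following is only a sketch under a stated assumption: the collector can export flows as CSV with a header row and the hypothetical column layout src,dst,bytes. `top_talkers` is a hypothetical helper name.

```shell
# top_talkers [N]: read CSV (header row, then src,dst,bytes) on stdin and
# print the N sources with the most bytes as "BYTES SRC" (default N=5).
top_talkers() {
  tt_n="${1:-5}"
  awk -F, 'NR > 1 { bytes[$1] += $3 }
           END { for (s in bytes) printf "%.0f %s\n", bytes[s], s }' \
    | sort -rn | head -n "$tt_n"
}

# Usage (flows.csv is an assumed export covering the symptom window):
#   top_talkers 10 < flows.csv
```

Running it once on an export from the symptom window and once on a quiet period gives a rough before/after comparison of who is sending the traffic.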
8. SNMP & device events
- snmpwalk for device health: CPU, memory, interface counters, temperature.
- Check device logs and syslog for errors, flaps, BGP/SPF events, ACL denies.
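A walk of an interface error table returns one line per interface, most of them zero. This sketch filters for the non-zero ones; it assumes SNMP v2c, Net-SNMP's `snmpwalk` with IF-MIB loaded, and output lines of the form `IF-MIB::ifInErrors.1 = Counter32: 0`. `nonzero_counters`, `$COMMUNITY`, and `$DEVICE` are hypothetical names.

```shell
# nonzero_counters: read snmpwalk output on stdin and print "OID VALUE"
# for every line whose last field is a number greater than zero.
nonzero_counters() {
  awk '$NF + 0 > 0 { print $1, $NF }'
}

# Usage:
#   snmpwalk -v2c -c "$COMMUNITY" "$DEVICE" IF-MIB::ifInErrors | nonzero_counters
```

The instance suffix (the `.2` in `ifInErrors.2`) is the ifIndex; walk `IF-MIB::ifDescr` on the same device to map it to an interface name.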
9. Application‑level checks
- Verify backend dependencies (DB, auth services) reachable and healthy.
- Reproduce the problem via curl/HTTP client, capturing headers and timings.
- Check application logs for errors tied to network timeouts or retries.
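For the curl reproduction, the `--write-out` timers show which phase of a request is slow. curl reports them as cumulative seconds, so this sketch subtracts neighbours to get per-phase durations. `http_timings` and `phase_breakdown` are hypothetical helper names.

```shell
# http_timings URL: print curl's cumulative timers (seconds) on one line:
#   namelookup connect appconnect starttransfer total
http_timings() {
  curl -s -o /dev/null \
    -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}\n' \
    "$1"
}

# phase_breakdown: convert that line (stdin) into per-phase durations.
# time_appconnect is 0 for plain HTTP, so the TLS phase is then reported as 0.
phase_breakdown() {
  awk '{
    tls_end = ($3 > 0) ? $3 : $2
    printf "dns=%.3f tcp=%.3f tls=%.3f wait=%.3f transfer=%.3f\n",
      $1, $2 - $1, tls_end - $2, $4 - tls_end, $5 - $4
  }'
}

# Usage:
#   http_timings https://www.example.com/ | phase_breakdown
```

A large `wait` with small `dns`, `tcp`, and `tls` values points at the server or its backends rather than the network path.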
10. Correlate & isolate
- Map observed failures to OSI layers: physical/link → IP/routing → transport → application.
- If only one hop/site affected, check local infra; if multiple sites, check upstream ISP or core.
11. Mitigation & validation
- Apply targeted mitigations (route change, interface reset, QoS tweak, firewall rule adjust) only after backup/config capture.
- Re-run the same checks and captures to confirm resolution and collect post‑fix baselines.
12. Document & automate
- Save captures, CLI outputs, and timeline in an incident record.
- Create small scripts to automate the above checks for future incidents (example scripts below).
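One way to keep outputs and timeline together is a timestamped directory per incident with one file per command. A minimal sketch; `new_incident_dir` and `record` are hypothetical helper names, and the base path in the usage comment is an assumption.

```shell
# new_incident_dir BASE LABEL: create BASE/LABEL-<UTC timestamp> and print its path.
new_incident_dir() {
  nid_dir="$1/$2-$(date -u +%Y%m%dT%H%M%SZ)"
  mkdir -p "$nid_dir" && echo "$nid_dir"
}

# record DIR NAME COMMAND...: run COMMAND and save stdout and stderr to DIR/NAME.txt,
# with the command line itself as the first line of the file.
record() {
  rec_out="$1/$2.txt"
  shift 2
  { echo "# $*"; "$@"; } > "$rec_out" 2>&1
}

# Usage:
#   dir="$(new_incident_dir /var/tmp/incidents web-latency)"
#   record "$dir" ping ping -c 5 192.0.2.10
#   record "$dir" sockets ss -tunap
```

UTC timestamps in the directory name make it simpler to line the files up with device logs from other time zones.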
Minimal example scripts
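- Quick host health + network snapshot (Bash; eth0 is an assumed interface name):
Code
#!/bin/bash
# Usage: ./snapshot.sh TARGET   (TARGET is an IP address or hostname)
TARGET="${1:?usage: $0 TARGET}"
OUT="/tmp/nt_analysis_${TARGET}.txt"
# Timestamp first, then append each check to the same report file.
date > "$OUT"
ping -c 5 "$TARGET" >> "$OUT"
traceroute -n "$TARGET" >> "$OUT"
ss -tunap >> "$OUT"
# Capture 100 packets to or from the target; the filter expression goes last.
sudo tcpdump -i eth0 -c 100 -w "/tmp/${TARGET}_cap.pcap" host "$TARGET"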
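- Rotating tcpdump (ring buffer of 10 files, about 100 MB each):
Code
# -C is in units of 1,000,000 bytes; -W caps the number of files, overwriting the oldest.
# strftime patterns in the -w name are expanded only with -G (time-based rotation), so none are used here.
sudo tcpdump -i eth0 -W 10 -C 100 -w /var/tmp/capture.pcap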
Quick troubleshooting checks matrix (when to use)
- Intermittent latency: run mtr; capture TCP retransmits.
- Complete loss: check interface counters, ARP, routing, ACLs, upstream provider.
- DNS failures: query authoritative servers, check resolver config and timeouts.
- Slow app but good network metrics: investigate application threads, DB latency, or proxy issues.
- High bandwidth: use NetFlow/sFlow top-talker reports and QoS queue stats.
Post‑incident recommendations
- Save PCAPs and logs for 30–90 days depending on policy.
- Build synthetic probes (ping, HTTP checks) from multiple locations.
- Establish baseline metrics for latency, loss, utilization.
- Automate the one‑pass workflow as a runbook and add cron/monitor alerts.
- Apply config management and change approval to reduce human‑caused regressions.