春江暮客

春江暮客's personal learning and sharing site

Python Log Triage: Use rg and uv to Find Nginx 5xx Errors Fast

2026-06-03 Technology

When a production service starts failing, the first useful step is usually not building a complex platform. It is answering three questions: when did the errors start, which URLs failed most often, and whether the failures are concentrated around specific clients or upstream services.

If the log is large, opening it in an editor is slow. A more practical workflow is to use rg to filter suspicious lines first, then use a small Python script for the summary. This tutorial uses Nginx access logs and builds a workflow you can copy to a server.

By the end, you will have:

  1. A set of useful rg commands for log searching
  2. A Python analysis script that runs with uv run
  3. A 5xx summary grouped by status code, URL, and IP
  4. Direct fixes for common errors

When this workflow is useful

This method is useful for temporary troubleshooting and lightweight automation:

  1. Nginx, Apache, or application access logs are large
  2. You need to find 500, 502, 503, or 504 quickly
  3. You do not have a full log platform yet, or the platform query is inconvenient
  4. You want to turn one-off troubleshooting commands into a repeatable script

If you already have ELK, Loki, or a cloud logging service, this workflow is still worth keeping. Local server triage is often more direct.

Method 1: Narrow the log with rg first

Create a test directory:

mkdir nginx-log-triage
cd nginx-log-triage

Write a small sample log:

cat > access.log <<'EOF'
203.0.113.10 - - [03/Jun/2026:07:40:01 +0800] "GET / HTTP/1.1" 200 612 "-" "curl/8.0"
203.0.113.11 - - [03/Jun/2026:07:40:03 +0800] "GET /api/orders HTTP/1.1" 502 173 "-" "Mozilla/5.0"
203.0.113.12 - - [03/Jun/2026:07:40:08 +0800] "POST /api/login HTTP/1.1" 500 91 "-" "Mozilla/5.0"
203.0.113.11 - - [03/Jun/2026:07:41:15 +0800] "GET /api/orders HTTP/1.1" 504 173 "-" "Mozilla/5.0"
203.0.113.13 - - [03/Jun/2026:07:42:20 +0800] "GET /assets/app.css HTTP/1.1" 200 2048 "-" "Mozilla/5.0"
203.0.113.14 - - [03/Jun/2026:07:43:11 +0800] "GET /api/orders HTTP/1.1" 502 173 "-" "Mozilla/5.0"
EOF

Find all 5xx lines:

rg -n '" 5[0-9]{2} ' access.log

Only show 502 and 504:

rg -n '" (502|504) ' access.log

Show one line of context before and after each match:

rg -n -C 1 '" 5[0-9]{2} ' access.log

If logs are split across multiple files:

rg -n '" 5[0-9]{2} ' /var/log/nginx -g '*.log'

The goal of this step is not the final report. It is to quickly confirm whether the errors exist, where they are concentrated, and whether deeper counting is needed.

Method 2: Run a Python summary script with uv

rg is great for fast filtering. If you need to count the busiest URLs and client IPs, a short Python script is easier to get right and to rerun than a chain of shell pipes.

Create a script:

uv init --script log_report.py --python 3.12

Replace log_report.py with this:

# /// script
# requires-python = ">=3.12"
# ///

from __future__ import annotations

import argparse
import re
from collections import Counter
from pathlib import Path


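# Matches the leading fields of the Nginx combined log format:
# client IP, two ignored fields, [timestamp], "METHOD path protocol", status, size.
LOG_PATTERN = re.compile(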
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]+" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def iter_records(path: Path):
    with path.open(encoding="utf-8", errors="replace") as file:
        for line_number, line in enumerate(file, start=1):
            match = LOG_PATTERN.search(line)
            if not match:
                continue
            record = match.groupdict()
            record["line"] = str(line_number)
            yield record


def main() -> None:
    parser = argparse.ArgumentParser(description="Summarize Nginx 5xx access log entries.")
    parser.add_argument("log_file", type=Path)
    parser.add_argument("--status", default="5", help="Status prefix, for example 5 or 50")
    parser.add_argument("--top", type=int, default=5)
    args = parser.parse_args()

    status_count: Counter[str] = Counter()
    path_count: Counter[str] = Counter()
    ip_count: Counter[str] = Counter()
    first_seen: str | None = None
    last_seen: str | None = None

    for record in iter_records(args.log_file):
        status = record["status"]
        if not status.startswith(args.status):
            continue

        status_count[status] += 1
        path_count[record["path"]] += 1
        ip_count[record["ip"]] += 1
        first_seen = first_seen or record["time"]
        last_seen = record["time"]

    total = sum(status_count.values())
    print(f"Total matched requests: {total}")
    print(f"Time range: {first_seen or 'n/a'} -> {last_seen or 'n/a'}")

    print("\nStatus:")
    for status, count in status_count.most_common():
        print(f"  {status}: {count}")

    print("\nTop paths:")
    for path, count in path_count.most_common(args.top):
        print(f"  {count:>4}  {path}")

    print("\nTop client IPs:")
    for ip, count in ip_count.most_common(args.top):
        print(f"  {count:>4}  {ip}")


if __name__ == "__main__":
    main()

Run it:

uv run log_report.py access.log

Validate the output

For the sample log, the output should look like this:

Total matched requests: 4
Time range: 03/Jun/2026:07:40:03 +0800 -> 03/Jun/2026:07:43:11 +0800

Status:
  502: 2
  500: 1
  504: 1

Top paths:
     3  /api/orders
     1  /api/login

Top client IPs:
     2  203.0.113.11
     1  203.0.113.12
     1  203.0.113.14

This already gives you useful next steps:

  1. /api/orders is the most concentrated failing URL
  2. 502 appears more often than 500, so upstream health or reverse proxy behavior should be checked first
  3. The failures happened between 07:40 and 07:43, so you can compare that range with application logs and deployment time
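
To see when the errors started, and not only the first and last timestamp, you can bucket the matches by minute. The following is a minimal sketch, not part of log_report.py: the function name errors_per_minute and the shortened pattern LINE are illustrative, and it assumes the same combined log layout as the sample.

```python
import re
from collections import Counter

# Assumes the sample layout: [timestamp] "request line" status size
LINE = re.compile(r'\[(?P<time>[^\]]+)\] "[^"]*" (?P<status>\d{3}) ')


def errors_per_minute(lines, status_prefix="5"):
    """Count matching requests per minute, keyed like '03/Jun/2026:07:40'."""
    buckets = Counter()
    for line in lines:
        match = LINE.search(line)
        if match and match["status"].startswith(status_prefix):
            # '03/Jun/2026:07:40:03 +0800' -> '03/Jun/2026:07:40'
            minute = match["time"].rsplit(":", 1)[0]
            buckets[minute] += 1
    return buckets


# A few lines taken from the sample access.log above.
sample = [
    '203.0.113.10 - - [03/Jun/2026:07:40:01 +0800] "GET / HTTP/1.1" 200 612 "-" "curl/8.0"',
    '203.0.113.11 - - [03/Jun/2026:07:40:03 +0800] "GET /api/orders HTTP/1.1" 502 173 "-" "Mozilla/5.0"',
    '203.0.113.12 - - [03/Jun/2026:07:40:08 +0800] "POST /api/login HTTP/1.1" 500 91 "-" "Mozilla/5.0"',
    '203.0.113.11 - - [03/Jun/2026:07:41:15 +0800] "GET /api/orders HTTP/1.1" 504 173 "-" "Mozilla/5.0"',
    '203.0.113.14 - - [03/Jun/2026:07:43:11 +0800] "GET /api/orders HTTP/1.1" 502 173 "-" "Mozilla/5.0"',
]

# Counter keeps insertion order, so a log written in time order prints in time order.
for minute, count in errors_per_minute(sample).items():
    print(f"{minute}  {count}")
```

On a real log, pass an open file handle instead of the sample list. A minute with a sudden jump in count is a good candidate to compare against deployment time.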

Method 3: Use it with real server logs

On a server, copy the script and run it against the real log:

uv run log_report.py /var/log/nginx/access.log

Only count 502:

uv run log_report.py /var/log/nginx/access.log --status 502

Show a larger ranking:

uv run log_report.py /var/log/nginx/access.log --top 20

If you only want to analyze recently appended lines, create a temporary file first:

tail -n 20000 /var/log/nginx/access.log > recent-access.log
uv run log_report.py recent-access.log

This avoids repeatedly scanning the full log and makes incident triage faster.
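
If you would rather not leave temporary files on the server, the same "last N lines" idea can be sketched in Python with a bounded deque. This is an assumption about how you might extend the script, not something log_report.py does today, and last_lines is a hypothetical helper name. Unlike tail, it still reads the file from the start; it only keeps memory bounded.

```python
from collections import deque


def last_lines(lines, limit=20000):
    """Return the last `limit` items from any iterable of lines."""
    # A deque with maxlen drops the oldest item whenever a new one is appended.
    return list(deque(lines, maxlen=limit))


# Small demonstration with an in-memory list instead of a real log file.
demo = [f"line {number}\n" for number in range(1, 6)]
print(last_lines(demo, limit=2))
```

With a real log, pass an open file handle, for example last_lines(file, limit=20000) inside a with block, and feed the result to the parsing loop.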

Troubleshooting flow

After you have the summary, continue in this order:

  1. Many 500 responses: check application error logs, stack traces, and database connection errors
  2. Many 502 responses: check whether the upstream service is alive, whether the port is correct, and whether reverse proxy timeouts are involved
  3. Many 503 responses: check rate limits, maintenance mode, and whether the service pool has available instances
  4. Many 504 responses: check slow queries, external APIs, upstream response time, and Nginx timeout settings
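
If you want the summary itself to print these hints, the list above can be turned into a small lookup table. This is a rough sketch: NEXT_CHECKS and suggest_checks are illustrative names, and the hint text is copied from the list, not from any Nginx documentation.

```python
from collections import Counter

# Hints copied from the troubleshooting list above, keyed by status code.
NEXT_CHECKS = {
    "500": "application error logs, stack traces, database connection errors",
    "502": "upstream service alive, upstream port, reverse proxy timeouts",
    "503": "rate limits, maintenance mode, available instances in the service pool",
    "504": "slow queries, external APIs, upstream response time, Nginx timeout settings",
}


def suggest_checks(status_count):
    """Yield (status, count, hint), most frequent status first."""
    for status, count in status_count.most_common():
        hint = NEXT_CHECKS.get(status, "no specific hint; start with the Nginx error log")
        yield status, count, hint


# Status counts from the sample run earlier in this article.
for status, count, hint in suggest_checks(Counter({"502": 2, "500": 1, "504": 1})):
    print(f"{status} x{count}: check {hint}")
```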

For example, check the Nginx error log first:

rg -n "upstream|timeout|connect\\(\\) failed|refused" /var/log/nginx/error.log

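Then search the application log for the same time window or for error markers. The pattern below is an OR, so it also matches errors outside 07:40 to 07:43: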

rg -n "07:4[0-3]|ERROR|Traceback|Exception" /path/to/app.log

Common errors

1. rg does not find any 5xx lines

First confirm whether your log format has spaces around the status code. The commands in this article match this part of the Nginx combined log format:

"GET /api/orders HTTP/1.1" 502 173

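If your log is JSON, search the JSON field directly. This pattern assumes the field is named status, and it allows an optional space and an optional quote before the value:

rg -n '"status":\s*"?5[0-9]{2}\b' access.jsonl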
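
For JSON logs, counting is simpler in Python than with a regular expression, because the parser handles spacing and quoting for you. The sketch below assumes one JSON object per line and a field named status, as in the rg command above; your log_format may use a different name, so count_json_statuses takes the field name as a parameter. The function name is illustrative.

```python
import json
from collections import Counter


def count_json_statuses(lines, status_prefix="5", field="status"):
    """Count status codes that start with status_prefix in JSON-lines input."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(entry, dict):
            continue  # skip bare values such as numbers or lists
        # The status may be logged as a number or as a string.
        status = str(entry.get(field, ""))
        if status.startswith(status_prefix):
            counts[status] += 1
    return counts


# Made-up lines for illustration only.
sample = [
    '{"status": 502}',
    '{"status": "500"}',
    '{"status": 200}',
    'not json at all',
]
print(count_json_statuses(sample))
```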

2. The Python script reports 0 matched requests

The usual cause is a log format mismatch. Print one real line first:

head -n 1 access.log

If the field order differs from the sample, adjust LOG_PATTERN. During an incident, you can also use rg and awk first instead of building a universal parser immediately.
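
Before changing LOG_PATTERN, it helps to see which lines it rejects. The sketch below repeats the pattern from log_report.py and adds a hypothetical helper, first_unmatched, that returns the first few lines the pattern cannot parse.

```python
import re

# Same pattern as in log_report.py.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]+" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def first_unmatched(lines, limit=3):
    """Return up to `limit` (line_number, text) pairs that LOG_PATTERN rejects."""
    misses = []
    for line_number, line in enumerate(lines, start=1):
        # Blank lines are not interesting, so only report non-empty misses.
        if line.strip() and not LOG_PATTERN.search(line):
            misses.append((line_number, line.rstrip("\n")))
            if len(misses) >= limit:
                break
    return misses


# One line in the sample format and one made-up JSON line that will not match.
sample = [
    '203.0.113.11 - - [03/Jun/2026:07:40:03 +0800] "GET /api/orders HTTP/1.1" 502 173 "-" "Mozilla/5.0"\n',
    '{"status":502}\n',
]
for line_number, text in first_unmatched(sample):
    print(f"line {line_number} did not match: {text}")
```

If every line is reported, the format differs from the combined layout. If only a few are, they are probably malformed requests that the script can safely skip.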

3. Permission is denied when reading logs

If the current user cannot read /var/log/nginx/access.log, check permissions:

ls -l /var/log/nginx/access.log

For temporary troubleshooting, read the log with sudo. The redirect runs in your own shell, so recent-access.log is created as your user and needs no chown:

sudo tail -n 20000 /var/log/nginx/access.log > recent-access.log
uv run log_report.py recent-access.log

4. Logs are compressed

rg does not search .gz content by default (its -z/--search-zip flag turns that on). For a quick check, zgrep also works:

zgrep -n '" 5[0-9][0-9] ' /var/log/nginx/access.log.1.gz | head

If you need the Python summary, decompress to a temporary file first:

gzip -dc /var/log/nginx/access.log.1.gz > old-access.log
uv run log_report.py old-access.log
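
As an alternative to decompressing first, Python's standard gzip module can read the compressed file as text. The helper below is a sketch under the assumption that rotated logs end in .gz; open_log is a hypothetical name and is not part of the script above.

```python
import gzip
from pathlib import Path


def open_log(path):
    """Open a plain or gzip-compressed log file as text."""
    path = Path(path)
    if path.suffix == ".gz":
        # "rt" makes gzip.open return decoded text lines instead of bytes.
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return path.open(encoding="utf-8", errors="replace")
```

To use it, replace path.open(encoding="utf-8", errors="replace") in iter_records with open_log(path). The script should then accept access.log and access.log.1.gz alike.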

Summary

Log triage does not always need a complex platform first. rg quickly finds suspicious lines in large files, Python turns those clues into status, URL, and IP rankings, and uv makes the script easy to run on a new machine.

This workflow is useful for temporary server incidents and for small internal tools. The next time you see a 5xx spike, use these commands to narrow the problem before deciding whether to inspect the application, database, or upstream service.
