Observability

Tutorials for monitoring and observing AgentHub systems

Observability Tutorials

Learn how to monitor, trace, and observe your AgentHub deployments with comprehensive observability features.

Available Tutorials

1 - Interactive Dashboard Tour

Take a guided tour through AgentHub's Grafana dashboards while the system is running, learning to interpret metrics, identify issues, and understand system behavior in real-time.

Interactive Dashboard Tour

Learn by doing: Take a guided tour through AgentHub's Grafana dashboards while the system is running, learning to interpret metrics, identify issues, and understand system behavior in real-time.

Prerequisites

  • Observability stack running (from the Observability Demo)
  • Observable agents running (broker, publisher, subscriber)
  • Grafana open at http://localhost:3333
  • 10-15 minutes for the complete tour

Quick Setup Reminder

If you haven't completed the observability demo yet:

# Start observability stack
cd agenthub/observability
docker-compose up -d

# Run observable agents (3 terminals)
go run broker/main.go
go run agents/subscriber/main.go
go run agents/publisher/main.go

Dashboard Navigation

Accessing the Main Dashboard

  1. Open Grafana: http://localhost:3333
  2. Login: admin / admin (skip password change for demo)
  3. Navigate: Dashboards → Browse → AgentHub → "AgentHub EDA System Observatory"
  4. Bookmark: Save this URL for quick access: http://localhost:3333/d/agenthub-eda-dashboard

Dashboard Layout Overview

The dashboard is organized in 4 main rows:

🎯 Row 1: Event Processing Overview
├── Event Processing Rate (events/sec)
└── Event Processing Error Rate (%)

📊 Row 2: Event Analysis
├── Event Types Distribution (pie chart)
└── Event Processing Latency (p50, p95, p99)

🔍 Row 3: Distributed Tracing
└── Jaeger Integration Panel

💻 Row 4: System Health
├── Service CPU Usage (%)
├── Service Memory Usage (MB)
├── Go Goroutines Count
└── Service Health Status

Interactive Tour

Tour 1: Understanding Event Flow (3 minutes)

Step 1: Watch the Event Processing Rate

Location: Top-left panel
What to observe: Real-time lines showing events per second

  1. Identify the services:

    • Green line: agenthub-broker (should be highest - processes all events)
    • Blue line: agenthub-publisher (events being created)
    • Orange line: agenthub-subscriber (events being processed)
  2. Watch the pattern:

    • Publisher creates bursts of events
    • Broker immediately processes them (routing)
    • Subscriber processes them shortly after
  3. Understand the flow:

    Publisher (creates) → Broker (routes) → Subscriber (processes)
         50/sec      →      150/sec     →      145/sec
    

💡 Tour Insight: The broker rate is higher because it processes both incoming tasks AND outgoing results.

Step 2: Monitor Error Rates

Location: Top-right panel (gauge)
What to observe: Error percentage gauge

  1. Healthy system: Should show 0-2% (green zone)

  2. If you see higher errors:

    • Check if all services are running
    • Look for red traces in Jaeger (we'll do this next)
  3. Error rate calculation:

    Error Rate = (Failed Events / Total Events) × 100
    

🎯 Action: Note your current error rate - we'll compare it later.
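As a quick worked example of that formula: if 3 of the last 150 events failed, the gauge reads (3 / 150) × 100 = 2%, right at the top of the green zone.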

Tour 2: Event Analysis Deep Dive (3 minutes)

Step 3: Explore Event Types

Location: Middle-left panel (pie chart)
What to observe: Distribution of different event types

  1. Identify event types:

    • greeting: Most common (usually 40-50%)
    • math_calculation: Compute-heavy tasks (30-40%)
    • random_number: Quick tasks (15-25%)
    • unknown_task: Error-generating tasks (2-5%)
  2. Business insights:

    • Larger slices = more frequent tasks
    • Small red slice = intentional error tasks for testing

💡 Tour Insight: The publisher randomly generates different task types to simulate real-world workload diversity.

Step 4: Analyze Processing Latency

Location: Middle-right panel
What to observe: Three latency lines (p50, p95, p99)

  1. Understand percentiles:

    • p50 (blue): 50% of events process faster than this
    • p95 (green): 95% of events process faster than this
    • p99 (red): 99% of events process faster than this
  2. Healthy ranges:

    • p50: < 50ms (very responsive)
    • p95: < 200ms (good performance)
    • p99: < 500ms (acceptable outliers)
  3. Pattern recognition:

    • Spiky p99 = occasional slow tasks (normal)
    • Rising p50 = systemic slowdown (investigate)
    • Flat lines = no activity or measurement issues
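To make the percentiles concrete: a p95 of 180ms means 95 out of every 100 events completed in 180ms or less, while only the slowest 5 took longer.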

🎯 Action: Hover over the lines to see exact values at different times.

Tour 3: Distributed Tracing Exploration (4 minutes)

Step 5: Jump into Jaeger

Location: Middle section - "Distributed Traces" panel
Action: Click the "Explore" button

This opens Jaeger in a new tab. Let's explore:

  1. In Jaeger UI:

    • Service dropdown: Select "agenthub-broker"
    • Operation: Leave as "All"
    • Click "Find Traces"
  2. Pick a trace to examine:

    • Look for traces that show multiple spans
    • Click on any trace line to open details
  3. Understand the trace structure:

    Timeline View:
    agenthub-publisher: publish_event [2ms]
      └── agenthub-broker: process_event [1ms]
          └── agenthub-subscriber: consume_event [3ms]
              └── agenthub-subscriber: process_task [15ms]
                  └── agenthub-subscriber: publish_result [2ms]
    
  4. Explore span details:

    • Click individual spans to see:
      • Tags: event_type, event_id, agent names
      • Process: Which service handled the span
      • Duration: Exact timing information

💡 Tour Insight: Each event creates a complete "trace" showing its journey from creation to completion.
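Those tags don't appear by magic: each service attaches them as span attributes when it creates the span. Here is a minimal sketch of that pattern using the standard OpenTelemetry Go API; the function and attribute keys are illustrative, not AgentHub's exact code:

package tour

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// publishEvent starts a span and attaches the tags you just saw in
// Jaeger (event_id, event_type, agent name) as span attributes.
func publishEvent(ctx context.Context, eventID, eventType string) {
	ctx, span := otel.Tracer("agenthub-publisher").Start(ctx, "publish_event")
	defer span.End()

	span.SetAttributes(
		attribute.String("event_id", eventID),
		attribute.String("event_type", eventType),
		attribute.String("agent", "agenthub-publisher"),
	)
	// The event would be published here, passing ctx along so the
	// broker's span becomes a child of this one.
	_ = ctx
}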

Step 6: Find and Analyze an Error

  1. Search for error traces:

    • In Jaeger, add tag filter: error=true
    • Or look for traces with red spans
  2. Examine the error trace:

    • Red spans indicate errors
    • Error tags show the error type and message
    • Stack traces help with debugging
  3. Follow the error propagation:

    • See how errors affect child spans
    • Notice error context in span attributes

🎯 Action: Find a trace with "unknown_task" event type - these are designed to fail for demonstration.

Tour 4: System Health Monitoring (3 minutes)

Step 7: Monitor Resource Usage

Location: Bottom row panels
What to observe: System resource consumption

  1. CPU Usage Panel (Bottom-left):

    • Normal range: 10-50% for demo workload
    • Watch for: Sustained high CPU (>70%)
    • Services comparison: See which service uses most CPU
  2. Memory Usage Panel (Bottom-center-left):

    • Normal range: 30-80MB per service for demo
    • Watch for: Continuously growing memory (memory leaks)
    • Pattern: Sawtooth = normal GC, steady growth = potential leak
  3. Goroutines Panel (Bottom-center-right):

    • Normal range: 10-50 goroutines per service
    • Watch for: Continuously growing count (goroutine leaks)
    • Pattern: Stable baseline with activity spikes

Step 8: Verify Service Health

Location: Bottom-right panel
What to observe: Service up/down status

  1. Health indicators:

    • Green: Service healthy and responding
    • Red: Service down or health check failing
    • Yellow: Service degraded but operational
  2. Health check details:

    • Each service exposes /health endpoint
    • Prometheus monitors these endpoints
    • Dashboard shows aggregated status

🎯 Action: Open http://localhost:8080/health in a new tab to see raw health data.
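For context, an endpoint like this is typically just a tiny HTTP handler. The sketch below is illustrative and assumes a plain net/http server; AgentHub's actual handler may return richer status fields:

package main

import (
	"encoding/json"
	"net/http"
)

// healthHandler answers health probes with HTTP 200 and a small JSON
// body, which is the contract the dashboard's health panel relies on.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]string{"status": "healthy"})
}

func main() {
	http.HandleFunc("/health", healthHandler)
	// Port 8080 matches the broker's health endpoint in this demo.
	http.ListenAndServe(":8080", nil)
}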

Tour 5: Time-based Analysis (2 minutes)

Step 9: Change Time Ranges

Location: Top-right of dashboard (time picker)
Current: Likely showing "Last 5 minutes"

  1. Try different ranges:

    • Last 15 minutes: See longer trends
    • Last 1 hour: See full demo session
    • Custom range: Pick specific time period
  2. Observe pattern changes:

    • Longer ranges: Show trends and patterns
    • Shorter ranges: Show real-time detail
    • Custom ranges: Zoom into specific incidents

Step 10: Use Dashboard Filters

Location: Top of dashboard - variable dropdowns

  1. Service Filter:

    • Select "All" to see everything
    • Pick specific service to focus analysis
    • Useful for isolating service-specific issues
  2. Event Type Filter:

    • Filter to specific event types
    • Compare performance across task types
    • Identify problematic event categories

💡 Tour Insight: Filtering helps you drill down from system-wide view to specific components or workloads.

Hands-on Experiments

Experiment 1: Create a Service Outage

Goal: See how the dashboard shows service failures

  1. Stop the subscriber:

    # In subscriber terminal, press Ctrl+C
    
  2. Watch the dashboard changes:

    • Error rate increases (top-right gauge turns red)
    • Subscriber metrics disappear from bottom panels
    • Service health shows subscriber as down
  3. Check Jaeger for failed traces:

    • Look for traces that don't complete
    • See where the chain breaks
  4. Restart subscriber:

    go run agents/subscriber/main.go
    

🎯 Learning: Dashboard immediately shows impact of service failures.

Experiment 2: Generate High Load

Goal: See system behavior under stress

  1. Modify publisher to generate more events:

    # Edit agents/publisher/main.go
    # Change: time.Sleep(5 * time.Second)
    # To:     time.Sleep(1 * time.Second)
    
  2. Watch dashboard changes:

    • Processing rate increases
    • Latency may increase
    • CPU/memory usage grows
  3. Observe scaling behavior:

    • How does the system handle increased load?
    • Do error rates increase?
    • Where are the bottlenecks?

🎯 Learning: Dashboard shows system performance characteristics under load.

Dashboard Interpretation Guide

What Good Looks Like

✅ Event Processing Rate: Steady activity matching workload
✅ Error Rate: < 5% (green zone)
✅ Event Types: Expected distribution
✅ Latency: p95 < 200ms, p99 < 500ms
✅ CPU Usage: < 50% sustained
✅ Memory: Stable or slow growth with GC cycles
✅ Goroutines: Stable baseline with activity spikes
✅ Service Health: All services green/up

Warning Signs

⚠️ Error Rate: 5-10% (yellow zone)
⚠️ Latency: p95 > 200ms or rising trend
⚠️ CPU: Sustained > 70%
⚠️ Memory: Continuous growth without GC
⚠️ Missing data: Gaps in metrics (service issues)

Critical Issues

🚨 Error Rate: > 10% (red zone)
🚨 Latency: p95 > 500ms
🚨 CPU: Sustained > 90%
🚨 Memory: Rapid growth or OOM
🚨 Service Health: Any service showing red/down
🚨 Traces: Missing or broken trace chains

Next Steps After the Tour

For Daily Operations:

  • Bookmark: Save dashboard URL for quick access
  • Set up alerts: Configure notifications for critical metrics
  • Create views: Use filters to create focused views for your team

For Development:

For Deep Understanding:

Troubleshooting Tour Issues

Issue                     Solution
Dashboard shows no data   Verify observability environment variables are set
Grafana won't load        Check docker-compose ps in observability/
Metrics missing           Verify Prometheus targets at http://localhost:9090/targets
Jaeger empty              Ensure trace context propagation is working

🎉 Congratulations! You've completed the interactive dashboard tour and learned to read AgentHub's observability signals like a pro!

🎯 Ready for More?

Master the Tools: Use Grafana Dashboards - Advanced dashboard usage

Troubleshoot Issues: Debug with Distributed Tracing - Use Jaeger effectively

2 - AgentHub Observability Demo Tutorial

Experience the complete observability stack with distributed tracing, real-time metrics, and intelligent alerting in under 10 minutes through hands-on learning.

AgentHub Observability Demo Tutorial

Learn by doing: Experience the complete observability stack with distributed tracing, real-time metrics, and intelligent alerting in under 10 minutes.

What You’ll Learn

By the end of this tutorial, you'll have:

  • ✅ Seen distributed traces flowing across multiple agents
  • ✅ Monitored real-time metrics in beautiful Grafana dashboards
  • ✅ Understood event correlation through trace IDs
  • ✅ Experienced intelligent alerting when things go wrong
  • ✅ Explored the complete observability stack components

Prerequisites

  • Go 1.24+ installed
  • Docker and Docker Compose installed
  • Environment variables configured (see Installation and Setup)
  • 10 minutes of your time
  • Basic terminal knowledge

💡 Environment Note: AgentHub agents automatically enable observability when JAEGER_ENDPOINT is configured. See Environment Variables Reference for all configuration options.

Step 1: Clone and Setup (1 minute)

# Clone the repository
git clone https://github.com/owulveryck/agenthub.git
cd agenthub

# Verify you have the observability files
ls observability/
# You should see: docker-compose.yml, grafana/, prometheus/, etc.

Step 2: Start the Observability Stack (2 minutes)

# Navigate to observability directory
cd observability

# Start all monitoring services
docker-compose up -d

# Verify services are running
docker-compose ps

Expected Output:

NAME                    COMMAND                  SERVICE             STATUS
agenthub-grafana        "/run.sh"                grafana             running
agenthub-jaeger         "/go/bin/all-in-one"     jaeger              running
agenthub-prometheus     "/bin/prometheus --c…"   prometheus          running
agenthub-otel-collector "/otelcol-contrib --…"   otel-collector      running

🎯 Checkpoint 1: All services should be "running". If not, check Docker logs: docker-compose logs <service-name>

Step 3: Access the Dashboards (1 minute)

Open these URLs in your browser (keep them open in tabs):

Service      URL                      Purpose
Grafana      http://localhost:3333    Main observability dashboard
Jaeger       http://localhost:16686   Distributed tracing
Prometheus   http://localhost:9090    Raw metrics and alerts

Grafana Login: admin / admin (skip password change for demo)

🎯 Checkpoint 2: You should see Grafana's welcome page and Jaeger's empty trace list.

Step 4: Start the Observable Broker (1 minute)

Open a new terminal and navigate back to the project root:

# From agenthub root directory
go run broker/main.go

Expected Output:

time=2025-09-28T21:00:00.000Z level=INFO msg="Starting health server on port 8080"
time=2025-09-28T21:00:00.000Z level=INFO msg="AgentHub broker gRPC server with observability listening" address="[::]:50051" health_endpoint="http://localhost:8080/health" metrics_endpoint="http://localhost:8080/metrics"

🎯 Checkpoint 3:

  • Broker is listening on port 50051
  • Health endpoint available at http://localhost:8080/health
  • Metrics endpoint available at http://localhost:8080/metrics

Step 5: Start the Observable Subscriber (1 minute)

Open another terminal:

go run agents/subscriber/main.go

Expected Output:

time=2025-09-28T21:00:01.000Z level=INFO msg="Starting health server on port 8082"
time=2025-09-28T21:00:01.000Z level=INFO msg="Starting observable subscriber"
time=2025-09-28T21:00:01.000Z level=INFO msg="Agent started with observability. Listening for events and tasks."

🎯 Checkpoint 4:

  • Subscriber is connected and listening
  • Health available at http://localhost:8082/health

Step 6: Generate Events with the Publisher (2 minutes)

Open a third terminal:

go run agents/publisher/main.go

Expected Output:

time=2025-09-28T21:00:02.000Z level=INFO msg="Starting health server on port 8081"
time=2025-09-28T21:00:02.000Z level=INFO msg="Starting observable publisher demo"
time=2025-09-28T21:00:02.000Z level=INFO msg="Publishing task" task_id=task_greeting_1727557202 task_type=greeting responder_agent_id=agent_demo_subscriber
time=2025-09-28T21:00:02.000Z level=INFO msg="Task published successfully" task_id=task_greeting_1727557202 task_type=greeting

🎯 Checkpoint 5: You should see:

  • Publisher creating and sending tasks
  • Subscriber receiving and processing tasks
  • Broker routing messages between them

Step 7: Explore Real-time Metrics in Grafana (2 minutes)

  1. Go to Grafana: http://localhost:3333
  2. Navigate to Dashboards → Browse → AgentHub → "AgentHub EDA System Observatory"
  3. Observe the real-time data:

What You'll See:

Event Processing Rate (Top Left)

  • Lines showing events/second for each service
  • Should show activity spikes when publisher runs

Error Rate (Top Right)

  • Gauge showing error percentage
  • Should be green (< 5% errors)

Event Types Distribution (Middle Left)

  • Pie chart showing task types: greeting, math_calculation, random_number
  • Different colors for each task type

Processing Latency (Middle Right)

  • Three lines: p50, p95, p99 latencies
  • Should show sub-second processing times

System Health (Bottom)

  • CPU usage, memory usage, goroutines
  • Service health status (all should be UP)

🎯 Checkpoint 6: Dashboard should show live metrics with recent activity.

Step 8: Explore Distributed Traces in Jaeger (2 minutes)

  1. Go to Jaeger: http://localhost:16686
  2. Select Service: Choose "agenthub-broker" from dropdown
  3. Click "Find Traces"
  4. Click on any trace to see details

What You'll See:

Complete Event Journey:

agenthub-publisher: publish_event (2ms)
  └── agenthub-broker: process_event (1ms)
      └── agenthub-subscriber: consume_event (5ms)
          └── agenthub-subscriber: process_task (15ms)
              └── agenthub-subscriber: publish_result (2ms)

Trace Details:

  • Span Tags: event_id, event_type, service names
  • Timing Information: Exact start/end times and durations
  • Log Correlation: Each span linked to structured logs

Error Detection:

  • Look for red spans indicating errors
  • Trace the "unknown_task" type to see how errors propagate

🎯 Checkpoint 7: You should see complete traces showing the full event lifecycle.

Step 9: Correlate Logs with Traces (1 minute)

  1. Copy a trace ID from Jaeger (the long hex string)

  2. Check broker logs for that trace ID:

    # In your broker terminal, look for lines like:
    time=2025-09-28T21:00:02.000Z level=INFO msg="Received task request" task_id=task_greeting_1727557202 trace_id=a1b2c3d4e5f6...
    
  3. Check subscriber logs for the same trace ID

🎯 Checkpoint 8: You should find the same trace_id in logs across multiple services.
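When you instrument your own agents, this correlation comes from writing the active span's trace ID into every log line. A minimal sketch, assuming the OpenTelemetry Go SDK and log/slog rather than AgentHub's exact logging helper:

package agent

import (
	"context"
	"log/slog"

	"go.opentelemetry.io/otel/trace"
)

// logWithTrace emits a structured log line carrying the current trace
// ID, so the entry can be matched to the same trace in Jaeger.
func logWithTrace(ctx context.Context, msg string, args ...any) {
	if sc := trace.SpanContextFromContext(ctx); sc.IsValid() {
		args = append(args, "trace_id", sc.TraceID().String())
	}
	slog.InfoContext(ctx, msg, args...)
}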

Step 10: Experience Intelligent Alerting (Optional)

To see alerting in action:

  1. Simulate errors by stopping the subscriber:

    # In subscriber terminal, press Ctrl+C
    
  2. Keep publisher running (it will fail to process tasks)

  3. Check Prometheus alerts:

    • Go to http://localhost:9090/alerts
    • After ~5 minutes, you should see "HighEventProcessingErrorRate" firing
  4. Restart subscriber to clear the alert

🎉 Congratulations!

You've successfully experienced the complete AgentHub observability stack!

Summary: What You Accomplished

✅ Deployed a complete observability stack with Docker Compose
✅ Ran observable agents with automatic instrumentation
✅ Monitored real-time metrics in Grafana dashboards
✅ Traced event flows across multiple services with Jaeger
✅ Correlated logs with traces using trace IDs
✅ Experienced intelligent alerting with Prometheus
✅ Understood the complete event lifecycle from publisher to subscriber

Key Observability Concepts You Learned

Distributed Tracing

  • Events get unique trace IDs that follow them everywhere
  • Each processing step creates a "span" with timing information
  • Complete request flows are visible across service boundaries
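In code, creating one of those spans looks roughly like the sketch below, which uses the standard OpenTelemetry Go API (AgentHub's own wrapper may differ):

package agent

import (
	"context"

	"go.opentelemetry.io/otel"
)

// processTask wraps one processing step in its own span. Because ctx
// already carries the incoming trace context, the new span nests under
// its parent and shares the same trace ID in Jaeger.
func processTask(ctx context.Context, work func(context.Context) error) error {
	ctx, span := otel.Tracer("agenthub-subscriber").Start(ctx, "process_task")
	defer span.End()
	return work(ctx)
}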

Metrics Collection

  • 47+ different metrics automatically collected
  • Real-time visualization of system health and performance
  • Historical data for trend analysis
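As a sketch of how one such metric is produced with the OpenTelemetry metrics API (the meter and counter names here are illustrative, not necessarily the ones AgentHub registers):

package agent

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// countEvent increments a counter labeled by event type; Prometheus
// scrapes the aggregated values from the service's /metrics endpoint.
func countEvent(ctx context.Context, eventType string) error {
	counter, err := otel.Meter("agenthub-subscriber").
		Int64Counter("events_processed_total")
	if err != nil {
		return err
	}
	counter.Add(ctx, 1, metric.WithAttributes(
		attribute.String("event_type", eventType),
	))
	return nil
}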

Structured Logging

  • All logs include trace context for correlation
  • Consistent format across all services
  • Easy debugging and troubleshooting

Intelligent Alerting

  • Proactive monitoring for error rates and performance
  • Automatic notifications when thresholds are exceeded
  • Helps prevent issues before they impact users

Next Steps

For Development:

For Operations:

For Understanding:

Troubleshooting

Issue                   Solution
Services won't start    Run docker-compose down && docker-compose up -d
No metrics in Grafana   Check Prometheus targets: http://localhost:9090/targets
No traces in Jaeger     Verify JAEGER_ENDPOINT environment variable is set correctly
Permission errors       Ensure Docker has proper permissions

Clean Up

When you're done exploring:

# Stop the observability stack
cd observability
docker-compose down

# Stop the Go applications
# Press Ctrl+C in each terminal running the agents

🎯 Ready for More?

Production Usage: Add Observability to Your Agent

Deep Understanding: Distributed Tracing Explained