Observability

Tutorials for monitoring and observing AgentHub systems

Observability Tutorials

Learn how to monitor, trace, and observe your AgentHub deployments with comprehensive observability features.

Available Tutorials

1 - Interactive Dashboard Tour

Take a guided tour through AgentHub's Grafana dashboards while the system is running, learning to interpret metrics, identify issues, and understand system behavior in real-time.

Interactive Dashboard Tour

Learn by doing: Take a guided tour through AgentHub's Grafana dashboards while the system is running, learning to interpret metrics, identify issues, and understand system behavior in real-time.

Prerequisites

  • Observability stack running (from the Observability Demo)
  • Observable agents running (broker, publisher, subscriber)
  • Grafana open at http://localhost:3333
  • 10-15 minutes for the complete tour

Quick Setup Reminder

If you haven't completed the observability demo yet:

# Start observability stack
cd agenthub/observability
docker-compose up -d

# Run observable agents (3 terminals)
go run broker/main.go
go run agents/subscriber/main.go
go run agents/publisher/main.go

Dashboard Navigation

Accessing the Main Dashboard

  1. Open Grafana: http://localhost:3333
  2. Login: admin / admin (skip password change for demo)
  3. Navigate: Dashboards → Browse → AgentHub → "AgentHub EDA System Observatory"
  4. Bookmark: Save this URL for quick access: http://localhost:3333/d/agenthub-eda-dashboard

Dashboard Layout Overview

The dashboard is organized in 4 main rows:

🎯 Row 1: Event Processing Overview
├── Event Processing Rate (events/sec)
└── Event Processing Error Rate (%)

📊 Row 2: Event Analysis
├── Event Types Distribution (pie chart)
└── Event Processing Latency (p50, p95, p99)

🔍 Row 3: Distributed Tracing
└── Jaeger Integration Panel

💻 Row 4: System Health
├── Service CPU Usage (%)
├── Service Memory Usage (MB)
├── Go Goroutines Count
└── Service Health Status

Interactive Tour

Tour 1: Understanding Event Flow (3 minutes)

Step 1: Watch the Event Processing Rate

Location: Top-left panel
What to observe: Real-time lines showing events per second

  1. Identify the services:

    • Green line: agenthub-broker (should be highest - processes all events)
    • Blue line: agenthub-publisher (events being created)
    • Orange line: agenthub-subscriber (events being processed)
  2. Watch the pattern:

    • Publisher creates bursts of events
    • Broker immediately processes them (routing)
    • Subscriber processes them shortly after
  3. Understand the flow:

    Publisher (creates) → Broker (routes) → Subscriber (processes)
         50/sec      →      150/sec     →      145/sec
    

💡 Tour Insight: The broker rate is higher because it processes both incoming tasks AND outgoing results.

Step 2: Monitor Error Rates

Location: Top-right panel (gauge)
What to observe: Error percentage gauge

  1. Healthy system: Should show 0-2% (green zone)

  2. If you see higher errors:

    • Check if all services are running
    • Look for red traces in Jaeger (we'll do this next)
  3. Error rate calculation:

    Error Rate = (Failed Events / Total Events) × 100
    

🎯 Action: Note your current error rate - we'll compare it later.
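As a quick worked example of that formula: if 3 of the last 150 events failed, the gauge reads (3 / 150) × 100 = 2%, right at the top of the green zone.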

Tour 2: Event Analysis Deep Dive (3 minutes)

Step 3: Explore Event Types

Location: Middle-left panel (pie chart)
What to observe: Distribution of different event types

  1. Identify event types:

    • greeting: Most common (usually 40-50%)
    • math_calculation: Compute-heavy tasks (30-40%)
    • random_number: Quick tasks (15-25%)
    • unknown_task: Error-generating tasks (2-5%)
  2. Business insights:

    • Larger slices = more frequent tasks
    • Small red slice = intentional error tasks for testing

💡 Tour Insight: The publisher randomly generates different task types to simulate real-world workload diversity.

Step 4: Analyze Processing Latency

Location: Middle-right panel
What to observe: Three latency lines (p50, p95, p99)

  1. Understand percentiles:

    • p50 (blue): 50% of events process faster than this
    • p95 (green): 95% of events process faster than this
    • p99 (red): 99% of events process faster than this
  2. Healthy ranges:

    • p50: < 50ms (very responsive)
    • p95: < 200ms (good performance)
    • p99: < 500ms (acceptable outliers)
  3. Pattern recognition:

    • Spiky p99 = occasional slow tasks (normal)
    • Rising p50 = systemic slowdown (investigate)
    • Flat lines = no activity or measurement issues
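To make the percentiles concrete: a p95 of 180ms means 95 out of every 100 events completed in 180ms or less, while only the slowest 5 took longer.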

🎯 Action: Hover over the lines to see exact values at different times.

Tour 3: Distributed Tracing Exploration (4 minutes)

Step 5: Jump into Jaeger

Location: Middle section - "Distributed Traces" panel
Action: Click the "Explore" button

This opens Jaeger in a new tab. Let's explore:

  1. In Jaeger UI:

    • Service dropdown: Select "agenthub-broker"
    • Operation: Leave as "All"
    • Click "Find Traces"
  2. Pick a trace to examine:

    • Look for traces that show multiple spans
    • Click on any trace line to open details
  3. Understand the trace structure:

    Timeline View:
    agenthub-publisher: publish_event [2ms]
      └── agenthub-broker: process_event [1ms]
          └── agenthub-subscriber: consume_event [3ms]
              └── agenthub-subscriber: process_task [15ms]
                  └── agenthub-subscriber: publish_result [2ms]
    
  4. Explore span details:

    • Click individual spans to see:
      • Tags: event_type, event_id, agent names
      • Process: Which service handled the span
      • Duration: Exact timing information

💡 Tour Insight: Each event creates a complete "trace" showing its journey from creation to completion.
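Those tags don't appear by magic: each service attaches them as span attributes when it creates the span. Here is a minimal sketch of that pattern using the standard OpenTelemetry Go API; the function and attribute keys are illustrative, not AgentHub's exact code:

package tour

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// publishEvent starts a span and attaches the tags you just saw in
// Jaeger (event_id, event_type, agent name) as span attributes.
func publishEvent(ctx context.Context, eventID, eventType string) {
	ctx, span := otel.Tracer("agenthub-publisher").Start(ctx, "publish_event")
	defer span.End()

	span.SetAttributes(
		attribute.String("event_id", eventID),
		attribute.String("event_type", eventType),
		attribute.String("agent", "agenthub-publisher"),
	)
	// The event would be published here, passing ctx along so the
	// broker's span becomes a child of this one.
	_ = ctx
}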

Step 6: Find and Analyze an Error

  1. Search for error traces:

    • In Jaeger, add tag filter: error=true
    • Or look for traces with red spans
  2. Examine the error trace:

    • Red spans indicate errors
    • Error tags show the error type and message
    • Stack traces help with debugging
  3. Follow the error propagation:

    • See how errors affect child spans
    • Notice error context in span attributes

🎯 Action: Find a trace with "unknown_task" event type - these are designed to fail for demonstration.

Tour 4: System Health Monitoring (3 minutes)

Step 7: Monitor Resource Usage

Location: Bottom row panels
What to observe: System resource consumption

  1. CPU Usage Panel (Bottom-left):

    • Normal range: 10-50% for demo workload
    • Watch for: Sustained high CPU (>70%)
    • Services comparison: See which service uses most CPU
  2. Memory Usage Panel (Bottom-center-left):

    • Normal range: 30-80MB per service for demo
    • Watch for: Continuously growing memory (memory leaks)
    • Pattern: Sawtooth = normal GC, steady growth = potential leak
  3. Goroutines Panel (Bottom-center-right):

    • Normal range: 10-50 goroutines per service
    • Watch for: Continuously growing count (goroutine leaks)
    • Pattern: Stable baseline with activity spikes

Step 8: Verify Service Health

Location: Bottom-right panel
What to observe: Service up/down status

  1. Health indicators:

    • Green: Service healthy and responding
    • Red: Service down or health check failing
    • Yellow: Service degraded but operational
  2. Health check details:

    • Each service exposes /health endpoint
    • Prometheus monitors these endpoints
    • Dashboard shows aggregated status

🎯 Action: Open http://localhost:8080/health in a new tab to see raw health data.
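For context, an endpoint like this is typically just a tiny HTTP handler. The sketch below is illustrative and assumes a plain net/http server; AgentHub's actual handler may return richer status fields:

package main

import (
	"encoding/json"
	"net/http"
)

// healthHandler answers health probes with HTTP 200 and a small JSON
// body, which is the contract the dashboard's health panel relies on.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]string{"status": "healthy"})
}

func main() {
	http.HandleFunc("/health", healthHandler)
	// Port 8080 matches the broker's health endpoint in this demo.
	http.ListenAndServe(":8080", nil)
}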

Tour 5: Time-based Analysis (2 minutes)

Step 9: Change Time Ranges

Location: Top-right of dashboard (time picker)
Current: Likely showing "Last 5 minutes"

  1. Try different ranges:

    • Last 15 minutes: See longer trends
    • Last 1 hour: See full demo session
    • Custom range: Pick specific time period
  2. Observe pattern changes:

    • Longer ranges: Show trends and patterns
    • Shorter ranges: Show real-time detail
    • Custom ranges: Zoom into specific incidents

Step 10: Use Dashboard Filters

Location: Top of dashboard - variable dropdowns

  1. Service Filter:

    • Select "All" to see everything
    • Pick specific service to focus analysis
    • Useful for isolating service-specific issues
  2. Event Type Filter:

    • Filter to specific event types
    • Compare performance across task types
    • Identify problematic event categories

💡 Tour Insight: Filtering helps you drill down from system-wide view to specific components or workloads.

Hands-on Experiments

Experiment 1: Create a Service Outage

Goal: See how the dashboard shows service failures

  1. Stop the subscriber:

    # In subscriber terminal, press Ctrl+C
    
  2. Watch the dashboard changes:

    • Error rate increases (top-right gauge turns red)
    • Subscriber metrics disappear from bottom panels
    • Service health shows subscriber as down
  3. Check Jaeger for failed traces:

    • Look for traces that don't complete
    • See where the chain breaks
  4. Restart subscriber:

    go run agents/subscriber/main.go
    

🎯 Learning: Dashboard immediately shows impact of service failures.

Experiment 2: Generate High Load

Goal: See system behavior under stress

  1. Modify publisher to generate more events:

    # Edit agents/publisher/main.go
    # Change: time.Sleep(5 * time.Second)
    # To:     time.Sleep(1 * time.Second)
    
  2. Watch dashboard changes:

    • Processing rate increases
    • Latency may increase
    • CPU/memory usage grows
  3. Observe scaling behavior:

    • How does the system handle increased load?
    • Do error rates increase?
    • Where are the bottlenecks?

🎯 Learning: Dashboard shows system performance characteristics under load.

Dashboard Interpretation Guide

What Good Looks Like

✅ Event Processing Rate: Steady activity matching workload
✅ Error Rate: < 5% (green zone)
✅ Event Types: Expected distribution
✅ Latency: p95 < 200ms, p99 < 500ms
✅ CPU Usage: < 50% sustained
✅ Memory: Stable or slow growth with GC cycles
✅ Goroutines: Stable baseline with activity spikes
✅ Service Health: All services green/up

Warning Signs

⚠️ Error Rate: 5-10% (yellow zone)
⚠️ Latency: p95 > 200ms or rising trend
⚠️ CPU: Sustained > 70%
⚠️ Memory: Continuous growth without GC
⚠️ Missing data: Gaps in metrics (service issues)

Critical Issues

🚨 Error Rate: > 10% (red zone)
🚨 Latency: p95 > 500ms
🚨 CPU: Sustained > 90%
🚨 Memory: Rapid growth or OOM
🚨 Service Health: Any service showing red/down
🚨 Traces: Missing or broken trace chains

Next Steps After the Tour

For Daily Operations:

  • Bookmark: Save dashboard URL for quick access
  • Set up alerts: Configure notifications for critical metrics
  • Create views: Use filters to create focused views for your team

For Development:

For Deep Understanding:

Troubleshooting Tour Issues

Issue                     Solution
Dashboard shows no data   Verify observability environment variables are set
Grafana won't load        Check docker-compose ps in observability/
Metrics missing           Verify Prometheus targets at http://localhost:9090/targets
Jaeger empty              Ensure trace context propagation is working

🎉 Congratulations! You've completed the interactive dashboard tour and learned to read AgentHub's observability signals like a pro!

🎯 Ready for More?

Master the Tools: Use Grafana Dashboards - Advanced dashboard usage

Troubleshoot Issues: Debug with Distributed Tracing - Use Jaeger effectively

2 - AgentHub Observability Demo Tutorial

Experience the complete observability stack with distributed tracing, real-time metrics, and intelligent alerting in under 10 minutes through hands-on learning.

AgentHub Observability Demo Tutorial

Learn by doing: Experience the complete observability stack with distributed tracing, real-time metrics, and intelligent alerting in under 10 minutes.

What You’ll Learn

By the end of this tutorial, you'll have:

  • ✅ Seen distributed traces flowing across multiple agents
  • ✅ Monitored real-time metrics in beautiful Grafana dashboards
  • ✅ Understood event correlation through trace IDs
  • ✅ Experienced intelligent alerting when things go wrong
  • ✅ Explored the complete observability stack components

Prerequisites

  • Go 1.24+ installed
  • Docker and Docker Compose installed
  • Environment variables configured (see Installation and Setup)
  • 10 minutes of your time
  • Basic terminal knowledge

💡 Environment Note: AgentHub agents automatically enable observability when JAEGER_ENDPOINT is configured. See Environment Variables Reference for all configuration options.

Step 1: Clone and Setup (1 minute)

# Clone the repository
git clone https://github.com/owulveryck/agenthub.git
cd agenthub

# Verify you have the observability files
ls observability/
# You should see: docker-compose.yml, grafana/, prometheus/, etc.

Step 2: Start the Observability Stack (2 minutes)

# Navigate to observability directory
cd observability

# Start all monitoring services
docker-compose up -d

# Verify services are running
docker-compose ps

Expected Output:

NAME                    COMMAND                  SERVICE             STATUS
agenthub-grafana        "/run.sh"                grafana             running
agenthub-jaeger         "/go/bin/all-in-one"     jaeger              running
agenthub-prometheus     "/bin/prometheus --c…"   prometheus          running
agenthub-otel-collector "/otelcol-contrib --…"   otel-collector      running

🎯 Checkpoint 1: All services should be "running". If not, check Docker logs: docker-compose logs <service-name>

Step 3: Access the Dashboards (1 minute)

Open these URLs in your browser (keep them open in tabs):

Service      URL                      Purpose
Grafana      http://localhost:3333    Main observability dashboard
Jaeger       http://localhost:16686   Distributed tracing
Prometheus   http://localhost:9090    Raw metrics and alerts

Grafana Login: admin / admin (skip password change for demo)

🎯 Checkpoint 2: You should see Grafana's welcome page and Jaeger's empty trace list.

Step 4: Start the Observable Broker (1 minute)

Open a new terminal and navigate back to the project root:

# From agenthub root directory
go run broker/main.go

Expected Output:

time=2025-09-28T21:00:00.000Z level=INFO msg="Starting health server on port 8080"
time=2025-09-28T21:00:00.000Z level=INFO msg="AgentHub broker gRPC server with observability listening" address="[::]:50051" health_endpoint="http://localhost:8080/health" metrics_endpoint="http://localhost:8080/metrics"

🎯 Checkpoint 3:

  • Broker is listening on port 50051
  • Health endpoint available at http://localhost:8080/health
  • Metrics endpoint available at http://localhost:8080/metrics

Step 5: Start the Observable Subscriber (1 minute)

Open another terminal:

go run agents/subscriber/main.go

Expected Output:

time=2025-09-28T21:00:01.000Z level=INFO msg="Starting health server on port 8082"
time=2025-09-28T21:00:01.000Z level=INFO msg="Starting observable subscriber"
time=2025-09-28T21:00:01.000Z level=INFO msg="Agent started with observability. Listening for events and tasks."

🎯 Checkpoint 4:

  • Subscriber is connected and listening
  • Health available at http://localhost:8082/health

Step 6: Generate Events with the Publisher (2 minutes)

Open a third terminal:

go run agents/publisher/main.go

Expected Output:

time=2025-09-28T21:00:02.000Z level=INFO msg="Starting health server on port 8081"
time=2025-09-28T21:00:02.000Z level=INFO msg="Starting observable publisher demo"
time=2025-09-28T21:00:02.000Z level=INFO msg="Publishing task" task_id=task_greeting_1727557202 task_type=greeting responder_agent_id=agent_demo_subscriber
time=2025-09-28T21:00:02.000Z level=INFO msg="Task published successfully" task_id=task_greeting_1727557202 task_type=greeting

🎯 Checkpoint 5: You should see:

  • Publisher creating and sending tasks
  • Subscriber receiving and processing tasks
  • Broker routing messages between them

Step 7: Explore Real-time Metrics in Grafana (2 minutes)

  1. Go to Grafana: http://localhost:3333
  2. Navigate to Dashboards → Browse → AgentHub → "AgentHub EDA System Observatory"
  3. Observe the real-time data:

What You'll See:

Event Processing Rate (Top Left)

  • Lines showing events/second for each service
  • Should show activity spikes when publisher runs

Error Rate (Top Right)

  • Gauge showing error percentage
  • Should be green (< 5% errors)

Event Types Distribution (Middle Left)

  • Pie chart showing task types: greeting, math_calculation, random_number
  • Different colors for each task type

Processing Latency (Middle Right)

  • Three lines: p50, p95, p99 latencies
  • Should show sub-second processing times

System Health (Bottom)

  • CPU usage, memory usage, goroutines
  • Service health status (all should be UP)

🎯 Checkpoint 6: Dashboard should show live metrics with recent activity.

Step 8: Explore Distributed Traces in Jaeger (2 minutes)

  1. Go to Jaeger: http://localhost:16686
  2. Select Service: Choose "agenthub-broker" from dropdown
  3. Click "Find Traces"
  4. Click on any trace to see details

What You'll See:

Complete Event Journey:

agenthub-publisher: publish_event (2ms)
  └── agenthub-broker: process_event (1ms)
      └── agenthub-subscriber: consume_event (5ms)
          └── agenthub-subscriber: process_task (15ms)
              └── agenthub-subscriber: publish_result (2ms)

Trace Details:

  • Span Tags: event_id, event_type, service names
  • Timing Information: Exact start/end times and durations
  • Log Correlation: Each span linked to structured logs

Error Detection:

  • Look for red spans indicating errors
  • Trace the "unknown_task" type to see how errors propagate

🎯 Checkpoint 7: You should see complete traces showing the full event lifecycle.

Step 9: Correlate Logs with Traces (1 minute)

  1. Copy a trace ID from Jaeger (the long hex string)

  2. Check broker logs for that trace ID:

    # In your broker terminal, look for lines like:
    time=2025-09-28T21:00:02.000Z level=INFO msg="Received task request" task_id=task_greeting_1727557202 trace_id=a1b2c3d4e5f6...
    
  3. Check subscriber logs for the same trace ID

🎯 Checkpoint 8: You should find the same trace_id in logs across multiple services.
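When you instrument your own agents, this correlation comes from writing the active span's trace ID into every log line. A minimal sketch, assuming the OpenTelemetry Go SDK and log/slog rather than AgentHub's exact logging helper:

package agent

import (
	"context"
	"log/slog"

	"go.opentelemetry.io/otel/trace"
)

// logWithTrace emits a structured log line carrying the current trace
// ID, so the entry can be matched to the same trace in Jaeger.
func logWithTrace(ctx context.Context, msg string, args ...any) {
	if sc := trace.SpanContextFromContext(ctx); sc.IsValid() {
		args = append(args, "trace_id", sc.TraceID().String())
	}
	slog.InfoContext(ctx, msg, args...)
}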

Step 10: Experience Intelligent Alerting (Optional)

To see alerting in action:

  1. Simulate errors by stopping the subscriber:

    # In subscriber terminal, press Ctrl+C
    
  2. Keep publisher running (it will fail to process tasks)

  3. Check Prometheus alerts:

    • Go to http://localhost:9090/alerts
    • After ~5 minutes, you should see "HighEventProcessingErrorRate" firing
  4. Restart subscriber to clear the alert

🎉 Congratulations!

You've successfully experienced the complete AgentHub observability stack!

Summary: What You Accomplished

✅ Deployed a complete observability stack with Docker Compose
✅ Ran observable agents with automatic instrumentation
✅ Monitored real-time metrics in Grafana dashboards
✅ Traced event flows across multiple services with Jaeger
✅ Correlated logs with traces using trace IDs
✅ Experienced intelligent alerting with Prometheus
✅ Understood the complete event lifecycle from publisher to subscriber

Key Observability Concepts You Learned

Distributed Tracing

  • Events get unique trace IDs that follow them everywhere
  • Each processing step creates a "span" with timing information
  • Complete request flows are visible across service boundaries
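In code, creating one of those spans looks roughly like the sketch below, which uses the standard OpenTelemetry Go API (AgentHub's own wrapper may differ):

package agent

import (
	"context"

	"go.opentelemetry.io/otel"
)

// processTask wraps one processing step in its own span. Because ctx
// already carries the incoming trace context, the new span nests under
// its parent and shares the same trace ID in Jaeger.
func processTask(ctx context.Context, work func(context.Context) error) error {
	ctx, span := otel.Tracer("agenthub-subscriber").Start(ctx, "process_task")
	defer span.End()
	return work(ctx)
}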

Metrics Collection

  • 47+ different metrics automatically collected
  • Real-time visualization of system health and performance
  • Historical data for trend analysis
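As a sketch of how one such metric is produced with the OpenTelemetry metrics API (the meter and counter names here are illustrative, not necessarily the ones AgentHub registers):

package agent

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// countEvent increments a counter labeled by event type; Prometheus
// scrapes the aggregated values from the service's /metrics endpoint.
func countEvent(ctx context.Context, eventType string) error {
	counter, err := otel.Meter("agenthub-subscriber").
		Int64Counter("events_processed_total")
	if err != nil {
		return err
	}
	counter.Add(ctx, 1, metric.WithAttributes(
		attribute.String("event_type", eventType),
	))
	return nil
}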

Structured Logging

  • All logs include trace context for correlation
  • Consistent format across all services
  • Easy debugging and troubleshooting

Intelligent Alerting

  • Proactive monitoring for error rates and performance
  • Automatic notifications when thresholds are exceeded
  • Helps prevent issues before they impact users

Next Steps

For Development:

For Operations:

For Understanding:

Troubleshooting

Issue                   Solution
Services won't start    Run docker-compose down && docker-compose up -d
No metrics in Grafana   Check Prometheus targets: http://localhost:9090/targets
No traces in Jaeger     Verify JAEGER_ENDPOINT environment variable is set correctly
Permission errors       Ensure Docker has proper permissions

Clean Up

When you're done exploring:

# Stop the observability stack
cd observability
docker-compose down

# Stop the Go applications
# Press Ctrl+C in each terminal running the agents

🎯 Ready for More?

Production Usage: Add Observability to Your Agent

Deep Understanding: Distributed Tracing Explained