Observability

Practical guides for monitoring and observability setup

Observability How-to Guides

Practical step-by-step guides for setting up monitoring, metrics, and observability in your AgentHub deployments.

Available Guides

1 - How to Add Observability to Your Agent

Use AgentHub’s unified abstractions to automatically get distributed tracing, metrics, and structured logging in your agents.

How to Add Observability to Your Agent

Goal-oriented guide: Use AgentHub’s unified abstractions to automatically get distributed tracing, metrics, and structured logging in your agents with minimal configuration.

Prerequisites

  • Go 1.24+ installed
  • Basic understanding of AgentHub concepts
  • 10-15 minutes

Overview: What You Get Automatically

With AgentHub’s unified abstractions, you automatically get:

✅ Distributed Tracing - OpenTelemetry traces with correlation IDs
✅ Comprehensive Metrics - Performance and health monitoring
✅ Structured Logging - JSON logs with trace correlation
✅ Health Endpoints - HTTP health checks and metrics endpoints
✅ Graceful Shutdown - Clean resource management

Quick Start: Observable Agent in 5 Minutes

Step 1: Create Your Agent Using Abstractions

package main

import (
	"context"
	"time"

	"github.com/owulveryck/agenthub/internal/agenthub"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()

	// Create configuration (observability included automatically)
	config := agenthub.NewGRPCConfig("my-agent")
	config.HealthPort = "8083" // Unique port for your agent

	// Create AgentHub client (observability built-in)
	client, err := agenthub.NewAgentHubClient(config)
	if err != nil {
		panic("Failed to create AgentHub client: " + err.Error())
	}

	// Automatic graceful shutdown
	defer func() {
		shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 10*time.Second)
		defer shutdownCancel()
		if err := client.Shutdown(shutdownCtx); err != nil {
			client.Logger.ErrorContext(shutdownCtx, "Error during shutdown", "error", err)
		}
	}()

	// Start the client (enables observability)
	if err := client.Start(ctx); err != nil {
		client.Logger.ErrorContext(ctx, "Failed to start client", "error", err)
		panic(err)
	}

	// Your agent logic here...
	client.Logger.Info("My observable agent is running!")

	// Keep running
	select {}
}

That’s it! Your agent now has full observability.

Step 2: Configure Environment Variables

Set observability configuration via environment:

# Tracing configuration
export JAEGER_ENDPOINT="http://localhost:14268/api/traces"
export OTEL_SERVICE_NAME="my-agent"
export OTEL_SERVICE_VERSION="1.0.0"

# Health server port
export BROKER_HEALTH_PORT="8083"

# Broker connection
export AGENTHUB_BROKER_ADDR="localhost"
export AGENTHUB_BROKER_PORT="50051"

Step 3: Run Your Observable Agent

go run main.go

Expected Output:

time=2025-09-29T10:00:00.000Z level=INFO msg="Starting health server" port=8083
time=2025-09-29T10:00:00.000Z level=INFO msg="AgentHub client connected" broker_addr=localhost:50051
time=2025-09-29T10:00:00.000Z level=INFO msg="My observable agent is running!"

Available Observability Features

Automatic Health Endpoints

Your agent automatically exposes:

  • Health Check: http://localhost:8083/health
  • Metrics: http://localhost:8083/metrics (Prometheus format)
  • Readiness: http://localhost:8083/ready

Structured Logging

All logs are automatically structured with trace correlation:

{
  "time": "2025-09-29T10:00:00.000Z",
  "level": "INFO",
  "msg": "Task published",
  "trace_id": "abc123...",
  "span_id": "def456...",
  "task_type": "process_document",
  "correlation_id": "req_789"
}

Distributed Tracing

Traces are automatically created for:

  • gRPC calls to broker
  • Task publishing and subscribing
  • Custom operations (when you use the TraceManager)

Metrics Collection

Automatic metrics include:

  • Task processing duration
  • Success/failure rates
  • gRPC call metrics
  • Health check status

Advanced Usage

Adding Custom Tracing

Use the built-in TraceManager for custom operations:

// Custom operation with tracing
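// Assumes the OpenTelemetry packages "go.opentelemetry.io/otel/attribute"
// and "go.opentelemetry.io/otel/codes" are imported for attribute.String and codes.Error below.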
ctx, span := client.TraceManager.StartPublishSpan(ctx, "my_operation", "document")
defer span.End()

// Add custom attributes
client.TraceManager.AddComponentAttribute(span, "my-component")
span.SetAttributes(attribute.String("document.id", "doc-123"))

// Your operation logic
result, err := doCustomOperation(ctx)
if err != nil {
    span.RecordError(err)
    span.SetStatus(codes.Error, err.Error())
}

Adding Custom Metrics

Use the MetricsManager for custom metrics:

// Start timing an operation
timer := client.MetricsManager.StartTimer()
defer timer(ctx, "my_operation", "my-component")

// Your operation
processDocument()

Custom Log Fields

Use the structured logger with context:

client.Logger.InfoContext(ctx, "Processing document",
    "document_id", "doc-123",
    "user_id", "user-456",
    "processing_type", "ocr",
)

Publisher Example with Observability

package main

import (
	"context"
	"time"

	pb "github.com/owulveryck/agenthub/events/a2a"
	"github.com/owulveryck/agenthub/internal/agenthub"
	"google.golang.org/protobuf/types/known/structpb"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()

	// Observable client setup
	config := agenthub.NewGRPCConfig("publisher")
	config.HealthPort = "8081"

	client, err := agenthub.NewAgentHubClient(config)
	if err != nil {
		panic(err)
	}
	defer client.Shutdown(context.Background())

	if err := client.Start(ctx); err != nil {
		panic(err)
	}

	// Create observable task publisher
	publisher := &agenthub.TaskPublisher{
		Client:         client.Client,
		TraceManager:   client.TraceManager,
		MetricsManager: client.MetricsManager,
		Logger:         client.Logger,
		ComponentName:  "publisher",
	}

	// Publish task with automatic tracing
	data, _ := structpb.NewStruct(map[string]interface{}{
		"message": "Hello, observable world!",
	})

	task := &pb.TaskMessage{
		TaskId:   "task-123",
		TaskType: "greeting",
		Data:     data,
		Priority: pb.Priority_MEDIUM,
	}

	// Automatically traced and metered
	if err := publisher.PublishTask(ctx, task); err != nil {
		client.Logger.ErrorContext(ctx, "Failed to publish task", "error", err)
	} else {
		client.Logger.InfoContext(ctx, "Task published successfully", "task_id", task.TaskId)
	}
}

Subscriber Example with Observability

package main

import (
	"context"
	"os"
	"os/signal"
	"syscall"

	pb "github.com/owulveryck/agenthub/events/a2a"
	"github.com/owulveryck/agenthub/internal/agenthub"
	"google.golang.org/protobuf/types/known/structpb"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Observable client setup
	config := agenthub.NewGRPCConfig("subscriber")
	config.HealthPort = "8082"

	client, err := agenthub.NewAgentHubClient(config)
	if err != nil {
		panic(err)
	}
	defer client.Shutdown(context.Background())

	if err := client.Start(ctx); err != nil {
		panic(err)
	}

	// Create observable task subscriber
	subscriber := agenthub.NewTaskSubscriber(client, "my-subscriber")

	// Register handler with automatic tracing
	subscriber.RegisterHandler("greeting", func(ctx context.Context, task *pb.TaskMessage) (*structpb.Struct, pb.TaskStatus, string) {
		// This is automatically traced and logged
		client.Logger.InfoContext(ctx, "Processing greeting task", "task_id", task.TaskId)

		// Your processing logic
		result, _ := structpb.NewStruct(map[string]interface{}{
			"response": "Hello back!",
		})

		return result, pb.TaskStatus_COMPLETED, ""
	})

	// Start processing with automatic observability
	go subscriber.StartProcessing(ctx)

	// Graceful shutdown
	sigChan := make(chan os.Signal, 1)
	signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)
	<-sigChan
}

Configuration Reference

📖 Complete Reference: For all environment variables and configuration options, see Environment Variables Reference

Key Environment Variables

Variable             | Description              | Default
---------------------|--------------------------|----------------------
JAEGER_ENDPOINT      | Jaeger tracing endpoint  | "" (tracing disabled)
SERVICE_NAME         | Service name for tracing | "agenthub-service"
SERVICE_VERSION      | Service version          | "1.0.0"
BROKER_HEALTH_PORT   | Health endpoint port     | "8080"
AGENTHUB_BROKER_ADDR | Broker address           | "localhost"
AGENTHUB_BROKER_PORT | Broker port              | "50051"

Health Endpoints

Each agent exposes these endpoints:

Endpoint | Purpose               | Response
---------|-----------------------|----------------
/health  | Overall health status | JSON status
/metrics | Prometheus metrics    | Metrics format
/ready   | Readiness check       | 200 OK or 503

Troubleshooting

Common Issues

Issue                          | Solution
-------------------------------|-------------------------------------------------------
No traces in Jaeger            | Set the JAEGER_ENDPOINT environment variable
Health endpoint not accessible | Check that BROKER_HEALTH_PORT is unique
Logs not structured            | Use client.Logger instead of the standard log package
Missing correlation IDs        | Pass a context.Context through all operations

Verification Steps

  1. Check health endpoint:

    curl http://localhost:8083/health
    
  2. Verify metrics:

    curl http://localhost:8083/metrics
    
  3. Check traces in Jaeger:

    • Open http://localhost:16686
    • Search for your service name

Migration from Manual Setup

If you have existing agents using manual observability setup:

Old Approach (Manual)

// 50+ lines of OpenTelemetry setup
obs, err := observability.NewObservability(config)
traceManager := observability.NewTraceManager(serviceName)
// Manual gRPC client setup
// Manual health server setup

New Approach (Unified)

// 3 lines - everything automatic
config := agenthub.NewGRPCConfig("my-agent")
client, err := agenthub.NewAgentHubClient(config)
client.Start(ctx)

The unified abstractions provide the same observability features with 90% less code and no manual setup required.


With AgentHub’s unified abstractions, observability is no longer an add-on feature but a built-in capability that comes automatically with every agent. Focus on your business logic while the platform handles monitoring, tracing, and health checks for you.

2 - How to Use Grafana Dashboards

Master the AgentHub observability dashboards to monitor, analyze, and troubleshoot your event-driven system effectively.

How to Use Grafana Dashboards

Goal-oriented guide: Master the AgentHub observability dashboards to monitor, analyze, and troubleshoot your event-driven system effectively.

Prerequisites

  • AgentHub observability stack running (docker-compose up -d)
  • AgentHub agents running with observability enabled
  • Basic understanding of metrics concepts
  • 10-15 minutes

Quick Access

  • Grafana Dashboard: http://localhost:3333 (admin/admin)
  • Direct Dashboard: http://localhost:3333/d/agenthub-eda-dashboard

Dashboard Overview

The AgentHub EDA System Observatory provides comprehensive monitoring across three main areas:

  1. Event Metrics (Top Row) - Event processing performance
  2. Distributed Tracing (Middle) - Request flow visualization
  3. System Health (Bottom Row) - Infrastructure monitoring

Panel-by-Panel Guide

🚀 Event Processing Rate (Top Left)

What it shows: Events processed per second by each service

How to use:

  • Monitor throughput: See how many events your system processes
  • Identify bottlenecks: Low rates may indicate performance issues
  • Compare services: See which agents are busiest

Reading the chart:

Green line: agenthub-broker (150 events/sec)
Blue line:  agenthub-publisher (50 events/sec)
Red line:   agenthub-subscriber (145 events/sec)
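
To reproduce this panel in Explore or a custom panel, a query along these lines is a reasonable starting point. This is a sketch that assumes the events_processed_total counter used in the alerting example later in this guide and a service label on it; your dashboard's exact query and label names may differ:

# Per-service event throughput over the last 5 minutes
sum by (service) (rate(events_processed_total[5m]))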

Troubleshooting:

  • Flat lines: No activity - check if agents are running
  • Dropping rates: Performance degradation - check CPU/memory
  • Spiky patterns: Bursty workloads - consider load balancing

🚨 Event Processing Error Rate (Top Right)

What it shows: Percentage of events that failed processing

How to use:

  • Monitor reliability: Should stay below 5% (green zone)
  • Alert threshold: Yellow above 5%, red above 10%
  • Quick health check: At-a-glance view of system reliability

Color coding:

  • Green (0-5%): Healthy system
  • Yellow (5-10%): Moderate issues
  • Red (>10%): Critical problems
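
The percentage shown here is the same ratio used in the alerting example later in this guide, so you can run it directly in Explore (assuming the events_processed_total and event_errors_total counters from that example):

# Percentage of failed events over the last 5 minutes
rate(event_errors_total[5m]) / rate(events_processed_total[5m]) * 100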

Troubleshooting:

  • High error rates: Check Jaeger for failing traces
  • Sudden spikes: Look for recent deployments or config changes
  • Persistent errors: Check logs for recurring issues

📈 Event Types Distribution (Middle Left)

What it shows: Breakdown of event types by volume

How to use:

  • Understand workload: See what types of tasks dominate
  • Capacity planning: Identify which task types need scaling
  • Anomaly detection: Unusual distributions may indicate issues

Example interpretation:

greeting: 40% (blue) - Most common task type
math_calculation: 35% (green) - Heavy computational tasks
random_number: 20% (yellow) - Quick tasks
unknown_task: 5% (red) - Error-generating tasks

Troubleshooting:

  • Missing task types: Check if specific agents are down
  • Unexpected distributions: May indicate upstream issues
  • Dominant error types: Focus optimization efforts

⏱️ Event Processing Latency (Middle Right)

What it shows: Processing time percentiles (p50, p95, p99)

How to use:

  • Performance monitoring: Track how fast events are processed
  • SLA compliance: Ensure latencies meet requirements
  • Outlier detection: p99 shows worst-case scenarios

Understanding percentiles:

  • p50 (median): 50% of events process faster than this
  • p95: 95% of events process faster than this
  • p99: 99% of events process faster than this

Healthy ranges:

  • p50: < 50ms (very responsive)
  • p95: < 200ms (good performance)
  • p99: < 500ms (acceptable outliers)
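
To query a percentile yourself, a histogram_quantile sketch like the following is the usual approach. The bucket metric name here (event_processing_duration_seconds_bucket) is illustrative and may not match the exact histogram exported by your AgentHub version:

# Approximate p95 processing latency over the last 5 minutes
histogram_quantile(0.95, sum by (le) (rate(event_processing_duration_seconds_bucket[5m])))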

Troubleshooting:

  • Rising latencies: Check CPU/memory usage
  • High p99: Look for resource contention or long-running tasks
  • Flatlined metrics: May indicate measurement issues

🔍 Distributed Traces (Middle Section)

What it shows: Integration with Jaeger for trace visualization

How to use:

  1. Click “Explore” to open Jaeger
  2. Select service from dropdown
  3. Find specific traces to debug issues
  4. Analyze request flows across services

When to use:

  • Debugging errors: Find root cause of failures
  • Performance analysis: Identify slow operations
  • Understanding flows: See complete request journeys

🖥️ Service CPU Usage (Bottom Left)

What it shows: CPU utilization by service

How to use:

  • Capacity monitoring: Ensure services aren’t overloaded
  • Resource planning: Identify when to scale
  • Performance correlation: High CPU often explains high latency

Healthy ranges:

  • < 50%: Comfortable utilization
  • 50-70%: Moderate load
  • > 70%: Consider scaling
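
A rough equivalent query you can adapt, assuming the standard process_cpu_seconds_total counter exported by the Go Prometheus client (filter by job or instance labels as needed):

# Approximate CPU utilization as a percentage of one core
rate(process_cpu_seconds_total[5m]) * 100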

💾 Service Memory Usage (Bottom Center)

What it shows: Memory consumption by service

How to use:

  • Memory leak detection: Watch for continuously growing usage
  • Capacity planning: Ensure sufficient memory allocation
  • Garbage collection: High usage may impact performance

Monitoring tips:

  • Steady growth: May indicate memory leaks
  • Sawtooth pattern: Normal GC behavior
  • Sudden spikes: Check for large event batches
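
Two standard gauges from the Go Prometheus client are useful here; both are exported automatically and can be queried as-is (add job or instance label filters as needed):

# Resident memory of the process
process_resident_memory_bytes

# Heap bytes currently allocated and in use by the Go runtime
go_memstats_alloc_bytes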

🧵 Go Goroutines (Bottom Right)

What it shows: Number of concurrent goroutines per service

How to use:

  • Concurrency monitoring: Track parallel processing
  • Resource leak detection: Continuously growing numbers indicate leaks
  • Performance tuning: Optimize concurrency levels

Normal patterns:

  • Stable baseline: Normal operation
  • Activity spikes: During high load
  • Continuous growth: Potential goroutine leaks
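
The underlying gauge is the standard go_goroutines metric from the Go Prometheus client; a persistently positive slope is a common leak signal (the 30-minute window below is just an example):

# Current goroutine count per instance
go_goroutines

# Per-second growth rate over 30 minutes; persistently positive values can hint at a leak
deriv(go_goroutines[30m])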

🏥 Service Health Status (Bottom Far Right)

What it shows: Up/down status of each service

How to use:

  • Quick status check: See if all services are running
  • Outage detection: Immediately identify down services
  • Health monitoring: Green = UP, Red = DOWN
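
This kind of panel is typically driven by Prometheus's built-in up metric, which is 1 when the last scrape of a target succeeded and 0 when it failed. The job regex below is an example and depends on how your scrape jobs are named:

# 1 = target scraped successfully, 0 = scrape failed
up{job=~"agenthub.*"}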

Dashboard Variables and Filters

Service Filter

Location: Top of dashboard
Purpose: Filter metrics by specific services
Usage:

  • Select “All” to see everything
  • Choose specific services to focus analysis
  • Useful for isolating problems to specific components

Event Type Filter

Location: Top of dashboard
Purpose: Filter by event/task types
Usage:

  • Analyze specific workflow types
  • Debug particular task categories
  • Compare performance across task types

Time Range Selector

Location: Top right of dashboard
Purpose: Control time window for analysis
Common ranges:

  • 5 minutes: Real-time monitoring
  • 1 hour: Recent trend analysis
  • 24 hours: Daily pattern analysis
  • 7 days: Weekly trend and capacity planning

Advanced Usage Patterns

Performance Investigation Workflow

  1. Start with Overview:

    • Check error rates (should be < 5%)
    • Verify processing rates look normal
    • Scan for any red/yellow indicators
  2. Drill Down on Issues:

    • If high error rates → check distributed traces
    • If high latency → examine CPU/memory usage
    • If low throughput → check service health
  3. Root Cause Analysis:

    • Use time range selector to find when problems started
    • Filter by specific services to isolate issues
    • Correlate metrics across different panels

Capacity Planning Workflow

  1. Analyze Peak Patterns:

    • Set time range to 7 days
    • Identify peak usage periods
    • Note maximum throughput achieved
  2. Resource Utilization:

    • Check CPU usage during peaks
    • Monitor memory consumption trends
    • Verify goroutine scaling behavior
  3. Plan Scaling:

    • If CPU > 70% during peaks, scale up
    • If memory continuously growing, investigate leaks
    • If error rates spike during load, optimize before scaling

Troubleshooting Workflow

  1. Identify Symptoms:

    • High error rates: Focus on traces and logs
    • High latency: Check resource utilization
    • Low throughput: Verify service health
  2. Time Correlation:

    • Use time range to find when issues started
    • Look for correlated changes across metrics
    • Check for deployment or configuration changes
  3. Service Isolation:

    • Use service filter to identify problematic components
    • Compare healthy vs unhealthy services
    • Check inter-service dependencies

Dashboard Customization

Adding New Panels

  1. Click “+ Add panel” in top menu
  2. Choose visualization type:
    • Time series for trends
    • Stat for current values
    • Gauge for thresholds
  3. Configure query:
    # Example: Custom error rate
    rate(my_custom_errors_total[5m]) / rate(my_custom_requests_total[5m]) * 100
    

Creating Alerts

  1. Edit existing panel or create new one
  2. Click “Alert” tab
  3. Configure conditions:
    Query: rate(event_errors_total[5m]) / rate(events_processed_total[5m]) * 100
    Condition: IS ABOVE 5
    Evaluation: Every 1m for 2m
    
  4. Set notification channels

Custom Time Ranges

  1. Click time picker (top right)
  2. Select “Custom range”
  3. Set specific dates/times for historical analysis
  4. Use “Refresh” settings for auto-updating

Troubleshooting Dashboard Issues

Dashboard Not Loading

# Check Grafana status
docker-compose ps grafana

# Check Grafana logs
docker-compose logs grafana

# Restart if needed
docker-compose restart grafana

No Data in Panels

# Check Prometheus connection
curl http://localhost:9090/api/v1/targets

# Verify agents are exposing metrics
curl http://localhost:8080/metrics
curl http://localhost:8081/metrics
curl http://localhost:8082/metrics

# Check Prometheus configuration
docker-compose logs prometheus

Slow Dashboard Performance

  1. Reduce time range: Use shorter windows for better performance
  2. Limit service selection: Filter to specific services
  3. Optimize queries: Use appropriate rate intervals
  4. Check resource usage: Ensure Prometheus has enough memory

Authentication Issues

  • Default credentials: admin/admin
  • Reset password: Through Grafana UI after first login
  • Lost access: Restart Grafana container to reset

Best Practices

Regular Monitoring

  • Check dashboard daily: Quick health overview
  • Weekly reviews: Trend analysis and capacity planning
  • Set up alerts: Proactive monitoring for critical metrics

Performance Optimization

  • Use appropriate time ranges: Don’t query more data than needed
  • Filter effectively: Use service and event type filters
  • Refresh intervals: Balance real-time needs with performance

Team Usage

  • Share dashboard URLs: Bookmark specific views
  • Create annotations: Mark deployments and incidents
  • Export snapshots: Share findings with team members

Integration with Other Tools

Jaeger Integration

  • Click Explore in traces panel
  • Auto-links to Jaeger with service context
  • Correlate traces with metrics timeframes

Prometheus Integration

  • Click Explore on any panel
  • Edit queries in Prometheus query language
  • Access raw metrics for custom analysis

Log Correlation

  • Use trace IDs from Jaeger
  • Search logs for matching trace IDs
  • Correlate log events with metric spikes

🎯 Next Steps:

Deep Debugging: Debug with Distributed Tracing

Production Setup: Configure Alerts

Understanding: Observability Architecture Explained