Observability

Practical guides for monitoring and observability setup

Observability How-to Guides

Practical step-by-step guides for setting up monitoring, metrics, and observability in your AgentHub deployments.

Available Guides

1 - How to Add Observability to Your Agent

Use AgentHub’s unified abstractions to automatically get distributed tracing, metrics, and structured logging in your agents.

How to Add Observability to Your Agent

Goal-oriented guide: Use AgentHub’s unified abstractions to automatically get distributed tracing, metrics, and structured logging in your agents with minimal configuration.

Prerequisites

  • Go 1.24+ installed
  • Basic understanding of AgentHub concepts
  • 10-15 minutes

Overview: What You Get Automatically

With AgentHub’s unified abstractions, you automatically get:

✅ Distributed Tracing - OpenTelemetry traces with correlation IDs
✅ Comprehensive Metrics - Performance and health monitoring
✅ Structured Logging - JSON logs with trace correlation
✅ Health Endpoints - HTTP health checks and metrics endpoints
✅ Graceful Shutdown - Clean resource management

Quick Start: Observable Agent in 5 Minutes

Step 1: Create Your Agent Using Abstractions

package main

import (
	"context"
	"time"

	"github.com/owulveryck/agenthub/internal/agenthub"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()

	// Create configuration (observability included automatically)
	config := agenthub.NewGRPCConfig("my-agent")
	config.HealthPort = "8083" // Unique port for your agent

	// Create AgentHub client (observability built-in)
	client, err := agenthub.NewAgentHubClient(config)
	if err != nil {
		panic("Failed to create AgentHub client: " + err.Error())
	}

	// Automatic graceful shutdown
	defer func() {
		shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 10*time.Second)
		defer shutdownCancel()
		if err := client.Shutdown(shutdownCtx); err != nil {
			client.Logger.ErrorContext(shutdownCtx, "Error during shutdown", "error", err)
		}
	}()

	// Start the client (enables observability)
	if err := client.Start(ctx); err != nil {
		client.Logger.ErrorContext(ctx, "Failed to start client", "error", err)
		panic(err)
	}

	// Your agent logic here...
	client.Logger.Info("My observable agent is running!")

	// Keep running
	select {}
}

That’s it! Your agent now has full observability.

Step 2: Configure Environment Variables

Set observability configuration via environment:

# Tracing configuration
export JAEGER_ENDPOINT="http://localhost:14268/api/traces"
export OTEL_SERVICE_NAME="my-agent"
export OTEL_SERVICE_VERSION="1.0.0"

# Health server port
export BROKER_HEALTH_PORT="8083"

# Broker connection
export AGENTHUB_BROKER_ADDR="localhost"
export AGENTHUB_BROKER_PORT="50051"

Step 3: Run Your Observable Agent

go run main.go

Expected Output:

time=2025-09-29T10:00:00.000Z level=INFO msg="Starting health server" port=8083
time=2025-09-29T10:00:00.000Z level=INFO msg="AgentHub client connected" broker_addr=localhost:50051
time=2025-09-29T10:00:00.000Z level=INFO msg="My observable agent is running!"

Available Observability Features

Automatic Health Endpoints

Your agent automatically exposes:

  • Health Check: http://localhost:8083/health
  • Metrics: http://localhost:8083/metrics (Prometheus format)
  • Readiness: http://localhost:8083/ready

Structured Logging

All logs are automatically structured with trace correlation:

{
  "time": "2025-09-29T10:00:00.000Z",
  "level": "INFO",
  "msg": "Task published",
  "trace_id": "abc123...",
  "span_id": "def456...",
  "task_type": "process_document",
  "correlation_id": "req_789"
}

Distributed Tracing

Traces are automatically created for:

  • gRPC calls to broker
  • Task publishing and subscribing
  • Custom operations (when you use the TraceManager)

Metrics Collection

Automatic metrics include:

  • Task processing duration
  • Success/failure rates
  • gRPC call metrics
  • Health check status

Advanced Usage

Adding Custom Tracing

Use the built-in TraceManager for custom operations:

// Custom operation with tracing
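// Assumes the OpenTelemetry packages "go.opentelemetry.io/otel/attribute"
// and "go.opentelemetry.io/otel/codes" are imported for attribute.String and codes.Error below.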
ctx, span := client.TraceManager.StartPublishSpan(ctx, "my_operation", "document")
defer span.End()

// Add custom attributes
client.TraceManager.AddComponentAttribute(span, "my-component")
span.SetAttributes(attribute.String("document.id", "doc-123"))

// Your operation logic
result, err := doCustomOperation(ctx)
if err != nil {
    span.RecordError(err)
    span.SetStatus(codes.Error, err.Error())
}

Adding Custom Metrics

Use the MetricsManager for custom metrics:

// Start timing an operation
timer := client.MetricsManager.StartTimer()
defer timer(ctx, "my_operation", "my-component")

// Your operation
processDocument()

Custom Log Fields

Use the structured logger with context:

client.Logger.InfoContext(ctx, "Processing document",
    "document_id", "doc-123",
    "user_id", "user-456",
    "processing_type", "ocr",
)

Publisher Example with Observability

package main

import (
	"context"
	"time"

	pb "github.com/owulveryck/agenthub/events/a2a"
	"github.com/owulveryck/agenthub/internal/agenthub"
	"google.golang.org/protobuf/types/known/structpb"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()

	// Observable client setup
	config := agenthub.NewGRPCConfig("publisher")
	config.HealthPort = "8081"

	client, err := agenthub.NewAgentHubClient(config)
	if err != nil {
		panic(err)
	}
	defer client.Shutdown(context.Background())

	if err := client.Start(ctx); err != nil {
		panic(err)
	}

	// Create observable task publisher
	publisher := &agenthub.TaskPublisher{
		Client:         client.Client,
		TraceManager:   client.TraceManager,
		MetricsManager: client.MetricsManager,
		Logger:         client.Logger,
		ComponentName:  "publisher",
	}

	// Publish task with automatic tracing
	data, _ := structpb.NewStruct(map[string]interface{}{
		"message": "Hello, observable world!",
	})

	task := &pb.TaskMessage{
		TaskId:   "task-123",
		TaskType: "greeting",
		Data:     data,
		Priority: pb.Priority_MEDIUM,
	}

	// Automatically traced and metered
	if err := publisher.PublishTask(ctx, task); err != nil {
		client.Logger.ErrorContext(ctx, "Failed to publish task", "error", err)
	} else {
		client.Logger.InfoContext(ctx, "Task published successfully", "task_id", task.TaskId)
	}
}

Subscriber Example with Observability

package main

import (
	"context"
	"os"
	"os/signal"
	"syscall"

	pb "github.com/owulveryck/agenthub/events/a2a"
	"github.com/owulveryck/agenthub/internal/agenthub"
	"google.golang.org/protobuf/types/known/structpb"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Observable client setup
	config := agenthub.NewGRPCConfig("subscriber")
	config.HealthPort = "8082"

	client, err := agenthub.NewAgentHubClient(config)
	if err != nil {
		panic(err)
	}
	defer client.Shutdown(context.Background())

	if err := client.Start(ctx); err != nil {
		panic(err)
	}

	// Create observable task subscriber
	subscriber := agenthub.NewTaskSubscriber(client, "my-subscriber")

	// Register handler with automatic tracing
	subscriber.RegisterHandler("greeting", func(ctx context.Context, task *pb.TaskMessage) (*structpb.Struct, pb.TaskStatus, string) {
		// This is automatically traced and logged
		client.Logger.InfoContext(ctx, "Processing greeting task", "task_id", task.TaskId)

		// Your processing logic
		result, _ := structpb.NewStruct(map[string]interface{}{
			"response": "Hello back!",
		})

		return result, pb.TaskStatus_COMPLETED, ""
	})

	// Start processing with automatic observability
	go subscriber.StartProcessing(ctx)

	// Graceful shutdown
	sigChan := make(chan os.Signal, 1)
	signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)
	<-sigChan
}

Configuration Reference

📖 Complete Reference: For all environment variables and configuration options, see Environment Variables Reference

Key Environment Variables

Variable             | Description              | Default
---------------------|--------------------------|----------------------
JAEGER_ENDPOINT      | Jaeger tracing endpoint  | "" (tracing disabled)
SERVICE_NAME         | Service name for tracing | "agenthub-service"
SERVICE_VERSION      | Service version          | "1.0.0"
BROKER_HEALTH_PORT   | Health endpoint port     | "8080"
AGENTHUB_BROKER_ADDR | Broker address           | "localhost"
AGENTHUB_BROKER_PORT | Broker port              | "50051"

Health Endpoints

Each agent exposes these endpoints:

Endpoint | Purpose               | Response
---------|-----------------------|----------------
/health  | Overall health status | JSON status
/metrics | Prometheus metrics    | Metrics format
/ready   | Readiness check       | 200 OK or 503

Troubleshooting

Common Issues

Issue                          | Solution
-------------------------------|-------------------------------------------------------
No traces in Jaeger            | Set the JAEGER_ENDPOINT environment variable
Health endpoint not accessible | Check that BROKER_HEALTH_PORT is unique
Logs not structured            | Use client.Logger instead of the standard log package
Missing correlation IDs        | Pass a context.Context through all operations

Verification Steps

  1. Check health endpoint:

    curl http://localhost:8083/health
    
  2. Verify metrics:

    curl http://localhost:8083/metrics
    
  3. Check traces in Jaeger:

    • Open http://localhost:16686
    • Search for your service name

Migration from Manual Setup

If you have existing agents using manual observability setup:

Old Approach (Manual)

// 50+ lines of OpenTelemetry setup
obs, err := observability.NewObservability(config)
traceManager := observability.NewTraceManager(serviceName)
// Manual gRPC client setup
// Manual health server setup

New Approach (Unified)

// 3 lines - everything automatic
config := agenthub.NewGRPCConfig("my-agent")
client, err := agenthub.NewAgentHubClient(config)
client.Start(ctx)

The unified abstractions provide the same observability features with 90% less code and no manual setup required.


With AgentHub’s unified abstractions, observability is no longer an add-on feature but a built-in capability that comes automatically with every agent. Focus on your business logic while the platform handles monitoring, tracing, and health checks for you.

2 - How to Use Grafana Dashboards

Master the AgentHub observability dashboards to monitor, analyze, and troubleshoot your event-driven system effectively.

How to Use Grafana Dashboards

Goal-oriented guide: Master the AgentHub observability dashboards to monitor, analyze, and troubleshoot your event-driven system effectively.

Prerequisites

  • AgentHub observability stack running (docker-compose up -d)
  • AgentHub agents running with observability enabled
  • Basic understanding of metrics concepts
  • 10-15 minutes

Quick Access

  • Grafana Dashboard: http://localhost:3333 (admin/admin)
  • Direct Dashboard: http://localhost:3333/d/agenthub-eda-dashboard

Dashboard Overview

The AgentHub EDA System Observatory provides comprehensive monitoring across three main areas:

  1. Event Metrics (Top Row) - Event processing performance
  2. Distributed Tracing (Middle) - Request flow visualization
  3. System Health (Bottom Row) - Infrastructure monitoring

Panel-by-Panel Guide

🚀 Event Processing Rate (Top Left)

What it shows: Events processed per second by each service

How to use:

  • Monitor throughput: See how many events your system processes
  • Identify bottlenecks: Low rates may indicate performance issues
  • Compare services: See which agents are busiest

Reading the chart:

Green line: agenthub-broker (150 events/sec)
Blue line:  agenthub-publisher (50 events/sec)
Red line:   agenthub-subscriber (145 events/sec)
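
To reproduce this panel in Explore or a custom panel, a query along these lines is a reasonable starting point. This is a sketch that assumes the events_processed_total counter used in the alerting example later in this guide and a service label on it; your dashboard's exact query and label names may differ:

# Per-service event throughput over the last 5 minutes
sum by (service) (rate(events_processed_total[5m]))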

Troubleshooting:

  • Flat lines: No activity - check if agents are running
  • Dropping rates: Performance degradation - check CPU/memory
  • Spiky patterns: Bursty workloads - consider load balancing

🚨 Event Processing Error Rate (Top Right)

What it shows: Percentage of events that failed processing

How to use:

  • Monitor reliability: Should stay below 5% (green zone)
  • Alert threshold: Yellow above 5%, red above 10%
  • Quick health check: At-a-glance view of system reliability

Color coding:

  • Green (0-5%): Healthy system
  • Yellow (5-10%): Moderate issues
  • Red (>10%): Critical problems
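
The percentage shown here is the same ratio used in the alerting example later in this guide, so you can run it directly in Explore (assuming the events_processed_total and event_errors_total counters from that example):

# Percentage of failed events over the last 5 minutes
rate(event_errors_total[5m]) / rate(events_processed_total[5m]) * 100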

Troubleshooting:

  • High error rates: Check Jaeger for failing traces
  • Sudden spikes: Look for recent deployments or config changes
  • Persistent errors: Check logs for recurring issues

📈 Event Types Distribution (Middle Left)

What it shows: Breakdown of event types by volume

How to use:

  • Understand workload: See what types of tasks dominate
  • Capacity planning: Identify which task types need scaling
  • Anomaly detection: Unusual distributions may indicate issues

Example interpretation:

greeting: 40% (blue) - Most common task type
math_calculation: 35% (green) - Heavy computational tasks
random_number: 20% (yellow) - Quick tasks
unknown_task: 5% (red) - Error-generating tasks

Troubleshooting:

  • Missing task types: Check if specific agents are down
  • Unexpected distributions: May indicate upstream issues
  • Dominant error types: Focus optimization efforts

⏱️ Event Processing Latency (Middle Right)

What it shows: Processing time percentiles (p50, p95, p99)

How to use:

  • Performance monitoring: Track how fast events are processed
  • SLA compliance: Ensure latencies meet requirements
  • Outlier detection: p99 shows worst-case scenarios

Understanding percentiles:

  • p50 (median): 50% of events process faster than this
  • p95: 95% of events process faster than this
  • p99: 99% of events process faster than this

Healthy ranges:

  • p50: < 50ms (very responsive)
  • p95: < 200ms (good performance)
  • p99: < 500ms (acceptable outliers)
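
To query a percentile yourself, a histogram_quantile sketch like the following is the usual approach. The bucket metric name here (event_processing_duration_seconds_bucket) is illustrative and may not match the exact histogram exported by your AgentHub version:

# Approximate p95 processing latency over the last 5 minutes
histogram_quantile(0.95, sum by (le) (rate(event_processing_duration_seconds_bucket[5m])))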

Troubleshooting:

  • Rising latencies: Check CPU/memory usage
  • High p99: Look for resource contention or long-running tasks
  • Flatlined metrics: May indicate measurement issues

🔍 Distributed Traces (Middle Section)

What it shows: Integration with Jaeger for trace visualization

How to use:

  1. Click “Explore” to open Jaeger
  2. Select service from dropdown
  3. Find specific traces to debug issues
  4. Analyze request flows across services

When to use:

  • Debugging errors: Find root cause of failures
  • Performance analysis: Identify slow operations
  • Understanding flows: See complete request journeys

🖥️ Service CPU Usage (Bottom Left)

What it shows: CPU utilization by service

How to use:

  • Capacity monitoring: Ensure services aren’t overloaded
  • Resource planning: Identify when to scale
  • Performance correlation: High CPU often explains high latency

Healthy ranges:

  • < 50%: Comfortable utilization
  • 50-70%: Moderate load
  • > 70%: Consider scaling
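
A rough equivalent query you can adapt, assuming the standard process_cpu_seconds_total counter exported by the Go Prometheus client (filter by job or instance labels as needed):

# Approximate CPU utilization as a percentage of one core
rate(process_cpu_seconds_total[5m]) * 100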

💾 Service Memory Usage (Bottom Center)

What it shows: Memory consumption by service

How to use:

  • Memory leak detection: Watch for continuously growing usage
  • Capacity planning: Ensure sufficient memory allocation
  • Garbage collection: High usage may impact performance

Monitoring tips:

  • Steady growth: May indicate memory leaks
  • Sawtooth pattern: Normal GC behavior
  • Sudden spikes: Check for large event batches
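
Two standard gauges from the Go Prometheus client are useful here; both are exported automatically and can be queried as-is (add job or instance label filters as needed):

# Resident memory of the process
process_resident_memory_bytes

# Heap bytes currently allocated and in use by the Go runtime
go_memstats_alloc_bytes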

🧵 Go Goroutines (Bottom Right)

What it shows: Number of concurrent goroutines per service

How to use:

  • Concurrency monitoring: Track parallel processing
  • Resource leak detection: Continuously growing numbers indicate leaks
  • Performance tuning: Optimize concurrency levels

Normal patterns:

  • Stable baseline: Normal operation
  • Activity spikes: During high load
  • Continuous growth: Potential goroutine leaks
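
The underlying gauge is the standard go_goroutines metric from the Go Prometheus client; a persistently positive slope is a common leak signal (the 30-minute window below is just an example):

# Current goroutine count per instance
go_goroutines

# Per-second growth rate over 30 minutes; persistently positive values can hint at a leak
deriv(go_goroutines[30m])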

🏥 Service Health Status (Bottom Far Right)

What it shows: Up/down status of each service

How to use:

  • Quick status check: See if all services are running
  • Outage detection: Immediately identify down services
  • Health monitoring: Green = UP, Red = DOWN
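
This kind of panel is typically driven by Prometheus's built-in up metric, which is 1 when the last scrape of a target succeeded and 0 when it failed. The job regex below is an example and depends on how your scrape jobs are named:

# 1 = target scraped successfully, 0 = scrape failed
up{job=~"agenthub.*"}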

Dashboard Variables and Filters

Service Filter

Location: Top of dashboard
Purpose: Filter metrics by specific services
Usage:

  • Select “All” to see everything
  • Choose specific services to focus analysis
  • Useful for isolating problems to specific components

Event Type Filter

Location: Top of dashboard
Purpose: Filter by event/task types
Usage:

  • Analyze specific workflow types
  • Debug particular task categories
  • Compare performance across task types

Time Range Selector

Location: Top right of dashboard
Purpose: Control time window for analysis
Common ranges:

  • 5 minutes: Real-time monitoring
  • 1 hour: Recent trend analysis
  • 24 hours: Daily pattern analysis
  • 7 days: Weekly trend and capacity planning

Advanced Usage Patterns

Performance Investigation Workflow

  1. Start with Overview:

    • Check error rates (should be < 5%)
    • Verify processing rates look normal
    • Scan for any red/yellow indicators
  2. Drill Down on Issues:

    • If high error rates → check distributed traces
    • If high latency → examine CPU/memory usage
    • If low throughput → check service health
  3. Root Cause Analysis:

    • Use time range selector to find when problems started
    • Filter by specific services to isolate issues
    • Correlate metrics across different panels

Capacity Planning Workflow

  1. Analyze Peak Patterns:

    • Set time range to 7 days
    • Identify peak usage periods
    • Note maximum throughput achieved
  2. Resource Utilization:

    • Check CPU usage during peaks
    • Monitor memory consumption trends
    • Verify goroutine scaling behavior
  3. Plan Scaling:

    • If CPU > 70% during peaks, scale up
    • If memory continuously growing, investigate leaks
    • If error rates spike during load, optimize before scaling

Troubleshooting Workflow

  1. Identify Symptoms:

    • High error rates: Focus on traces and logs
    • High latency: Check resource utilization
    • Low throughput: Verify service health
  2. Time Correlation:

    • Use time range to find when issues started
    • Look for correlated changes across metrics
    • Check for deployment or configuration changes
  3. Service Isolation:

    • Use service filter to identify problematic components
    • Compare healthy vs unhealthy services
    • Check inter-service dependencies

Dashboard Customization

Adding New Panels

  1. Click “+ Add panel” in top menu
  2. Choose visualization type:
    • Time series for trends
    • Stat for current values
    • Gauge for thresholds
  3. Configure query:
    # Example: Custom error rate
    rate(my_custom_errors_total[5m]) / rate(my_custom_requests_total[5m]) * 100
    

Creating Alerts

  1. Edit existing panel or create new one
  2. Click “Alert” tab
  3. Configure conditions:
    Query: rate(event_errors_total[5m]) / rate(events_processed_total[5m]) * 100
    Condition: IS ABOVE 5
    Evaluation: Every 1m for 2m
    
  4. Set notification channels

Custom Time Ranges

  1. Click time picker (top right)
  2. Select “Custom range”
  3. Set specific dates/times for historical analysis
  4. Use “Refresh” settings for auto-updating

Troubleshooting Dashboard Issues

Dashboard Not Loading

# Check Grafana status
docker-compose ps grafana

# Check Grafana logs
docker-compose logs grafana

# Restart if needed
docker-compose restart grafana

No Data in Panels

# Check Prometheus connection
curl http://localhost:9090/api/v1/targets

# Verify agents are exposing metrics
curl http://localhost:8080/metrics
curl http://localhost:8081/metrics
curl http://localhost:8082/metrics

# Check Prometheus configuration
docker-compose logs prometheus

Slow Dashboard Performance

  1. Reduce time range: Use shorter windows for better performance
  2. Limit service selection: Filter to specific services
  3. Optimize queries: Use appropriate rate intervals
  4. Check resource usage: Ensure Prometheus has enough memory

Authentication Issues

  • Default credentials: admin/admin
  • Reset password: Through Grafana UI after first login
  • Lost access: Restart Grafana container to reset

Best Practices

Regular Monitoring

  • Check dashboard daily: Quick health overview
  • Weekly reviews: Trend analysis and capacity planning
  • Set up alerts: Proactive monitoring for critical metrics

Performance Optimization

  • Use appropriate time ranges: Don’t query more data than needed
  • Filter effectively: Use service and event type filters
  • Refresh intervals: Balance real-time needs with performance

Team Usage

  • Share dashboard URLs: Bookmark specific views
  • Create annotations: Mark deployments and incidents
  • Export snapshots: Share findings with team members

Integration with Other Tools

Jaeger Integration

  • Click Explore in traces panel
  • Auto-links to Jaeger with service context
  • Correlate traces with metrics timeframes

Prometheus Integration

  • Click Explore on any panel
  • Edit queries in Prometheus query language
  • Access raw metrics for custom analysis

Log Correlation

  • Use trace IDs from Jaeger
  • Search logs for matching trace IDs
  • Correlate log events with metric spikes

🎯 Next Steps:

Deep Debugging: Debug with Distributed Tracing

Production Setup: Configure Alerts

Understanding: Observability Architecture Explained