1 - How to Add Observability to Your Agent
Goal-oriented guide: Use AgentHub’s unified abstractions to automatically get distributed tracing, metrics, and structured logging in your agents with minimal configuration.
Prerequisites
- Go 1.24+ installed
- Basic understanding of AgentHub concepts
- 10-15 minutes
Overview: What You Get Automatically
With AgentHub’s unified abstractions, you automatically get:
✅ Distributed Tracing - OpenTelemetry traces with correlation IDs
✅ Comprehensive Metrics - Performance and health monitoring
✅ Structured Logging - JSON logs with trace correlation
✅ Health Endpoints - HTTP health checks and metrics endpoints
✅ Graceful Shutdown - Clean resource management
Quick Start: Observable Agent in 5 Minutes
Step 1: Create Your Agent Using Abstractions
package main
import (
	"context"
	"time"
	"github.com/owulveryck/agenthub/internal/agenthub"
)
func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	// Create configuration (observability included automatically)
	config := agenthub.NewGRPCConfig("my-agent")
	config.HealthPort = "8083" // Unique port for your agent
	// Create AgentHub client (observability built-in)
	client, err := agenthub.NewAgentHubClient(config)
	if err != nil {
		panic("Failed to create AgentHub client: " + err.Error())
	}
	// Automatic graceful shutdown
	defer func() {
		shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 10*time.Second)
		defer shutdownCancel()
		if err := client.Shutdown(shutdownCtx); err != nil {
			client.Logger.ErrorContext(shutdownCtx, "Error during shutdown", "error", err)
		}
	}()
	// Start the client (enables observability)
	if err := client.Start(ctx); err != nil {
		client.Logger.ErrorContext(ctx, "Failed to start client", "error", err)
		panic(err)
	}
	// Your agent logic here...
	client.Logger.Info("My observable agent is running!")
	// Keep running
	select {}
}
That’s it! Your agent now has full observability.
Step 2: Configure Observability (Optional)
Set observability configuration via environment variables:
# Tracing configuration
export JAEGER_ENDPOINT="http://localhost:14268/api/traces"
export OTEL_SERVICE_NAME="my-agent"
export OTEL_SERVICE_VERSION="1.0.0"
# Health server port
export BROKER_HEALTH_PORT="8083"
# Broker connection
export AGENTHUB_BROKER_ADDR="localhost"
export AGENTHUB_BROKER_PORT="50051"
Step 3: Run Your Observable Agent
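Assuming the main package above is saved in its own directory (the path is up to you; it is not prescribed by the AgentHub repo layout), run it like any other Go program:
# From the directory containing main.go
go run .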
Expected Output:
time=2025-09-29T10:00:00.000Z level=INFO msg="Starting health server" port=8083
time=2025-09-29T10:00:00.000Z level=INFO msg="AgentHub client connected" broker_addr=localhost:50051
time=2025-09-29T10:00:00.000Z level=INFO msg="My observable agent is running!"
Available Observability Features
Automatic Health Endpoints
Your agent automatically exposes:
- Health Check: http://localhost:8083/health
- Metrics: http://localhost:8083/metrics (Prometheus format)
- Readiness: http://localhost:8083/ready
Structured Logging
All logs are automatically structured with trace correlation:
{
  "time": "2025-09-29T10:00:00.000Z",
  "level": "INFO",
  "msg": "Task published",
  "trace_id": "abc123...",
  "span_id": "def456...",
  "task_type": "process_document",
  "correlation_id": "req_789"
}
Distributed Tracing
Traces are automatically created for:
- gRPC calls to broker
- Task publishing and subscribing
- Custom operations (when you use the TraceManager)
Metrics Collection
Automatic metrics include:
- Task processing duration
- Success/failure rates
- gRPC call metrics
- Health check status
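To spot-check these metrics without Grafana, you can scrape the agent’s Prometheus endpoint directly. The metric name grepped for below (events_processed_total) is taken from the alerting example later in this guide; treat it as illustrative, since your build may expose different names:
# Inspect the raw metrics exposed on the agent's health port
curl -s http://localhost:8083/metrics | grep -i events_processed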
Advanced Usage
Adding Custom Tracing
Use the built-in TraceManager for custom operations:
// Custom operation with tracing. This snippet also requires the OpenTelemetry
// packages go.opentelemetry.io/otel/attribute and go.opentelemetry.io/otel/codes.
ctx, span := client.TraceManager.StartPublishSpan(ctx, "my_operation", "document")
defer span.End()
// Add custom attributes
client.TraceManager.AddComponentAttribute(span, "my-component")
span.SetAttributes(attribute.String("document.id", "doc-123"))
// Your operation logic
result, err := doCustomOperation(ctx)
if err != nil {
    span.RecordError(err)
    span.SetStatus(codes.Error, err.Error())
}
Adding Custom Metrics
Use the MetricsManager for custom metrics:
// Start timing an operation
timer := client.MetricsManager.StartTimer()
defer timer(ctx, "my_operation", "my-component")
// Your operation
processDocument()
Custom Log Fields
Use the structured logger with context:
client.Logger.InfoContext(ctx, "Processing document",
    "document_id", "doc-123",
    "user_id", "user-456",
    "processing_type", "ocr",
)
Publisher Example with Observability
package main
import (
	"context"
	"time"
	"github.com/owulveryck/agenthub/internal/agenthub"
	pb "github.com/owulveryck/agenthub/events/a2a"
	"google.golang.org/protobuf/types/known/structpb"
)
func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	// Observable client setup
	config := agenthub.NewGRPCConfig("publisher")
	config.HealthPort = "8081"
	client, err := agenthub.NewAgentHubClient(config)
	if err != nil {
		panic(err)
	}
	defer client.Shutdown(context.Background())
	if err := client.Start(ctx); err != nil {
		panic(err)
	}
	// Create observable task publisher
	publisher := &agenthub.TaskPublisher{
		Client:         client.Client,
		TraceManager:   client.TraceManager,
		MetricsManager: client.MetricsManager,
		Logger:         client.Logger,
		ComponentName:  "publisher",
	}
	// Publish task with automatic tracing
	data, _ := structpb.NewStruct(map[string]interface{}{
		"message": "Hello, observable world!",
	})
	task := &pb.TaskMessage{
		TaskId:   "task-123",
		TaskType: "greeting",
		Data:     data,
		Priority: pb.Priority_MEDIUM,
	}
	// Automatically traced and metered
	if err := publisher.PublishTask(ctx, task); err != nil {
		client.Logger.ErrorContext(ctx, "Failed to publish task", "error", err)
	} else {
		client.Logger.InfoContext(ctx, "Task published successfully", "task_id", task.TaskId)
	}
}
Subscriber Example with Observability
package main
import (
	"context"
	"os"
	"os/signal"
	"syscall"
	"github.com/owulveryck/agenthub/internal/agenthub"
	pb "github.com/owulveryck/agenthub/events/a2a"
	"google.golang.org/protobuf/types/known/structpb"
)
func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	// Observable client setup
	config := agenthub.NewGRPCConfig("subscriber")
	config.HealthPort = "8082"
	client, err := agenthub.NewAgentHubClient(config)
	if err != nil {
		panic(err)
	}
	defer client.Shutdown(context.Background())
	if err := client.Start(ctx); err != nil {
		panic(err)
	}
	// Create observable task subscriber
	subscriber := agenthub.NewTaskSubscriber(client, "my-subscriber")
	// Register handler with automatic tracing
	subscriber.RegisterHandler("greeting", func(ctx context.Context, task *pb.TaskMessage) (*structpb.Struct, pb.TaskStatus, string) {
		// This is automatically traced and logged
		client.Logger.InfoContext(ctx, "Processing greeting task", "task_id", task.TaskId)
		// Your processing logic
		result, _ := structpb.NewStruct(map[string]interface{}{
			"response": "Hello back!",
		})
		return result, pb.TaskStatus_COMPLETED, ""
	})
	// Start processing with automatic observability
	go subscriber.StartProcessing(ctx)
	// Graceful shutdown
	sigChan := make(chan os.Signal, 1)
	signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)
	<-sigChan
}
Configuration Reference
📖 Complete Reference: For all environment variables and configuration options, see Environment Variables Reference
Key Environment Variables
| Variable | Description | Default |
|---|---|---|
| JAEGER_ENDPOINT | Jaeger tracing endpoint | "" (tracing disabled) |
| SERVICE_NAME | Service name for tracing | "agenthub-service" |
| SERVICE_VERSION | Service version | "1.0.0" |
| BROKER_HEALTH_PORT | Health endpoint port | "8080" |
| AGENTHUB_BROKER_ADDR | Broker address | "localhost" |
| AGENTHUB_BROKER_PORT | Broker port | "50051" |
Health Endpoints
Each agent exposes these endpoints:
| Endpoint | Purpose | Response |
|---|---|---|
| /health | Overall health status | JSON status |
| /metrics | Prometheus metrics | Metrics format |
| /ready | Readiness check | 200 OK or 503 |
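The readiness endpoint can be checked from a shell with plain curl (using the port configured earlier, 8083); it returns 200 when the agent is ready to serve and 503 otherwise:
# Print only the HTTP status code of the readiness probe
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8083/ready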
Troubleshooting
Common Issues
| Issue | Solution |
|---|---|
| No traces in Jaeger | Set the JAEGER_ENDPOINT environment variable |
| Health endpoint not accessible | Check that BROKER_HEALTH_PORT is unique per agent |
| Logs not structured | Use client.Logger instead of the standard log package |
| Missing correlation IDs | Pass context.Context through all operations |
Verification Steps
- Check the health endpoint: curl http://localhost:8083/health
- Verify metrics: curl http://localhost:8083/metrics
- Check traces in Jaeger: open http://localhost:16686 and search for your service name
Migration from Manual Setup
If you have existing agents using manual observability setup:
Old Approach (Manual)
// 50+ lines of OpenTelemetry setup
obs, err := observability.NewObservability(config)
traceManager := observability.NewTraceManager(serviceName)
// Manual gRPC client setup
// Manual health server setup
New Approach (Unified)
// 3 lines - everything automatic
config := agenthub.NewGRPCConfig("my-agent")
client, err := agenthub.NewAgentHubClient(config)
client.Start(ctx)
The unified abstractions provide the same observability features with 90% less code and no manual setup required.
With AgentHub’s unified abstractions, observability is no longer an add-on feature but a built-in capability that comes automatically with every agent. Focus on your business logic while the platform handles monitoring, tracing, and health checks for you.
2 - How to Use Grafana Dashboards
Goal-oriented guide: Master the AgentHub observability dashboards to monitor, analyze, and troubleshoot your event-driven system effectively.
Prerequisites
- AgentHub observability stack running (docker-compose up -d)
- AgentHub agents running with observability enabled
- Basic understanding of metrics concepts
- 10-15 minutes
Quick Access
- Grafana Dashboard: http://localhost:3333 (admin/admin)
- Direct Dashboard: http://localhost:3333/d/agenthub-eda-dashboard
Dashboard Overview
The AgentHub EDA System Observatory provides comprehensive monitoring across three main areas:
- Event Metrics (Top Row) - Event processing performance
- Distributed Tracing (Middle) - Request flow visualization
- System Health (Bottom Row) - Infrastructure monitoring
Panel-by-Panel Guide
🚀 Event Processing Rate (Top Left)
What it shows: Events processed per second by each service
How to use:
- Monitor throughput: See how many events your system processes
- Identify bottlenecks: Low rates may indicate performance issues
- Compare services: See which agents are busiest
Reading the chart:
Green line: agenthub-broker (150 events/sec)
Blue line:  agenthub-publisher (50 events/sec)
Red line:   agenthub-subscriber (145 events/sec)
Troubleshooting:
- Flat lines: No activity - check if agents are running
- Dropping rates: Performance degradation - check CPU/memory
- Spiky patterns: Bursty workloads - consider load balancing
🚨 Event Processing Error Rate (Top Right)
What it shows: Percentage of events that failed processing
How to use:
- Monitor reliability: Should stay below 5% (green zone)
- Alert threshold: Yellow above 5%, red above 10%
- Quick health check: Single glance system reliability
Color coding:
- Green (0-5%): Healthy system
- Yellow (5-10%): Moderate issues
- Red (>10%): Critical problems
Troubleshooting:
- High error rates: Check Jaeger for failing traces
- Sudden spikes: Look for recent deployments or config changes
- Persistent errors: Check logs for recurring issues
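If you want to run the same error-rate calculation ad hoc in Prometheus or in a panel of your own, it follows the pattern used in the alerting example later in this guide (metric names assumed to match what your agents export):
# Percentage of events that failed processing over the last 5 minutes
rate(event_errors_total[5m]) / rate(events_processed_total[5m]) * 100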
📈 Event Types Distribution (Middle Left)
What it shows: Breakdown of event types by volume
How to use:
- Understand workload: See what types of tasks dominate
- Capacity planning: Identify which task types need scaling
- Anomaly detection: Unusual distributions may indicate issues
Example interpretation:
greeting: 40% (blue) - Most common task type
math_calculation: 35% (green) - Heavy computational tasks
random_number: 20% (yellow) - Quick tasks
unknown_task: 5% (red) - Error-generating tasks
Troubleshooting:
- Missing task types: Check if specific agents are down
- Unexpected distributions: May indicate upstream issues
- Dominant error types: Focus optimization efforts
⏱️ Event Processing Latency (Middle Right)
What it shows: Processing time percentiles (p50, p95, p99)
How to use:
- Performance monitoring: Track how fast events are processed
- SLA compliance: Ensure latencies meet requirements
- Outlier detection: p99 shows worst-case scenarios
Understanding percentiles:
- p50 (median): 50% of events process faster than this
- p95: 95% of events process faster than this
- p99: 99% of events process faster than this
Healthy ranges:
- p50: < 50ms (very responsive)
- p95: < 200ms (good performance)
- p99: < 500ms (acceptable outliers)
Troubleshooting:
- Rising latencies: Check CPU/memory usage
- High p99: Look for resource contention or long-running tasks
- Flatlined metrics: May indicate measurement issues
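Percentile panels like this one are normally built with histogram_quantile over a duration histogram. The metric name below is hypothetical; substitute whatever latency histogram your agents actually export:
# p95 processing latency, assuming a histogram named event_processing_duration_seconds
histogram_quantile(0.95, sum(rate(event_processing_duration_seconds_bucket[5m])) by (le, service))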
🔍 Distributed Traces (Middle Section)
What it shows: Integration with Jaeger for trace visualization
How to use:
- Click “Explore” to open Jaeger
- Select service from dropdown
- Find specific traces to debug issues
- Analyze request flows across services
When to use:
- Debugging errors: Find root cause of failures
- Performance analysis: Identify slow operations
- Understanding flows: See complete request journeys
🖥️ Service CPU Usage (Bottom Left)
What it shows: CPU utilization by service
How to use:
- Capacity monitoring: Ensure services aren’t overloaded
- Resource planning: Identify when to scale
- Performance correlation: High CPU often explains high latency
Healthy ranges:
- < 50%: Comfortable utilization
- 50-70%: Moderate load
- > 70%: Consider scaling
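If you need the raw query, CPU panels for Go services are typically based on the standard process collector metric (assuming your agents expose it):
# Per-service CPU utilization as a percentage of one core
rate(process_cpu_seconds_total[5m]) * 100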
💾 Service Memory Usage (Bottom Center)
What it shows: Memory consumption by service
How to use:
- Memory leak detection: Watch for continuously growing usage
- Capacity planning: Ensure sufficient memory allocation
- Garbage collection: High usage may impact performance
Monitoring tips:
- Steady growth: May indicate memory leaks
- Sawtooth pattern: Normal GC behavior
- Sudden spikes: Check for large event batches
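For ad hoc memory checks, the standard process and Go runtime collectors expose resident and heap sizes (again assuming your agents include these collectors):
# Resident memory and Go heap allocation per service
process_resident_memory_bytes
go_memstats_heap_alloc_bytes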
🧵 Go Goroutines (Bottom Right)
What it shows: Number of concurrent goroutines per service
How to use:
- Concurrency monitoring: Track parallel processing
- Resource leak detection: Continuously growing numbers indicate leaks
- Performance tuning: Optimize concurrency levels
Normal patterns:
- Stable baseline: Normal operation
- Activity spikes: During high load
- Continuous growth: Potential goroutine leaks
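The panel is backed by the standard Go runtime gauge, which you can also query directly to confirm a suspected leak, for example by comparing the current value against an hour ago:
# Goroutine growth over the last hour; a persistently positive delta suggests a leak
go_goroutines - go_goroutines offset 1h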
🏥 Service Health Status (Bottom Far Right)
What it shows: Up/down status of each service
How to use:
- Quick status check: See if all services are running
- Outage detection: Immediately identify down services
- Health monitoring: Green = UP, Red = DOWN
Dashboard Variables and Filters
Service Filter
Location: Top of dashboard
Purpose: Filter metrics by specific services
Usage:
- Select “All” to see everything
- Choose specific services to focus analysis
- Useful for isolating problems to specific components
Event Type Filter
Location: Top of dashboard
Purpose: Filter by event/task types
Usage:
- Analyze specific workflow types
- Debug particular task categories
- Compare performance across task types
Time Range Selector
Location: Top right of dashboard
Purpose: Control time window for analysis
Common ranges:
- 5 minutes: Real-time monitoring
- 1 hour: Recent trend analysis
- 24 hours: Daily pattern analysis
- 7 days: Weekly trend and capacity planning
Advanced Usage Patterns
Daily Health Check Workflow
- Start with Overview:
  - Check error rates (should be < 5%)
  - Verify processing rates look normal
  - Scan for any red/yellow indicators
- Drill Down on Issues:
  - If high error rates → check distributed traces
  - If high latency → examine CPU/memory usage
  - If low throughput → check service health
- Root Cause Analysis:
  - Use the time range selector to find when problems started
  - Filter by specific services to isolate issues
  - Correlate metrics across different panels
Capacity Planning Workflow
- Analyze Peak Patterns:
  - Set the time range to 7 days
  - Identify peak usage periods
  - Note the maximum throughput achieved
- Resource Utilization:
  - Check CPU usage during peaks
  - Monitor memory consumption trends
  - Verify goroutine scaling behavior
- Plan Scaling:
  - If CPU > 70% during peaks, scale up
  - If memory is continuously growing, investigate leaks
  - If error rates spike during load, optimize before scaling
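To put a number on peak throughput over the week, a Prometheus subquery like the following works well (metric name as used elsewhere in this guide; adjust to your build):
# Highest 5-minute event rate observed over the past 7 days
max_over_time(sum(rate(events_processed_total[5m]))[7d:5m])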
Troubleshooting Workflow
- Identify Symptoms:
  - High error rates: focus on traces and logs
  - High latency: check resource utilization
  - Low throughput: verify service health
- Time Correlation:
  - Use the time range to find when issues started
  - Look for correlated changes across metrics
  - Check for deployment or configuration changes
- Service Isolation:
  - Use the service filter to identify problematic components
  - Compare healthy vs unhealthy services
  - Check inter-service dependencies
Dashboard Customization
Adding New Panels
- Click “+ Add panel” in the top menu
- Choose a visualization type:
  - Time series for trends
  - Stat for current values
  - Gauge for thresholds
- Configure the query:
  # Example: Custom error rate
  rate(my_custom_errors_total[5m]) / rate(my_custom_requests_total[5m]) * 100
Creating Alerts
- Edit an existing panel or create a new one
- Click the “Alert” tab
- Configure conditions:
  Query: rate(event_errors_total[5m]) / rate(events_processed_total[5m]) * 100
  Condition: IS ABOVE 5
  Evaluation: Every 1m for 2m
- Set notification channels
Custom Time Ranges
- Click time picker (top right)
- Select “Custom range”
- Set specific dates/times for historical analysis
- Use “Refresh” settings for auto-updating
Troubleshooting Dashboard Issues
Dashboard Not Loading
# Check Grafana status
docker-compose ps grafana
# Check Grafana logs
docker-compose logs grafana
# Restart if needed
docker-compose restart grafana
No Data in Panels
# Check Prometheus connection
curl http://localhost:9090/api/v1/targets
# Verify agents are exposing metrics
curl http://localhost:8080/metrics
curl http://localhost:8081/metrics
curl http://localhost:8082/metrics
# Check Prometheus configuration
docker-compose logs prometheus
Slow Dashboard Performance
- Reduce time range: Use shorter windows for better performance
- Limit service selection: Filter to specific services
- Optimize queries: Use appropriate rate intervals
- Check resource usage: Ensure Prometheus has enough memory
Authentication Issues
- Default credentials: admin/admin
- Reset password: Through Grafana UI after first login
- Lost access: Restart Grafana container to reset
Best Practices
Regular Monitoring
- Check dashboard daily: Quick health overview
- Weekly reviews: Trend analysis and capacity planning
- Set up alerts: Proactive monitoring for critical metrics
Performance Optimization
- Use appropriate time ranges: Don’t query more data than needed
- Filter effectively: Use service and event type filters
- Refresh intervals: Balance real-time needs with performance
Team Usage
- Share dashboard URLs: Bookmark specific views
- Create annotations: Mark deployments and incidents
- Export snapshots: Share findings with team members
Jaeger Integration
- Click Explore in traces panel
- Auto-links to Jaeger with service context
- Correlate traces with metrics timeframes
Prometheus Integration
- Click Explore on any panel
- Edit queries in Prometheus query language
- Access raw metrics for custom analysis
Log Correlation
- Use trace IDs from Jaeger
- Search logs for matching trace IDs
- Correlate log events with metric spikes
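A minimal sketch of that correlation from a shell, assuming the agents log to stdout under docker-compose and the service name matches your compose file:
# Find all structured log lines carrying a trace ID copied from Jaeger
docker-compose logs agenthub-subscriber | grep '"trace_id":"<paste-trace-id-here>"'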
🎯 Next Steps:
Deep Debugging: Debug with Distributed Tracing
Production Setup: Configure Alerts
Understanding: Observability Architecture Explained