1 - AgentHub Health Endpoints Reference
Complete documentation for AgentHub’s health monitoring APIs, endpoint specifications, status codes, and integration patterns.
Overview
Every observable AgentHub service exposes standardized health endpoints for monitoring, load balancing, and operational management.
Standard Endpoints
Health Check Endpoint
/health
Purpose: Comprehensive service health status
Method: GET
Port: Service-specific (8080-8083)
Response Format:
{
  "status": "healthy|degraded|unhealthy",
  "timestamp": "2025-09-28T21:00:00.000Z",
  "service": "agenthub-broker",
  "version": "1.0.0",
  "uptime": "2h34m12s",
  "checks": [
    {
      "name": "self",
      "status": "healthy",
      "message": "Service is running normally",
      "last_checked": "2025-09-28T21:00:00.000Z",
      "duration": "1.2ms"
    },
    {
      "name": "database_connection",
      "status": "healthy",
      "message": "Database connection is active",
      "last_checked": "2025-09-28T21:00:00.000Z",
      "duration": "15.6ms"
    }
  ]
}
Status Codes:
- 200 OK - All checks healthy
- 503 Service Unavailable - One or more checks unhealthy
- 500 Internal Server Error - Health check system failure
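Clients can consume this contract directly; note that a 503 response still carries a parseable body listing the failing checks. A minimal Go polling sketch, with the struct hand-written here to mirror the schema above:
package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

// HealthResponse mirrors the documented /health schema.
type HealthResponse struct {
    Status string `json:"status"`
    Checks []struct {
        Name    string `json:"name"`
        Status  string `json:"status"`
        Message string `json:"message"`
    } `json:"checks"`
}

func checkHealth(url string) (*HealthResponse, error) {
    client := &http.Client{Timeout: 5 * time.Second}
    resp, err := client.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    var hr HealthResponse
    if err := json.NewDecoder(resp.Body).Decode(&hr); err != nil {
        return nil, err
    }
    // 503 means one or more checks failed; surface the details.
    if resp.StatusCode == http.StatusServiceUnavailable {
        for _, c := range hr.Checks {
            if c.Status != "healthy" {
                fmt.Printf("check %q failing: %s\n", c.Name, c.Message)
            }
        }
    }
    return &hr, nil
}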
Readiness Endpoint
/ready
Purpose: Service readiness for traffic acceptance
Method: GET
Response Format:
{
  "ready": true,
  "timestamp": "2025-09-28T21:00:00.000Z",
  "service": "agenthub-broker",
  "dependencies": [
    {
      "name": "grpc_server",
      "ready": true,
      "message": "gRPC server listening on :50051"
    },
    {
      "name": "observability",
      "ready": true,
      "message": "OpenTelemetry initialized"
    }
  ]
}
Status Codes:
- 200 OK - Service ready for traffic
- 503 Service Unavailable - Service not ready
Metrics Endpoint
/metrics
Purpose: Prometheus metrics exposure
Method: GET
Content-Type: text/plain
Response Format:
# HELP events_processed_total Total number of events processed
# TYPE events_processed_total counter
events_processed_total{service="agenthub-broker",event_type="greeting",success="true"} 1234
# HELP system_cpu_usage_percent CPU usage percentage
# TYPE system_cpu_usage_percent gauge
system_cpu_usage_percent{service="agenthub-broker"} 23.4
Status Codes:
- 200 OK - Metrics available
- 500 Internal Server Error - Metrics collection failure
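AgentHub services expose this endpoint out of the box. If a custom agent outside the observability package needs the same thing, the stock Prometheus Go client can serve it; this is generic wiring, not how AgentHub does it internally (its metrics flow through OpenTelemetry):
package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // Serves every metric registered with the default registry
    // in the text exposition format shown above.
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8083", nil)
}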
Service-Specific Configurations
Broker (Port 8080)
Health Checks:
- self - Basic service health
- grpc_server - gRPC server status
- observability - OpenTelemetry health
Example URLs:
- Health: http://localhost:8080/health
- Ready: http://localhost:8080/ready
- Metrics: http://localhost:8080/metrics
Publisher (Port 8081)
Health Checks:
- self - Basic service health
- broker_connection - Connection to AgentHub broker
- observability - Tracing and metrics health
Example URLs:
- Health: http://localhost:8081/health
- Ready: http://localhost:8081/ready
- Metrics: http://localhost:8081/metrics
Subscriber (Port 8082)
Health Checks:
- self - Basic service health
- broker_connection - Connection to AgentHub broker
- task_processor - Task processing capability
- observability - Observability stack health
Example URLs:
- Health: http://localhost:8082/health
- Ready: http://localhost:8082/ready
- Metrics: http://localhost:8082/metrics
Custom Agents (Port 8083+)
Configurable Health Checks:
- Custom business logic checks
- External dependency checks
- Resource availability checks
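Wiring these up follows the same pattern as the built-in services, using the health server constructor and AddChecker API shown throughout this reference; licenseChecker stands in for any custom checker (see Custom Health Checkers below):
// Health server for a custom agent on port 8083.
healthServer := observability.NewHealthServer("8083", "my-custom-agent", "1.0.0")

// One standard dependency check plus one business-logic check.
healthServer.AddChecker("broker_connection",
    observability.NewGRPCHealthChecker("broker_connection", "localhost:50051"))
healthServer.AddChecker("license_validation", licenseChecker)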
Health Check Types
BasicHealthChecker
Purpose: Simple function-based health checks
Implementation:
checker := observability.NewBasicHealthChecker("database", func(ctx context.Context) error {
    return db.Ping()
})
healthServer.AddChecker("database", checker)
Use Cases:
- Database connectivity
- File system access
- Configuration validation
- Memory/disk space checks
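For example, a free-disk-space check fits the same pattern. A Linux-specific sketch using syscall.Statfs; the path and the 1 GiB threshold are illustrative:
diskChecker := observability.NewBasicHealthChecker("disk_space",
    func(ctx context.Context) error {
        var stat syscall.Statfs_t
        if err := syscall.Statfs("/var/lib/agenthub", &stat); err != nil {
            return err
        }
        // Available bytes = available blocks × block size.
        free := stat.Bavail * uint64(stat.Bsize)
        if free < 1<<30 { // require at least 1 GiB free
            return fmt.Errorf("low disk space: %d bytes free", free)
        }
        return nil
    })
healthServer.AddChecker("disk_space", diskChecker)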
GRPCHealthChecker
Purpose: gRPC connection health verification
Implementation:
checker := observability.NewGRPCHealthChecker("broker_connection", "localhost:50051")
healthServer.AddChecker("broker_connection", checker)
Use Cases:
- AgentHub broker connectivity
- External gRPC service dependencies
- Service mesh health
HTTPHealthChecker
Purpose: HTTP endpoint health verification
Implementation:
checker := observability.NewHTTPHealthChecker("api_gateway", "http://gateway:8080/health")
healthServer.AddChecker("api_gateway", checker)
Use Cases:
- REST API dependencies
- Web service health
- Load balancer backends
Custom Health Checkers
Interface:
type HealthChecker interface {
    Check(ctx context.Context) error
    Name() string
}
Custom Implementation Example:
type BusinessLogicChecker struct {
    name string
    validator func() error
}
func (c *BusinessLogicChecker) Check(ctx context.Context) error {
    return c.validator()
}
func (c *BusinessLogicChecker) Name() string {
    return c.name
}
// Usage
checker := &BusinessLogicChecker{
    name: "license_validation",
    validator: func() error {
        if time.Now().After(licenseExpiry) {
            return errors.New("license expired")
        }
        return nil
    },
}
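As with the built-in checkers, the custom checker is then registered on the health server by name:
healthServer.AddChecker("license_validation", checker)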
Health Check Configuration
Check Intervals
Default Intervals:
- Active checks: Every 30 seconds
- On-demand checks: Per request
- Startup checks: During service initialization
Configurable Timing:
config := observability.HealthConfig{
    CheckInterval: 15 * time.Second,
    Timeout:       5 * time.Second,
    RetryCount:    3,
    RetryDelay:    1 * time.Second,
}
Timeout Configuration
Per-Check Timeouts:
checker := observability.NewBasicHealthChecker("slow_service",
    func(ctx context.Context) error {
        // This check will timeout after 10 seconds
        return slowOperation(ctx)
    }).WithTimeout(10 * time.Second)
Global Timeout:
healthServer := observability.NewHealthServer("8080", "my-service", "1.0.0")
healthServer.SetGlobalTimeout(30 * time.Second)
Integration Patterns
Kubernetes Integration
Liveness Probe
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
Readiness Probe
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
Startup Probe
startupProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 30
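With these values, a container gets failureThreshold × periodSeconds = 30 × 5s = 150s (after the 10-second initial delay) to report ready. Kubernetes disables the liveness and readiness probes until the startup probe succeeds, so slow-starting agents are not restarted prematurely.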
Load Balancer Integration
HAProxy Configuration
backend agenthub_brokers
    balance roundrobin
    option httpchk GET /health
    server broker1 broker1:8080 check
    server broker2 broker2:8080 check
NGINX Configuration
upstream agenthub_backend {
    server broker1:8080;
    server broker2:8080;
}
location /health_check {
    proxy_pass http://agenthub_backend/health;
    proxy_set_header Host $host;
}
Prometheus Integration
Service Discovery
- job_name: 'agenthub-health'
  static_configs:
    - targets:
      - 'broker:8080'
      - 'publisher:8081'
      - 'subscriber:8082'
  metrics_path: '/metrics'
  scrape_interval: 10s
  scrape_timeout: 5s
Health Check Metrics
# Health check status (1=healthy, 0=unhealthy)
health_check_status{service="agenthub-broker",check_name="database"}
# Health check duration
health_check_duration_seconds{service="agenthub-broker",check_name="database"}
# Service uptime
service_uptime_seconds{service="agenthub-broker"}
Status Definitions
Service Status Levels
Healthy
Definition: All health checks passing
HTTP Status: 200 OK
Criteria:
- All registered checks return no error
- Service is fully operational
- All dependencies available
Degraded
Definition: Service operational but with limitations
HTTP Status: 200 OK (with warning indicators)
Criteria:
- Critical checks passing
- Non-critical checks may be failing
- Service can handle requests with reduced functionality
Unhealthy
Definition: Service cannot handle requests properly
HTTP Status: 503 Service Unavailable
Criteria:
- One or more critical checks failing
- Service should not receive new requests
- Requires intervention or automatic recovery
Check-Level Status
Passing
- Check completed successfully
- No errors detected
- Within acceptable parameters
Warning
- Check completed with minor issues
- Service functional but attention needed
- May indicate future problems
Critical
- Check failed
- Service functionality compromised
- Immediate attention required
Monitoring and Alerting
Critical Alerts
# Service down alert
- alert: ServiceHealthCheckFailing
  expr: health_check_status == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Service health check failing"
    description: "{{ $labels.service }} health check {{ $labels.check_name }} is failing"
# Service not ready alert
- alert: ServiceNotReady
  expr: up{job=~"agenthub-.*"} == 0
  for: 30s
  labels:
    severity: critical
  annotations:
    summary: "Service not responding"
    description: "{{ $labels.instance }} is not responding to health checks"
Warning Alerts
# Slow health checks
- alert: SlowHealthChecks
  expr: |
    histogram_quantile(0.95,
      sum by (service, check_name, le) (rate(health_check_duration_seconds_bucket[5m]))
    ) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Health checks taking too long"
    description: "{{ $labels.service }} health check {{ $labels.check_name }} taking {{ $value }}s"
# Service degraded
- alert: ServiceDegraded
  expr: service_status == 1  # degraded status
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Service running in degraded mode"
    description: "{{ $labels.service }} is degraded but still operational"
API Response Examples
Healthy Service Response
curl http://localhost:8080/health
{
  "status": "healthy",
  "timestamp": "2025-09-28T21:00:00.000Z",
  "service": "agenthub-broker",
  "version": "1.0.0",
  "uptime": "2h34m12s",
  "checks": [
    {
      "name": "self",
      "status": "healthy",
      "message": "Service is running normally",
      "last_checked": "2025-09-28T21:00:00.000Z",
      "duration": "1.2ms"
    },
    {
      "name": "grpc_server",
      "status": "healthy",
      "message": "gRPC server listening on :50051",
      "last_checked": "2025-09-28T21:00:00.000Z",
      "duration": "0.8ms"
    },
    {
      "name": "observability",
      "status": "healthy",
      "message": "OpenTelemetry exporter connected",
      "last_checked": "2025-09-28T21:00:00.000Z",
      "duration": "12.4ms"
    }
  ]
}
Unhealthy Service Response
curl http://localhost:8080/health
{
  "status": "unhealthy",
  "timestamp": "2025-09-28T21:00:00.000Z",
  "service": "agenthub-broker",
  "version": "1.0.0",
  "uptime": "2h34m12s",
  "checks": [
    {
      "name": "self",
      "status": "healthy",
      "message": "Service is running normally",
      "last_checked": "2025-09-28T21:00:00.000Z",
      "duration": "1.2ms"
    },
    {
      "name": "grpc_server",
      "status": "unhealthy",
      "message": "Failed to bind to port :50051: address already in use",
      "last_checked": "2025-09-28T21:00:00.000Z",
      "duration": "0.1ms"
    },
    {
      "name": "observability",
      "status": "healthy",
      "message": "OpenTelemetry exporter connected",
      "last_checked": "2025-09-28T21:00:00.000Z",
      "duration": "12.4ms"
    }
  ]
}
Best Practices
Health Check Design
- Fast Execution: Keep checks under 5 seconds
- Meaningful Tests: Test actual functionality, not just process existence
- Idempotent Operations: Checks should not modify system state
- Appropriate Timeouts: Set reasonable timeouts for external dependencies
- Clear Messages: Provide actionable error messages
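Applying these guidelines to the earlier database example: the check below bounds its own runtime, exercises a real round-trip instead of merely confirming the process exists, modifies nothing, and returns an actionable error (db is assumed to be a *sql.DB opened elsewhere):
checker := observability.NewBasicHealthChecker("database",
    func(ctx context.Context) error {
        // Bound the check even if the caller's context has no deadline.
        ctx, cancel := context.WithTimeout(ctx, 3*time.Second)
        defer cancel()
        // PingContext performs a real round-trip without modifying state.
        if err := db.PingContext(ctx); err != nil {
            return fmt.Errorf("database unreachable: %w", err)
        }
        return nil
    })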
Dependency Management
- Critical vs Non-Critical: Distinguish between essential and optional dependencies
- Cascade Prevention: Avoid cascading failures through dependency chains
- Circuit Breakers: Implement circuit breakers for flaky dependencies
- Graceful Degradation: Continue operating when non-critical dependencies fail
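One way to follow the circuit-breaker advice is a wrapper around the HealthChecker interface defined above that fails fast during a cool-down after repeated failures; the threshold and cool-down are illustrative:
type circuitBreakerChecker struct {
    inner     observability.HealthChecker
    mu        sync.Mutex
    failures  int
    openUntil time.Time
}

func (c *circuitBreakerChecker) Name() string { return c.inner.Name() }

func (c *circuitBreakerChecker) Check(ctx context.Context) error {
    c.mu.Lock()
    if time.Now().Before(c.openUntil) {
        c.mu.Unlock()
        // Circuit open: fail fast instead of hammering the dependency.
        return errors.New("circuit open: dependency failing recently")
    }
    c.mu.Unlock()

    err := c.inner.Check(ctx)

    c.mu.Lock()
    defer c.mu.Unlock()
    if err != nil {
        c.failures++
        if c.failures >= 3 { // trip after three consecutive failures
            c.openUntil = time.Now().Add(30 * time.Second)
            c.failures = 0
        }
        return err
    }
    c.failures = 0 // any success closes the circuit
    return nil
}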
Operational Considerations
- Monitoring: Set up alerts for health check failures
- Documentation: Document what each health check validates
- Testing: Test health checks in development and staging
- Versioning: Version health check APIs for compatibility
🎯 Next Steps:
Implementation: Add Observability to Your Agent
Monitoring: Use Grafana Dashboards
Metrics: Observability Metrics Reference
2 - AgentHub Observability Metrics Reference
Complete catalog of all metrics exposed by AgentHub’s observability system, their meanings, usage patterns, and query examples.
Overview
AgentHub automatically collects 47+ distinct metrics across all observable services, providing comprehensive visibility into event processing, system health, and performance characteristics.
Metric Categories
A2A Message Processing Metrics
a2a_messages_processed_total
Type: Counter
Description: Total number of A2A messages processed by service
Labels:
- service - Service name (agenthub, publisher, subscriber)
- message_type - Type of A2A message (task_update, message, artifact)
- success - Processing success (true/false)
- context_id - A2A conversation context (for workflow tracking)
Usage:
# A2A message processing rate per service
rate(a2a_messages_processed_total[5m])
# Success rate by A2A message type
rate(a2a_messages_processed_total{success="true"}[5m]) / rate(a2a_messages_processed_total[5m]) * 100
# Error rate across all A2A services
rate(a2a_messages_processed_total{success="false"}[5m]) / rate(a2a_messages_processed_total[5m]) * 100
# Workflow processing rate by context
sum by (context_id) (rate(a2a_messages_processed_total[5m]))
a2a_messages_published_total
Type: Counter
Description: Total number of A2A messages published by agents
Labels:
- message_type - Type of A2A message published
- from_agent_id - Publishing agent identifier
- to_agent_id - Target agent identifier (empty for broadcast)
Usage:
# A2A publishing rate by message type
sum by (message_type) (rate(a2a_messages_published_total[5m]))
# Most active A2A publishers
topk(5, sum by (from_agent_id) (rate(a2a_messages_published_total[5m])))
# Broadcast vs direct messaging ratio
sum(rate(a2a_messages_published_total{to_agent_id=""}[5m])) / sum(rate(a2a_messages_published_total[5m]))
a2a_message_processing_duration_seconds
Type: Histogram
Description: Time taken to process A2A messages
Labels:
- service - Service processing the message
- message_type - Type of A2A message being processed
- task_state - Current A2A task state (for task-related messages)
Buckets: 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10
Usage:
# p95 A2A message processing latency
histogram_quantile(0.95, rate(a2a_message_processing_duration_seconds_bucket[5m]))
# p99 latency by service
histogram_quantile(0.99, sum by (service, le) (rate(a2a_message_processing_duration_seconds_bucket[5m])))
# Average A2A processing time by task state
sum by (task_state) (rate(a2a_message_processing_duration_seconds_sum[5m])) / sum by (task_state) (rate(a2a_message_processing_duration_seconds_count[5m]))
a2a_message_errors_total
Type: Counter
Description: Total number of A2A message processing errors
Labels:
- service - Service where error occurred
- message_type - Type of A2A message that failed
- error_type - Category of error (grpc_error, validation_error, protocol_error, etc.)
- a2a_version - A2A protocol version for compatibility tracking
Usage:
# A2A error rate by error type
sum by (error_type) (rate(a2a_message_errors_total[5m]))
# Services with highest A2A error rates
topk(3, sum by (service) (rate(a2a_message_errors_total[5m])))
# A2A protocol version compatibility issues
sum by (a2a_version) (rate(a2a_message_errors_total{error_type="protocol_error"}[5m]))
AgentHub Broker Metrics
agenthub_connections_total
Type: Gauge
Description: Number of active agent connections to AgentHub broker
Labels:
- connection_type - Type of connection (a2a_publisher, a2a_subscriber, unified)
- agent_type - Classification of connected agent
Usage:
# Current AgentHub connection count
agenthub_connections_total
# A2A connection growth over time (delta, since this is a gauge)
delta(agenthub_connections_total[1h])
# Connection distribution by type
sum by (connection_type) (agenthub_connections_total)
agenthub_subscriptions_total
Type: Gauge
Description: Number of active A2A message subscriptions
Labels:
- agent_id - Subscriber agent identifier
- subscription_type - Type of A2A subscription (tasks, messages, agent_events)
- filter_criteria - Applied subscription filters (task_types, states, etc.)
Usage:
# Total active A2A subscriptions
sum(agenthub_subscriptions_total)
# A2A subscriptions by agent
sum(agenthub_subscriptions_total) by (agent_id)
# Most popular A2A subscription types
sum(agenthub_subscriptions_total) by (subscription_type)
# Filtered vs unfiltered subscriptions
sum(agenthub_subscriptions_total{filter_criteria!=""}) / sum(agenthub_subscriptions_total)
agenthub_message_routing_duration_seconds
Type: Histogram
Description: Time taken to route A2A messages through AgentHub broker
Labels:
- routing_type - Type of routing (direct, broadcast, filtered)
- message_size_bucket - Message size classification (small, medium, large)
Buckets: 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.025, 0.05, 0.1
Usage:
# AgentHub A2A routing latency percentiles
histogram_quantile(0.95, rate(agenthub_message_routing_duration_seconds_bucket[5m]))
# A2A routing performance by type
sum by (routing_type) (rate(agenthub_message_routing_duration_seconds_sum[5m])) / sum by (routing_type) (rate(agenthub_message_routing_duration_seconds_count[5m]))
# Message size impact on routing
histogram_quantile(0.95, sum by (message_size_bucket, le) (rate(agenthub_message_routing_duration_seconds_bucket[5m])))
agenthub_queue_size
Type: Gauge
Description: Number of A2A messages queued awaiting routing
Labels:
- queue_type - Type of queue (incoming, outgoing, dead_letter, retry)
- priority - Message priority level
- context_active - Whether messages belong to active A2A contexts
Usage:
# Current A2A queue sizes
sum by (queue_type) (agenthub_queue_size)
# A2A queue growth rate (delta, since this is a gauge)
delta(agenthub_queue_size[5m])
# Priority queue distribution
sum by (priority) (agenthub_queue_size)
# Active context message backlog
agenthub_queue_size{context_active="true"}
System Health Metrics
system_cpu_usage_percent
Type: Gauge
Description: CPU utilization percentage
Labels:
- service - Service name
Usage:
# Current CPU usage
system_cpu_usage_percent
# High CPU services
system_cpu_usage_percent > 80
# Average CPU over time
avg_over_time(system_cpu_usage_percent[1h])
system_memory_usage_bytes
Type: Gauge
Description: Memory usage in bytes
Labels:
- service - Service name
- type - Memory type (heap, stack, total)
Usage:
# Memory usage in MB
system_memory_usage_bytes / 1024 / 1024
# Memory growth rate (per-second slope of the gauge)
deriv(system_memory_usage_bytes[10m])
# Memory usage by type
sum by (type) (system_memory_usage_bytes)
system_goroutines_total
Type: Gauge
Description: Number of active goroutines
Labels:
- service - Service name
Usage:
# Current goroutine count
system_goroutines_total
# Goroutine leak detection
delta(system_goroutines_total[1h]) > 1000
# Goroutine efficiency
system_goroutines_total / system_cpu_usage_percent
system_file_descriptors_used
Type: Gauge
Description: Number of open file descriptors
Labels:
- service - Service name
Usage:
# Current FD usage
system_file_descriptors_used
# FD growth rate (per-second slope of the gauge)
deriv(system_file_descriptors_used[5m])
A2A Task-Specific Metrics
a2a_tasks_created_total
Type: Counter
Description: Total number of A2A tasks created
Labels:
- task_type - Type classification of the task
- context_id - A2A conversation context
- priority - Task priority level
Usage:
# A2A task creation rate
rate(a2a_tasks_created_total[5m])
# Task creation by type
sum by (task_type) (rate(a2a_tasks_created_total[5m]))
# High priority task rate
rate(a2a_tasks_created_total{priority="PRIORITY_HIGH"}[5m])
a2a_task_state_transitions_total
Type: Counter
Description: Total number of A2A task state transitions
Labels:
- from_state - Previous task state
- to_state - New task state
- task_type - Type of task transitioning
Usage:
# Task completion rate
rate(a2a_task_state_transitions_total{to_state="TASK_STATE_COMPLETED"}[5m])
# Task failure rate
rate(a2a_task_state_transitions_total{to_state="TASK_STATE_FAILED"}[5m])
# Task state transition patterns
sum by (from_state, to_state) (rate(a2a_task_state_transitions_total[5m]))
a2a_task_duration_seconds
Type: Histogram
Description: Duration of A2A task execution from submission to completion
Labels:
- task_type - Type of task
- final_state - Final task state (COMPLETED, FAILED, CANCELLED)
Buckets: 0.1, 0.5, 1, 5, 10, 30, 60, 300, 600, 1800
Usage:
# A2A task completion time percentiles
histogram_quantile(0.95, rate(a2a_task_duration_seconds_bucket{final_state="TASK_STATE_COMPLETED"}[5m]))
# Task duration by type
histogram_quantile(0.50, sum by (task_type, le) (rate(a2a_task_duration_seconds_bucket[5m])))
# Failed vs successful task duration comparison
histogram_quantile(0.95, sum by (final_state, le) (rate(a2a_task_duration_seconds_bucket[5m])))
a2a_artifacts_produced_total
Type: Counter
Description: Total number of A2A artifacts produced by completed tasks
Labels:
- artifact_type - Type of artifact (data, file, text)
- task_type - Type of task that produced the artifact
- artifact_size_bucket - Size classification of artifact
Usage:
# Artifact production rate
rate(a2a_artifacts_produced_total[5m])
# Artifacts by type
sum by (artifact_type) (rate(a2a_artifacts_produced_total[5m]))
# Large artifact production rate
rate(a2a_artifacts_produced_total{artifact_size_bucket="large"}[5m])
gRPC Metrics
grpc_server_started_total
Type: Counter
Description: Total number of RPCs started on the AgentHub server
Labels:
- grpc_method - gRPC method name (PublishMessage, SubscribeToTasks, etc.)
- grpc_service - gRPC service name (AgentHub)
Usage:
# AgentHub RPC request rate
rate(grpc_server_started_total[5m])
# Most called A2A methods
topk(5, sum by (grpc_method) (rate(grpc_server_started_total[5m])))
# A2A vs EDA method usage
rate(grpc_server_started_total{grpc_method=~".*Message.*|.*Task.*"}[5m])
grpc_server_handled_total
Type: Counter
Description: Total number of RPCs completed on the AgentHub server
Labels:
- grpc_method - gRPC method name
- grpc_service - gRPC service name (AgentHub)
- grpc_code - gRPC status code
- a2a_operation - A2A operation type (publish, subscribe, get, cancel)
Usage:
# AgentHub RPC success rate
rate(grpc_server_handled_total{grpc_code="OK"}[5m]) / rate(grpc_server_handled_total[5m]) * 100
# A2A operation error rate
sum by (a2a_operation) (rate(grpc_server_handled_total{grpc_code!="OK"}[5m]))
# A2A method-specific success rates
sum by (grpc_method) (rate(grpc_server_handled_total{grpc_code="OK"}[5m])) / sum by (grpc_method) (rate(grpc_server_handled_total[5m]))
grpc_server_handling_seconds
Type: Histogram
Description: Histogram of response latency of AgentHub RPCs
Labels:
- grpc_method - gRPC method name
- grpc_service - gRPC service name (AgentHub)
- a2a_operation - A2A operation type
Usage:
# AgentHub gRPC latency percentiles
histogram_quantile(0.95, rate(grpc_server_handling_seconds_bucket[5m]))
# Slow A2A operations
histogram_quantile(0.95, sum by (a2a_operation, le) (rate(grpc_server_handling_seconds_bucket[5m]))) > 0.1
# A2A method performance comparison
histogram_quantile(0.95, sum by (grpc_method, le) (rate(grpc_server_handling_seconds_bucket[5m])))
Health Check Metrics
health_check_status
Type: Gauge
Description: Health check status (1=healthy, 0=unhealthy)
Labels:
- service - Service name
- check_name - Name of the health check
- endpoint - Health check endpoint
Usage:
# Unhealthy services
health_check_status == 0
# Health check success rate
avg_over_time(health_check_status[5m])
health_check_duration_seconds
Type: Histogram
Description: Time taken to execute health checks
Labels:
- service - Service name
- check_name - Name of the health check
Usage:
# Health check latency
histogram_quantile(0.95, rate(health_check_duration_seconds_bucket[5m]))
# Slow health checks
histogram_quantile(0.95, sum by (service, check_name, le) (rate(health_check_duration_seconds_bucket[5m]))) > 0.5
OpenTelemetry Metrics
otelcol_processor_batch_batch_send_size_count
Type: Counter
Description: Number of batches sent by OTEL collector
Labels: None
otelcol_exporter_sent_spans
Type: Counter
Description: Number of spans sent to tracing backend
Labels:
- exporter - Exporter name (jaeger, otlp)
Usage:
# Span export rate
rate(otelcol_exporter_sent_spans[5m])
# Export success by backend
sum by (exporter) (rate(otelcol_exporter_sent_spans[5m]))
Common Query Patterns
# Top 5 slowest A2A message types
topk(5,
  histogram_quantile(0.95,
    sum by (message_type, le) (rate(a2a_message_processing_duration_seconds_bucket[5m]))
  )
)
# A2A task completion time analysis
histogram_quantile(0.95,
  sum by (task_type, le) (rate(a2a_task_duration_seconds_bucket{final_state="TASK_STATE_COMPLETED"}[5m]))
)
# Services exceeding A2A latency SLA (>500ms p95)
histogram_quantile(0.95,
  sum by (service, le) (rate(a2a_message_processing_duration_seconds_bucket[5m]))
) > 0.5
# A2A throughput efficiency (messages per CPU percent)
sum by (service) (rate(a2a_messages_processed_total[5m])) / system_cpu_usage_percent
# Task success rate by type
sum by (task_type) (rate(a2a_task_state_transitions_total{to_state="TASK_STATE_COMPLETED"}[5m])) /
sum by (task_type) (rate(a2a_tasks_created_total[5m]))
A2A Error Analysis
# A2A message error rate by service over time
sum by (service) (rate(a2a_message_errors_total[5m])) / sum by (service) (rate(a2a_messages_processed_total[5m])) * 100
# A2A task failure rate
sum(rate(a2a_task_state_transitions_total{to_state="TASK_STATE_FAILED"}[5m])) /
sum(rate(a2a_tasks_created_total[5m])) * 100
# Most common A2A error types
topk(5, sum by (error_type) (rate(a2a_message_errors_total[5m])))
# A2A protocol compatibility issues
sum by (a2a_version) (rate(a2a_message_errors_total{error_type="protocol_error"}[5m]))
# Services with increasing A2A error rates
sum by (service) (increase(a2a_message_errors_total[1h])) > 10
A2A Capacity Planning
# Peak hourly A2A message throughput
max_over_time(
  rate(a2a_messages_processed_total[5m])[1h:]
) * 3600
# Peak A2A task creation rate
max_over_time(
  rate(a2a_tasks_created_total[5m])[1h:]
) * 3600
# Resource utilization during peak A2A load
max_over_time(system_cpu_usage_percent[1h:])
  + on (service)
max_over_time((system_memory_usage_bytes{type="total"} / 1024 / 1024 / 1024)[1h:])
# AgentHub connection scaling needs
max_over_time(agenthub_connections_total[24h:])
# A2A queue depth trends
max by (queue_type) (max_over_time(agenthub_queue_size[24h:]))
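The [1h:] suffix is a PromQL subquery: it re-evaluates the inner expression at regular steps over the past hour so max_over_time can pick out the peak. Multiplying a peak per-second rate by 3600 then converts it into an hourly throughput figure.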
A2A System Health
# Overall A2A system health score (0-1)
avg(health_check_status)
# A2A services with degraded performance
(
  system_cpu_usage_percent > 70 or
  system_memory_usage_bytes > 1e9 or
  sum by (service) (rate(a2a_message_errors_total[5m])) / sum by (service) (rate(a2a_messages_processed_total[5m])) > 0.05
)
# A2A task backlog health
agenthub_queue_size{queue_type="incoming"} > 1000
# A2A protocol health indicators
sum(rate(a2a_task_state_transitions_total{to_state="TASK_STATE_FAILED"}[5m])) /
sum(rate(a2a_tasks_created_total[5m])) > 0.1
# Resource leak detection
delta(system_goroutines_total[1h]) > 1000 or
delta(system_file_descriptors_used[1h]) > 100
Alert Rule Examples
Critical A2A Alerts
# High A2A message processing error rate alert
- alert: HighA2AMessageProcessingErrorRate
  expr: |
    (
      sum by (service) (rate(a2a_message_errors_total[5m])) /
      sum by (service) (rate(a2a_messages_processed_total[5m]))
    ) * 100 > 10
  for: 2m
  annotations:
    summary: "High A2A message processing error rate"
    description: "{{ $labels.service }} has {{ $value }}% A2A error rate"
# High A2A task failure rate alert
- alert: HighA2ATaskFailureRate
  expr: |
    (
      sum by (task_type) (rate(a2a_task_state_transitions_total{to_state="TASK_STATE_FAILED"}[5m])) /
      sum by (task_type) (rate(a2a_tasks_created_total[5m]))
    ) * 100 > 15
  for: 3m
  annotations:
    summary: "High A2A task failure rate"
    description: "{{ $value }}% of A2A tasks are failing for task type {{ $labels.task_type }}"
# AgentHub service down alert
- alert: AgentHubServiceDown
  expr: health_check_status == 0
  for: 1m
  annotations:
    summary: "AgentHub service health check failing"
    description: "{{ $labels.service }} health check {{ $labels.check_name }} is failing"
# A2A queue backlog alert
- alert: A2AQueueBacklog
  expr: agenthub_queue_size{queue_type="incoming"} > 1000
  for: 5m
  annotations:
    summary: "A2A message queue backlog"
    description: "AgentHub has {{ $value }} messages queued"
A2A Warning Alerts
# High A2A message processing latency warning
- alert: HighA2AMessageProcessingLatency
  expr: |
    histogram_quantile(0.95,
      rate(a2a_message_processing_duration_seconds_bucket[5m])
    ) > 0.5    
  for: 5m
  annotations:
    summary: "High A2A message processing latency"
    description: "{{ $labels.service }} A2A p95 latency is {{ $value }}s"
# Slow A2A task completion warning
- alert: SlowA2ATaskCompletion
  expr: |
    histogram_quantile(0.95,
      rate(a2a_task_duration_seconds_bucket{final_state="TASK_STATE_COMPLETED"}[5m])
    ) > 300    
  for: 10m
  annotations:
    summary: "Slow A2A task completion"
    description: "A2A tasks of type {{ $labels.task_type }} taking {{ $value }}s to complete"
# High CPU usage warning
- alert: HighCPUUsage
  expr: system_cpu_usage_percent > 80
  for: 5m
  annotations:
    summary: "High CPU usage"
    description: "{{ $labels.service }} CPU usage is {{ $value }}%"
# A2A protocol version compatibility warning
- alert: A2AProtocolVersionMismatch
  expr: |
    rate(a2a_message_errors_total{error_type="protocol_error"}[5m]) > 0.1    
  for: 3m
  annotations:
    summary: "A2A protocol version compatibility issues"
    description: "A2A protocol errors detected for version {{ $labels.a2a_version }}"
Metric Retention and Storage
Retention Policies
- Raw metrics: 15 days at 15-second resolution
- 5m averages: 60 days
- 1h averages: 1 year
- 1d averages: 5 years
Storage Requirements
- Per service: ~2MB/day for all metrics
- Complete system: ~10MB/day for 5 services
- 1 year retention: ~3.6GB total
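As a sanity check, the one-year figure is the daily rate scaled up: 10 MB/day × 365 days ≈ 3.6 GB, with the downsampled 5m and 1h averages keeping older data cheap.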
Query and Scrape Settings
- Scrape interval: 10 seconds (configurable)
- Evaluation interval: 15 seconds for alerts
- Query timeout: 30 seconds
- Max samples: 50M per query
Integration Examples
Grafana Dashboard Variables
{
  "service": {
    "query": "label_values(a2a_messages_processed_total, service)",
    "refresh": "on_time_range_changed"
  },
  "message_type": {
    "query": "label_values(a2a_messages_processed_total{service=\"$service\"}, message_type)",
    "refresh": "on_dashboard_load"
  },
  "task_type": {
    "query": "label_values(a2a_tasks_created_total, task_type)",
    "refresh": "on_dashboard_load"
  },
  "context_id": {
    "query": "label_values(a2a_messages_processed_total{service=\"$service\"}, context_id)",
    "refresh": "on_dashboard_load"
  }
}
Custom A2A Application Metrics
import (
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/metric"
)

// meter is a metric.Meter obtained from the configured OpenTelemetry
// MeterProvider (e.g. otel.Meter("my-agent")).

// Register custom A2A counter
a2aCustomCounter, err := meter.Int64Counter(
    "a2a_custom_business_metric_total",
    metric.WithDescription("Custom A2A business metric"),
)
// Increment with A2A context and labels
a2aCustomCounter.Add(ctx, 1, metric.WithAttributes(
    attribute.String("task_type", "custom_analysis"),
    attribute.String("context_id", contextID),
    attribute.String("agent_type", "analytics_agent"),
    attribute.String("a2a_version", "1.0"),
))
// Register A2A task-specific histogram
a2aTaskHistogram, err := meter.Float64Histogram(
    "a2a_custom_task_processing_seconds",
    metric.WithDescription("Custom A2A task processing time"),
    metric.WithUnit("s"),
)
// Record A2A task timing
start := time.Now()
// ... process A2A task ...
duration := time.Since(start).Seconds()
a2aTaskHistogram.Record(ctx, duration, metric.WithAttributes(
    attribute.String("task_type", taskType),
    attribute.String("task_state", "TASK_STATE_COMPLETED"),
))
Troubleshooting Metrics
Missing Metrics Checklist
- ✅ Service built with -tags observability
- ✅ Prometheus can reach metrics endpoint
- ✅ Correct port in Prometheus config
- ✅ Service is actually processing events
- ✅ OpenTelemetry exporter configured correctly
High Cardinality Warning
Avoid metrics with unbounded label values:
- ❌ User IDs as labels (millions of values)
- ❌ Timestamps as labels
- ❌ Request IDs as labels
- ✅ Event types (limited set)
- ✅ Service names (limited set)
- ✅ Status codes (limited set)
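The series count for a metric is the product of its labels' distinct values: 3 services × 5 event types × 2 status values is a manageable 30 series, but swapping any of those for an unbounded value (a request ID, say) multiplies storage and query cost without limit.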
🎯 Next Steps:
Implementation: Add Observability to Your Agent
Monitoring: Use Grafana Dashboards
Understanding: Distributed Tracing Explained