Distributed Tracing & OpenTelemetry
Understanding-oriented: Deep dive into distributed tracing concepts, OpenTelemetry architecture, and how AgentHub implements comprehensive observability for event-driven systems.
The Problem: Observing Distributed Systems
Traditional monolithic applications are relatively easy to debug: everything happens in one process, on one machine, with one log file. But modern event-driven architectures like AgentHub present unique challenges:
The Complexity of Event-Driven Systems
Request Flow in AgentHub:
User → Publisher Agent → AgentHub Broker → Subscriber Agent → Result → Publisher Agent
Each step involves:
- Different processes (potentially on different machines)
- Asynchronous communication (events, not direct calls)
- Multiple protocol layers (gRPC, HTTP, network)
- Independent failure modes (network partitions, service crashes)
- Varying performance characteristics (CPU, memory, I/O)
Traditional Debugging Challenges
Without distributed tracing:
Publisher logs:   "Published task task_123 at 10:00:01"
Broker logs:     "Received task from agent_pub at 10:00:01"
                 "Routed task to agent_sub at 10:00:01"
Subscriber logs: "Processing task task_456 at 10:00:02"
                 "Completed task task_789 at 10:00:03"
Questions you can’t answer:
- Which subscriber processed task_123?
- How long did task_123 take end-to-end?
- Where did task_123 fail?
- What was the complete flow for a specific request?
The Solution: Distributed Tracing
Distributed tracing solves these problems by creating a unified view of requests as they flow through multiple services.
Core Concepts
Trace
A trace represents a complete request journey through the system. In AgentHub, a trace might represent:
- Publishing a task
- Processing the task
- Publishing the result
- Receiving the result
Trace ID: a1b2c3d4e5f67890
Duration: 150ms
Services: 3 (publisher, broker, subscriber)
Spans: 5
Status: Success
Span
A span represents a single operation within a trace. Each span has:
- Name: What operation it represents
- Start/End time: When it happened
- Tags: Metadata about the operation
- Logs: Events that occurred during the operation
- Status: Success, error, or timeout
Span: "publish_event"
  Service: agenthub-publisher
  Duration: 25ms
  Tags:
    event.type: "greeting"
    event.id: "task_123"
    responder.agent: "agent_demo_subscriber"
  Status: OK
Span Context
The glue that connects spans across service boundaries. Contains:
- Trace ID: Unique identifier for the entire request
- Span ID: Unique identifier for the current operation
- Trace Flags: Sampling decisions, debug mode, etc.
How Tracing Works in AgentHub
1. Trace Initiation
When a publisher creates a task, it starts a new trace:
// Publisher starts a trace
ctx, span := tracer.Start(ctx, "publish_event")
defer span.End()
// Add metadata
span.SetAttributes(
    attribute.String("event.type", "greeting"),
    attribute.String("event.id", taskID),
)
2. Context Propagation
The trace context is injected into the task metadata:
// Inject trace context into task headers
headers := make(map[string]string)
otel.GetTextMapPropagator().Inject(ctx, propagation.MapCarrier(headers))
// Embed headers in task metadata
task.Metadata = &structpb.Struct{
    Fields: map[string]*structpb.Value{
        "trace_headers": structpb.NewStructValue(&structpb.Struct{
            Fields: stringMapToStructFields(headers),
        }),
    },
}
3. Context Extraction
The broker and subscriber extract the trace context:
// Extract trace context from task metadata
if metadata := task.GetMetadata(); metadata != nil {
    if traceHeaders, ok := metadata.Fields["trace_headers"]; ok {
        headers := structFieldsToStringMap(traceHeaders.GetStructValue().Fields)
        ctx = otel.GetTextMapPropagator().Extract(ctx, propagation.MapCarrier(headers))
    }
}
// Continue the trace
ctx, span := tracer.Start(ctx, "process_event")
defer span.End()
4. Complete Request Flow
The result is a complete trace showing the entire request journey:
Trace: a1b2c3d4e5f67890
├── publish_event (agenthub-publisher) [25ms]
│   ├── event.type: greeting
│   └── event.id: task_123
├── route_task (agenthub-broker) [2ms]
│   ├── source.agent: agent_demo_publisher
│   └── target.agent: agent_demo_subscriber
├── consume_event (agenthub-subscriber) [5ms]
│   └── messaging.operation: receive
├── process_task (agenthub-subscriber) [98ms]
│   ├── task.type: greeting
│   ├── task.parameter.name: Claude
│   └── processing.status: completed
└── publish_result (agenthub-subscriber) [20ms]
    └── result.status: success
OpenTelemetry Architecture
OpenTelemetry is the observability framework that powers AgentHub’s tracing implementation.
The OpenTelemetry Stack
┌──────────────────────────────────────────────────────────┐
│                       Applications                       │
│   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      │
│   │  Publisher  │  │   Broker    │  │ Subscriber  │      │
│   └─────────────┘  └─────────────┘  └─────────────┘      │
└─────────┬────────────────┬────────────────┬──────────────┘
          │                │                │
┌─────────┴────────────────┴────────────────┴──────────────┐
│                    OpenTelemetry SDK                     │
│   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      │
│   │   Tracer    │  │    Meter    │  │   Logger    │      │
│   └─────────────┘  └─────────────┘  └─────────────┘      │
└───────────────────────────┬──────────────────────────────┘
                            │
┌───────────────────────────┴──────────────────────────────┐
│                  OpenTelemetry Collector                 │
│   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      │
│   │  Receivers  │  │ Processors  │  │  Exporters  │      │
│   └─────────────┘  └─────────────┘  └─────────────┘      │
└─────────┬────────────────┬────────────────┬──────────────┘
          │                │                │
┌─────────┴───────┐  ┌─────┴────────┐  ┌────┴────────┐
│     Jaeger      │  │  Prometheus  │  │    Logs     │
│    (Tracing)    │  │  (Metrics)   │  │  (Logging)  │
└─────────────────┘  └──────────────┘  └─────────────┘
Core Components
Tracer
Creates and manages spans:
tracer := otel.Tracer("agenthub-publisher")
ctx, span := tracer.Start(ctx, "publish_event")
defer span.End()
Meter
Creates and manages metrics:
meter := otel.Meter("agenthub-publisher")
counter, _ := meter.Int64Counter("events_published_total")
counter.Add(ctx, 1)
Propagators
Handle context propagation across service boundaries:
// Inject context
otel.GetTextMapPropagator().Inject(ctx, carrier)
// Extract context
ctx = otel.GetTextMapPropagator().Extract(ctx, carrier)
Exporters
Send telemetry data to backend systems:
- OTLP Exporter: Sends to OpenTelemetry Collector
- Jaeger Exporter: Sends directly to Jaeger
- Prometheus Exporter: Exposes metrics for Prometheus
AgentHub’s OpenTelemetry Implementation
Configuration
func NewObservability(config Config) (*Observability, error) {
    ctx := context.Background()

    // Create resource (service identification)
    res, err := resource.New(ctx,
        resource.WithAttributes(
            semconv.ServiceName(config.ServiceName),
            semconv.ServiceVersion(config.ServiceVersion),
        ),
    )
    if err != nil {
        return nil, err
    }

    // Setup tracing
    traceExporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint(config.JaegerEndpoint),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }
    tracerProvider := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(traceExporter),
        sdktrace.WithResource(res),
        sdktrace.WithSampler(sdktrace.AlwaysSample()),
    )
    otel.SetTracerProvider(tracerProvider)

    // Setup metrics (promExporter from go.opentelemetry.io/otel/exporters/prometheus)
    promExporter, err := prometheus.New()
    if err != nil {
        return nil, err
    }
    meterProvider := sdkmetric.NewMeterProvider(
        sdkmetric.WithResource(res),
        sdkmetric.WithReader(promExporter),
    )
    otel.SetMeterProvider(meterProvider)

    return &Observability{ /* providers retained for shutdown */ }, nil
}
Custom slog Handler Integration
AgentHub’s custom logging handler automatically correlates logs with traces:
func (h *ObservabilityHandler) Handle(ctx context.Context, r slog.Record) error {
    // Structured log output with trace correlation
    logData := map[string]interface{}{
        "time":    r.Time.Format(time.RFC3339),
        "level":   r.Level.String(),
        "msg":     r.Message,
        "service": h.serviceName,
    }

    // Extract trace context so every log line carries its trace and span IDs
    if span := trace.SpanFromContext(ctx); span.SpanContext().IsValid() {
        spanCtx := span.SpanContext()
        logData["trace_id"] = spanCtx.TraceID().String()
        logData["span_id"] = spanCtx.SpanID().String()
    }

    // ... encode and write logData to the output stream
    return nil
}
Observability Patterns in Event-Driven Systems
Pattern 1: Event Correlation
Challenge: Correlating events across async boundaries.
Solution: Inject trace context into event metadata.
// Publisher injects context
headers := make(map[string]string)
otel.GetTextMapPropagator().Inject(ctx, propagation.MapCarrier(headers))
event.Metadata["trace_headers"] = headers
// Consumer extracts context
ctx = otel.GetTextMapPropagator().Extract(ctx, propagation.MapCarrier(event.Metadata["trace_headers"]))
Pattern 2: Async Operation Tracking
Challenge: Tracking operations that complete asynchronously.
Solution: Create child spans that can outlive their parents.
// Start async operation
ctx, span := tracer.Start(ctx, "async_operation")
go func() {
    defer span.End()
    // Long-running async work
    processTask()
    span.SetStatus(codes.Ok, "") // success; codes from go.opentelemetry.io/otel/codes
}()
// Parent can continue/return immediately
Pattern 3: Error Propagation
Challenge: Understanding how errors flow through the system.
Solution: Record errors at each span and propagate error status.
if err != nil {
    span.RecordError(err)
    span.SetStatus(codes.Error, err.Error()) // codes from go.opentelemetry.io/otel/codes
    // Optionally add error details
    span.SetAttributes(
        attribute.String("error.type", "validation_error"),
        attribute.String("error.message", err.Error()),
    )
}
Pattern 4: Performance Attribution
Challenge: Understanding where time is spent in complex flows.
Solution: Detailed span hierarchy with timing.
// High-level operation
ctx, span := tracer.Start(ctx, "process_task")
defer span.End()
// Sub-operations, each a child of process_task (reusing the parent ctx
// keeps them siblings instead of chaining off an already-ended span)
_, validateSpan := tracer.Start(ctx, "validate_input")
// ... validation logic
validateSpan.End()
_, computeSpan := tracer.Start(ctx, "compute_result")
// ... computation logic
computeSpan.End()
_, persistSpan := tracer.Start(ctx, "persist_result")
// ... persistence logic
persistSpan.End()
Benefits of AgentHub’s Observability Implementation
1. Complete Request Visibility
- See every step of event processing
- Understand inter-service dependencies
- Track request flows across async boundaries
2. Performance Analysis
- Identify bottlenecks in event processing
- Understand where time is spent
- Optimize critical paths
3. Error Diagnosis
- Pinpoint exactly where failures occur
- Understand error propagation patterns
- Correlate errors with system state
4. Capacity Planning
- Understand system throughput characteristics
- Identify scaling bottlenecks
- Plan resource allocation
5. Troubleshooting
- Correlate logs, metrics, and traces
- Understand system behavior under load
- Debug complex distributed issues
Advanced Tracing Concepts
Sampling
Not every request needs to be traced. Sampling reduces overhead:
// Probability sampling (trace 10% of requests)
sdktrace.WithSampler(sdktrace.ParentBased(
    sdktrace.TraceIDRatioBased(0.1),
))
// Rate-limited sampling (e.g. max N traces/second) is not built into the
// Go SDK; it requires a custom Sampler implementation or a remote sampler.
Custom Attributes
Add business context to spans:
span.SetAttributes(
    attribute.String("user.id", userID),
    attribute.String("tenant.id", tenantID),
    attribute.Int("batch.size", len(items)),
    attribute.String("workflow.type", workflowType),
)
Span Events
Add timestamped events within spans:
span.AddEvent("validation.started")
// ... validation logic
span.AddEvent("validation.completed", trace.WithAttributes(
    attribute.Int("validation.rules.evaluated", ruleCount),
))
Baggage
Propagate key-value pairs across the entire trace:
// Set baggage (go.opentelemetry.io/otel/baggage)
tier, _ := baggage.NewMember("user.tier", "premium")
flag, _ := baggage.NewMember("feature.flag", "new_algorithm")
bag, _ := baggage.New(tier, flag)
ctx = baggage.ContextWith(ctx, bag)
// Read baggage in any service
if member := baggage.FromContext(ctx).Member("user.tier"); member.Value() == "premium" {
    // Use premium algorithm
}
Performance Considerations
Overhead Analysis
AgentHub’s observability adds:
- CPU: ~5% overhead for tracing
- Memory: ~50MB per service for buffers and metadata
- Network: Minimal (async batched export)
- Latency: ~10ms additional end-to-end latency
Optimization Strategies
- Sampling: Reduce trace volume for high-throughput systems
- Batching: Export spans in batches to reduce network overhead
- Async Processing: Never block business logic for observability
- Resource Limits: Use memory limiters in the collector
Production Recommendations
- Enable sampling for high-volume systems
- Monitor collector performance and scale horizontally if needed
- Set retention policies for traces and metrics
- Use dedicated infrastructure for observability stack
Troubleshooting Common Issues
Missing Traces
Symptoms: No traces appear in Jaeger.
Causes:
- Context not propagated correctly
- Exporter configuration issues
- Collector connectivity problems
Debugging:
# Check if spans are being created
curl http://localhost:8080/metrics | grep trace
# Check collector logs
docker-compose logs otel-collector
# Verify Jaeger connectivity
curl http://localhost:16686/api/traces
Broken Trace Chains
Symptoms: Spans appear disconnected.
Causes:
- Context not extracted properly
- New context created instead of continuing existing
Debugging:
// Always check if context contains active span
if span := trace.SpanFromContext(ctx); span.SpanContext().IsValid() {
    fmt.Printf("Active trace: %s\n", span.SpanContext().TraceID())
} else {
    fmt.Println("No active trace context")
}
High Memory Usage
Symptoms: Observability causing OOM errors.
Causes:
- Too many spans in memory
- Large span attributes
- Export failures causing backlog
Solutions:
// Configure span limits to bound per-span memory
tp := sdktrace.NewTracerProvider(
    sdktrace.WithSpanLimits(sdktrace.SpanLimits{
        AttributeCountLimit: 128,
        EventCountLimit:     128,
        LinkCountLimit:      128,
    }),
)
The Future of Observability
Emerging Trends
- eBPF-based Observability: Automatic instrumentation without code changes
- AI-Powered Analysis: Automatic anomaly detection and root cause analysis
- Unified Observability: Single pane of glass for metrics, traces, logs, and profiles
- Real-time Alerting: Faster detection and response to issues
OpenTelemetry Roadmap
- Profiling: Continuous profiling integration
- Client-side Observability: Browser and mobile app tracing
- Database Instrumentation: Automatic query tracing
- Infrastructure Correlation: Link application traces to infrastructure metrics
Conclusion
Distributed tracing transforms debugging from guesswork into precise investigation. AgentHub’s implementation with OpenTelemetry provides:
- Complete visibility into event-driven workflows
- Performance insights for optimization
- Error correlation for faster resolution
- Business context through custom attributes
The investment in observability pays dividends in:
- Reduced MTTR (Mean Time To Resolution)
- Improved performance through data-driven optimization
- Better user experience through proactive monitoring
- Team productivity through better tooling
Ready to Implement?
Hands-on: Observability Demo Tutorial
Production: Add Observability to Your Agent
Deep Dive: Observability Architecture