How to Use Grafana Dashboards
Goal-oriented guide: Master the AgentHub observability dashboards to monitor, analyze, and troubleshoot your event-driven system effectively.
Prerequisites
- AgentHub observability stack running (docker-compose up -d)
- AgentHub agents running with observability enabled
- Basic understanding of metrics concepts
- 10-15 minutes
Quick Access
- Grafana Dashboard: http://localhost:3333 (admin/admin)
- Direct Dashboard: http://localhost:3333/d/agenthub-eda-dashboard
Dashboard Overview
The AgentHub EDA System Observatory provides comprehensive monitoring across three main areas:
- Event Metrics (Top Row) - Event processing performance
- Distributed Tracing (Middle) - Request flow visualization
- System Health (Bottom Row) - Infrastructure monitoring
Panel-by-Panel Guide
Event Processing Rate (Top Left)
What it shows: Events processed per second by each service
How to use:
- Monitor throughput: See how many events your system processes
- Identify bottlenecks: Low rates may indicate performance issues
- Compare services: See which agents are busiest
Reading the chart:
Green line: agenthub-broker (150 events/sec)
Blue line:  agenthub-publisher (50 events/sec)
Red line:   agenthub-subscriber (145 events/sec)
Troubleshooting:
- Flat lines: No activity - check if agents are running
- Dropping rates: Performance degradation - check CPU/memory
- Spiky patterns: Bursty workloads - consider load balancing
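To reproduce this panel's query in Explore, a per-second rate over a 5-minute window looks like the sketch below. The metric name events_processed_total comes from the alerting example later in this guide; the service label is an assumption and may be named differently in your setup.
# Events processed per second, grouped by service (label name assumed)
sum by (service) (rate(events_processed_total[5m]))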
Event Processing Error Rate (Top Right)
What it shows: Percentage of events that failed processing
How to use:
- Monitor reliability: Should stay below 5% (green zone)
- Alert threshold: Yellow above 5%, red above 10%
- Quick health check: Assess system reliability at a glance
Color coding:
- Green (0-5%): Healthy system
- Yellow (5-10%): Moderate issues
- Red (>10%): Critical problems
Troubleshooting:
- High error rates: Check Jaeger for failing traces
- Sudden spikes: Look for recent deployments or config changes
- Persistent errors: Check logs for recurring issues
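You can sanity-check the value shown here directly in Prometheus with the same error-rate expression used in the alerting example later in this guide (the metric names are assumed to match your instrumentation):
# Percentage of events that failed processing over the last 5 minutes
rate(event_errors_total[5m]) / rate(events_processed_total[5m]) * 100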
Event Types Distribution (Middle Left)
What it shows: Breakdown of event types by volume
How to use:
- Understand workload: See what types of tasks dominate
- Capacity planning: Identify which task types need scaling
- Anomaly detection: Unusual distributions may indicate issues
Example interpretation:
greeting: 40% (blue) - Most common task type
math_calculation: 35% (green) - Heavy computational tasks
random_number: 20% (yellow) - Quick tasks
unknown_task: 5% (red) - Error-generating tasks
Troubleshooting:
- Missing task types: Check if specific agents are down
- Unexpected distributions: May indicate upstream issues
- Dominant error types: Focus optimization efforts
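A breakdown like the one above is typically produced by summing over an event-type label. The label name event_type below is an assumption; adjust it to whatever label your metrics actually carry.
# Events per type over the last hour (label name assumed)
sum by (event_type) (increase(events_processed_total[1h]))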
Event Processing Latency (Middle Right)
What it shows: Processing time percentiles (p50, p95, p99)
How to use:
- Performance monitoring: Track how fast events are processed
- SLA compliance: Ensure latencies meet requirements
- Outlier detection: p99 shows worst-case scenarios
Understanding percentiles:
- p50 (median): 50% of events process faster than this
- p95: 95% of events process faster than this
- p99: 99% of events process faster than this
Healthy ranges:
- p50: < 50ms (very responsive)
- p95: < 200ms (good performance)
- p99: < 500ms (acceptable outliers)
Troubleshooting:
- Rising latencies: Check CPU/memory usage
- High p99: Look for resource contention or long-running tasks
- Flatlined metrics: May indicate measurement issues
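Percentile panels like this one are normally driven by histogram_quantile over a histogram metric. The metric name event_processing_duration_seconds_bucket below is an assumption; substitute the histogram your agents actually export.
# p95 processing latency per service (metric and label names assumed)
histogram_quantile(0.95,
  sum by (le, service) (rate(event_processing_duration_seconds_bucket[5m])))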
Distributed Traces (Middle Section)
What it shows: Integration with Jaeger for trace visualization
How to use:
- Click "Explore" to open Jaeger
- Select service from dropdown
- Find specific traces to debug issues
- Analyze request flows across services
When to use:
- Debugging errors: Find root cause of failures
- Performance analysis: Identify slow operations
- Understanding flows: See complete request journeys
Service CPU Usage (Bottom Left)
What it shows: CPU utilization by service
How to use:
- Capacity monitoring: Ensure services aren't overloaded
- Resource planning: Identify when to scale
- Performance correlation: High CPU often explains high latency
Healthy ranges:
- < 50%: Comfortable utilization
- 50-70%: Moderate load
- > 70%: Consider scaling
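If the agents expose the standard Prometheus Go client metrics, you can cross-check this panel in Explore with process_cpu_seconds_total; whether that collector is enabled depends on your instrumentation.
# Approximate CPU usage per instance, as a percentage of one core
rate(process_cpu_seconds_total[5m]) * 100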
Service Memory Usage (Bottom Center)
What it shows: Memory consumption by service
How to use:
- Memory leak detection: Watch for continuously growing usage
- Capacity planning: Ensure sufficient memory allocation
- Garbage collection: High usage may impact performance
Monitoring tips:
- Steady growth: May indicate memory leaks
- Sawtooth pattern: Normal GC behavior
- Sudden spikes: Check for large event batches
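For a raw view of the same data, the standard process and Go runtime metrics can be queried directly, assuming the default Go client collectors are registered:
# Resident memory per instance
process_resident_memory_bytes
# Go heap currently in use; steady growth here is a leak signal
go_memstats_heap_inuse_bytes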
Go Goroutines (Bottom Right)
What it shows: Number of concurrent goroutines per service
How to use:
- Concurrency monitoring: Track parallel processing
- Resource leak detection: Continuously growing numbers indicate leaks
- Performance tuning: Optimize concurrency levels
Normal patterns:
- Stable baseline: Normal operation
- Activity spikes: During high load
- Continuous growth: Potential goroutine leaks
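The metric behind this panel is typically go_goroutines from the standard Go client collectors (assuming they are enabled). A quick leak check is to look at how the count has changed over the last hour:
# Current goroutine count per instance
go_goroutines
# Change over the last hour; a persistently positive delta suggests a leak
delta(go_goroutines[1h])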
Service Health Status (Bottom Far Right)
What it shows: Up/down status of each service
How to use:
- Quick status check: See if all services are running
- Outage detection: Immediately identify down services
- Health monitoring: Green = UP, Red = DOWN
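This panel maps onto Prometheus's built-in up metric, which is 1 while a scrape target responds and 0 when it does not:
# 1 = target scraped successfully, 0 = scrape failing
up
# Show only the targets that are currently down
up == 0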
Dashboard Variables and Filters
Service Filter
Location: Top of dashboard
Purpose: Filter metrics by specific services
Usage:
- Select "All" to see everything
- Choose specific services to focus analysis
- Useful for isolating problems to specific components
Event Type Filter
Location: Top of dashboard
Purpose: Filter by event/task types
Usage:
- Analyze specific workflow types
- Debug particular task categories
- Compare performance across task types
Time Range Selector
Location: Top right of dashboard
Purpose: Control the time window for analysis
Common ranges:
- 5 minutes: Real-time monitoring
- 1 hour: Recent trend analysis
- 24 hours: Daily pattern analysis
- 7 days: Weekly trend and capacity planning
Advanced Usage Patterns
Performance Investigation Workflow
- Start with Overview:
  - Check error rates (should be < 5%)
  - Verify processing rates look normal
  - Scan for any red/yellow indicators
- Drill Down on Issues:
  - If high error rates → check distributed traces
  - If high latency → examine CPU/memory usage
  - If low throughput → check service health
- Root Cause Analysis:
  - Use the time range selector to find when problems started
  - Filter by specific services to isolate issues
  - Correlate metrics across different panels
Capacity Planning Workflow
- Analyze Peak Patterns:
  - Set the time range to 7 days
  - Identify peak usage periods
  - Note the maximum throughput achieved
- Resource Utilization:
  - Check CPU usage during peaks
  - Monitor memory consumption trends
  - Verify goroutine scaling behavior
- Plan Scaling:
  - If CPU > 70% during peaks, scale up
  - If memory is continuously growing, investigate leaks
  - If error rates spike during load, optimize before scaling
Troubleshooting Workflow
- Identify Symptoms:
  - High error rates: Focus on traces and logs
  - High latency: Check resource utilization
  - Low throughput: Verify service health
- Time Correlation:
  - Use the time range to find when issues started
  - Look for correlated changes across metrics
  - Check for deployment or configuration changes
- Service Isolation:
  - Use the service filter to identify problematic components
  - Compare healthy vs unhealthy services
  - Check inter-service dependencies
Dashboard Customization
Adding New Panels
- Click "+ Add panel" in the top menu
- Choose a visualization type:
  - Time series for trends
  - Stat for current values
  - Gauge for thresholds
- Configure the query:
  # Example: Custom error rate
  rate(my_custom_errors_total[5m]) / rate(my_custom_requests_total[5m]) * 100
Creating Alerts
- Edit an existing panel or create a new one
- Click the "Alert" tab
- Configure the conditions:
  Query: rate(event_errors_total[5m]) / rate(events_processed_total[5m]) * 100
  Condition: IS ABOVE 5
  Evaluation: Every 1m for 2m
- Set notification channels
Custom Time Ranges
- Click time picker (top right)
- Select "Custom range"
- Set specific dates/times for historical analysis
- Use "Refresh" settings for auto-updating
Troubleshooting Dashboard Issues
Dashboard Not Loading
# Check Grafana status
docker-compose ps grafana
# Check Grafana logs
docker-compose logs grafana
# Restart if needed
docker-compose restart grafana
No Data in Panels
# Check Prometheus connection
curl http://localhost:9090/api/v1/targets
# Verify agents are exposing metrics
curl http://localhost:8080/metrics
curl http://localhost:8081/metrics
curl http://localhost:8082/metrics
# Check Prometheus configuration
docker-compose logs prometheus
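If the targets look healthy but panels stay empty, query Prometheus directly to confirm the expected metrics are actually being stored. The metric name below is taken from the alerting example in this guide and may differ in your setup.
# Ask Prometheus for the current value of the events counter
curl 'http://localhost:9090/api/v1/query?query=events_processed_total'
# Confirm all scrape targets report as up
curl 'http://localhost:9090/api/v1/query?query=up'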
Slow Dashboard Performance
- Reduce time range: Use shorter windows for better performance
- Limit service selection: Filter to specific services
- Optimize queries: Use appropriate rate intervals
- Check resource usage: Ensure Prometheus has enough memory
Authentication Issues
- Default credentials: admin/admin
- Reset password: Through Grafana UI after first login
- Lost access: Recreate the Grafana container (and remove its data volume, if one is configured) to restore the default credentials
Best Practices
Regular Monitoring
- Check dashboard daily: Quick health overview
- Weekly reviews: Trend analysis and capacity planning
- Set up alerts: Proactive monitoring for critical metrics
Performance Optimization
- Use appropriate time ranges: Don't query more data than needed
- Filter effectively: Use service and event type filters
- Refresh intervals: Balance real-time needs with performance
Team Usage
- Share dashboard URLs: Bookmark specific views
- Create annotations: Mark deployments and incidents
- Export snapshots: Share findings with team members
Integration with Other Tools
Jaeger Integration
- Click Explore in traces panel
- Auto-links to Jaeger with service context
- Correlate traces with metrics timeframes
Prometheus Integration
- Click Explore on any panel
- Edit queries in Prometheus query language
- Access raw metrics for custom analysis
Log Correlation
- Use trace IDs from Jaeger
- Search logs for matching trace IDs
- Correlate log events with metric spikes
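A minimal sketch of this from the command line, assuming the agents run under docker-compose and include trace IDs in their log output:
# Paste a trace ID copied from Jaeger, then search the relevant service's logs
TRACE_ID="<trace-id-from-jaeger>"
docker-compose logs agenthub-subscriber | grep "$TRACE_ID"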
Next Steps:
Deep Debugging: Debug with Distributed Tracing
Production Setup: Configure Alerts
Understanding: Observability Architecture Explained