data-api/plans/grafana-time-range-api.md
Ask Bjørn Hansen 8262b1442f feat(api): add Grafana time range endpoint for scores
- Add /api/v2/server/scores/{server}/{mode} endpoint
- Support time range queries with from/to parameters
- Return data in Grafana table format for visualization
- Fix routing pattern to handle IP addresses correctly
- Add comprehensive parameter validation and error handling
2025-07-27 02:18:32 -07:00


DETAILED IMPLEMENTATION PLAN: Grafana Time Range API with Future Downsampling Support

Overview

Implement a new Grafana-compatible API endpoint, /api/v2/server/scores/{server}/{mode}, that returns time series data in Grafana table format for a requested time range and leaves room for future downsampling.

API Specification

Endpoint

  • URL: /api/v2/server/scores/{server}/{mode}
  • Method: GET
  • Path Parameters:
    • server: Server IP address or ID (same validation as existing API)
    • mode: Only json supported initially

Query Parameters (following Grafana conventions)

  • from: Unix timestamp in seconds (required)
  • to: Unix timestamp in seconds (required)
  • maxDataPoints: Integer, default 50000, max 50000 (for future downsampling)
  • monitor: Monitor ID, name prefix, or "*" for all (optional, same as existing)
  • interval: Future downsampling interval like "1m", "5m", "1h" (optional, not implemented initially)

Response Format

Grafana table format JSON array (more efficient than separate series):

[
  {
    "target": "monitor{name=zakim1-yfhw4a}",
    "tags": {
      "monitor_id": "126",
      "monitor_name": "zakim1-yfhw4a",
      "type": "monitor",
      "status": "active"
    },
    "columns": [
      {"text": "time", "type": "time"},
      {"text": "score", "type": "number"},
      {"text": "rtt", "type": "number", "unit": "ms"},
      {"text": "offset", "type": "number", "unit": "s"}
    ],
    "values": [
      [1753431667000, 20.0, 18.865, -0.000267],
      [1753431419000, 20.0, 18.96, -0.000390],
      [1753431151000, 20.0, 18.073, -0.000768],
      [1753430063000, 20.0, 18.209, null]
    ]
  }
]

Implementation Details

1. Server Routing (server/server.go)

Add new route after existing scores routes:

e.GET("/api/v2/server/scores/:server/:mode", srv.scoresTimeRange)

Note: The :server.:mode pattern was tried first, but the Echo router cannot reliably separate an IP address containing dots from the mode under that pattern. The route uses :server/:mode instead, which matches the existing API pattern and works with IP addresses like 23.155.40.38.

Key Implementation Clarifications

Monitor Filtering Behavior

  • monitor=*: Return ALL monitors (no monitor count limit)
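  • 50k datapoint limit: Applied in the database query (LIMIT clause), across all selected monitors combined
  • Rows from the database are returned as-is (no post-processing truncation); because the query orders by ts ASC, a range that exceeds the limit loses its newest rows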

Null Value Handling Strategy

  • Score: Always include (should never be null)
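  • RTT: Emit null in the rtt column when RTT is missing; the row is kept
  • Offset: Emit null in the offset column when offset is missing; the row is kept (as in the last row of the sample response)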

Time Range Validation Rules

  • Zero duration: Return 400 Bad Request
  • Future timestamps: Allow for now
  • Minimum range: 1 second
  • Maximum range: 90 days

2. New Handler Function (server/grafana.go)

Function Signature

func (srv *Server) scoresTimeRange(c echo.Context) error

Parameter Parsing & Validation

// Extend existing historyParameters struct for time range support
type timeRangeParams struct {
    historyParameters // embed existing struct
    from              time.Time  
    to                time.Time
    maxDataPoints     int
    interval          string // for future downsampling
}

func (srv *Server) parseTimeRangeParams(ctx context.Context, c echo.Context) (timeRangeParams, error) {
    // Start with existing parameter parsing logic
    baseParams, err := srv.getHistoryParameters(ctx, c)
    if err != nil {
        return timeRangeParams{}, err
    }
    // Assumes getHistoryParameters returns a historyParameters value
    params := timeRangeParams{historyParameters: baseParams}

    // TODO: parse and validate from/to (Unix timestamps in seconds)
    // TODO: validate time range (max 90 days, min 1 second)
    // TODO: parse maxDataPoints (default 50000, max 50000)

    // Return extended parameters
    return params, nil
}

Response Structure

type ColumnDef struct {
    Text string `json:"text"`
    Type string `json:"type"`
    Unit string `json:"unit,omitempty"`
}

type GrafanaTableSeries struct {
    Target  string            `json:"target"`
    Tags    map[string]string `json:"tags"`
    Columns []ColumnDef       `json:"columns"`
    Values  [][]interface{}   `json:"values"`
}

type GrafanaTimeSeriesResponse []GrafanaTableSeries
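
To illustrate how these structures serialize, the sketch below repeats the type definitions and marshals one series containing a null offset. The function exampleSeriesJSON is hypothetical; the sample values come from the response format shown earlier.

```go
package main

import "encoding/json"

type ColumnDef struct {
	Text string `json:"text"`
	Type string `json:"type"`
	Unit string `json:"unit,omitempty"`
}

type GrafanaTableSeries struct {
	Target  string            `json:"target"`
	Tags    map[string]string `json:"tags"`
	Columns []ColumnDef       `json:"columns"`
	Values  [][]interface{}   `json:"values"`
}

// exampleSeriesJSON marshals a single-row series. A nil cell in Values
// is written as JSON null, and map keys in Tags are written in sorted order.
func exampleSeriesJSON() (string, error) {
	series := []GrafanaTableSeries{{
		Target: "monitor{name=zakim1-yfhw4a}",
		Tags: map[string]string{
			"monitor_id":   "126",
			"monitor_name": "zakim1-yfhw4a",
		},
		Columns: []ColumnDef{
			{Text: "time", Type: "time"},
			{Text: "score", Type: "number"},
			{Text: "rtt", Type: "number", Unit: "ms"},
			{Text: "offset", Type: "number", Unit: "s"},
		},
		// time in ms, score, rtt in ms, offset missing
		Values: [][]interface{}{{int64(1753430063000), 20.0, 18.209, nil}},
	}}
	b, err := json.Marshal(series)
	return string(b), err
}
```

Note that encoding/json writes the float 20.0 as 20, which Grafana still reads as a number.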

Cache Control

// Reuse existing setHistoryCacheControl function for consistency
// Logic based on data recency and entry count:
// - Empty or >8h old data: "s-maxage=260,max-age=360"
// - Single entry: "s-maxage=60,max-age=35" 
// - Multiple entries: "s-maxage=90,max-age=120"
setHistoryCacheControl(c, history)

3. ClickHouse Data Access (chdb/logscores.go)

New Method

func (d *ClickHouse) LogscoresTimeRange(ctx context.Context, serverID, monitorID int, from, to time.Time, limit int) ([]ntpdb.LogScore, error) {
    // Build query with time range WHERE clause
    // Always order by ts ASC (Grafana convention)
    // Apply limit to prevent memory issues
    // Use same row scanning logic as existing Logscores method
}

Query Structure

SELECT id, monitor_id, server_id, ts,
       toFloat64(score), toFloat64(step), offset,
       rtt, leap, warning, error
FROM log_scores  
WHERE server_id = ?
  AND ts >= ?
  AND ts <= ?
  [AND monitor_id = ?]  -- if specific monitor requested
ORDER BY ts ASC
LIMIT ?

4. Data Transformation Logic (server/grafana.go)

Core Transformation Function

func transformToGrafanaTableFormat(history *logscores.LogScoreHistory, monitors []ntpdb.Monitor) GrafanaTimeSeriesResponse {
    // Group data by monitor_id (one series per monitor)
    // Create table format with columns: time, score, rtt, offset
    // Convert timestamps to milliseconds
    // Build proper target names and tags
    // Handle null values appropriately in table values
}

Grouping Strategy

  1. Group by Monitor: One table series per monitor
  2. Table Columns: time, score, rtt, offset (all metrics in one table)
  3. Target Naming: monitor{name={sanitized_monitor_name}}
  4. Tag Structure: Include monitor metadata (no metric type needed)
  5. Monitor Status: Query real monitor data using q.GetServerScores() like existing API
  6. Series Ordering: No guaranteed order (standard Grafana behavior)
  7. Efficiency: More efficient than separate series - less JSON overhead

Timestamp Conversion

timestampMs := logScore.Ts.UnixMilli() // Unix time in milliseconds; keeps any sub-second precision

5. Error Handling

Validation Errors (400 Bad Request)

  • Invalid timestamp format
  • from >= to (including zero duration)
  • Time range too large (> 90 days)
  • Time range too small (< 1 second minimum)
  • maxDataPoints > 50000
  • Invalid mode (not "json")

Not Found Errors (404)

  • Server not found
  • Monitor not found
  • Server deleted

Server Errors (500)

  • ClickHouse connection issues
  • Database query errors

6. Future Downsampling Design

API Extension Points

  • interval parameter parsing ready
  • maxDataPoints limit already enforced
  • Response format supports downsampled data seamlessly

Downsampling Algorithm (Future Implementation)

// When datapoints > maxDataPoints:
// 1. Calculate downsample interval: (to - from) / maxDataPoints
// 2. Group data into time buckets  
// 3. Aggregate per bucket: avg for score/rtt, last for offset
// 4. Return aggregated datapoints

Testing Strategy

Unit Tests

  • Parameter parsing and validation
  • Data transformation logic
  • Error handling scenarios
  • Timestamp conversion accuracy

Integration Tests

  • End-to-end API requests
  • ClickHouse query execution
  • Multiple monitor scenarios
  • Large time range handling

Manual Testing

  • Grafana integration testing
  • Performance with various time ranges
  • Cache behavior validation

Performance Considerations

Current Implementation

  • 50k datapoint limit applied in the database query (LIMIT clause); this covers roughly a few weeks of data
  • ClickHouse-only for better range query performance
  • Proper indexing on (server_id, ts) assumed
  • Table format more efficient than separate time series (less JSON overhead)

Future Optimizations (Critical for Production)

  • Downsampling for large ranges: Essential for 90-day queries with reasonable performance
  • Query optimization based on range size
  • Potential parallel monitor queries
  • Adaptive sampling rates based on time range duration

Documentation Updates

API.md Addition

### 7. Server Scores Time Range (v2)

**GET** `/api/v2/server/scores/{server}/{mode}`

Grafana-compatible time series endpoint for NTP server scoring data.

#### Path Parameters
- `server`: Server IP address or ID
- `mode`: Response format (`json` only)

#### Query Parameters  
- `from`: Start time as Unix timestamp in seconds (required)
- `to`: End time as Unix timestamp in seconds (required)
- `maxDataPoints`: Maximum data points to return (default: 50000, max: 50000)
- `monitor`: Monitor filter (ID, name prefix, or "*" for all)

#### Response Format
Grafana table format array with one series per monitor containing all metrics as columns.

Key Research Findings

Grafana Error Format Requirements

  • HTTP Status Codes: Standard 400/404/500 work fine
  • Response Body: JSON preferred with Content-Type: application/json
  • Structure: Simple {"error": "message", "status": code} is sufficient
  • Compatibility: Existing Echo error patterns are Grafana-compatible
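
The error body described above is small enough to sketch directly. The apiError type and errorBody function are hypothetical names; the existing Echo error handling may produce the body in a different way.

```go
package main

import "encoding/json"

// apiError mirrors the {"error": "message", "status": code} structure.
type apiError struct {
	Error  string `json:"error"`
	Status int    `json:"status"`
}

// errorBody returns the JSON body to send with the given HTTP status.
func errorBody(status int, msg string) ([]byte, error) {
	return json.Marshal(apiError{Error: msg, Status: status})
}
```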

Data Volume Considerations

  • 50k Datapoint Limit: Covers only a few weeks of data, which is not enough for 90-day ranges
  • Downsampling Critical: Required for production use with 90-day time ranges
  • Current Approach: Acceptable for an MVP; downsampling is essential for full utility

Implementation Checklist

Phase 0: Grafana Table Format Validation (COMPLETED)

  • Add test endpoint /api/v2/test/grafana-table returning sample table format
  • Implement Grafana table format response structures in server/grafana.go
  • Add structured logging and OpenTelemetry tracing to test endpoint
  • Verify endpoint compiles and serves correct JSON format
  • Test endpoint response format and headers (CORS, Content-Type, Cache-Control)
  • Test with actual Grafana instance to validate table format compatibility
  • Confirm time series panels render table format correctly
  • Validate column types and units display properly

Phase 0 Implementation Details

Files Created/Modified:

  • server/grafana.go: New file containing Grafana table format structures and test endpoint
  • server/server.go: Added route e.GET("/api/v2/test/grafana-table", srv.testGrafanaTable)

Test Endpoint Features:

  • URL: http://localhost:8030/api/v2/test/grafana-table
  • Response Format: Grafana table format with realistic NTP Pool data
  • Sample Data: Two monitor series (zakim1-yfhw4a, nj2-mon01) with time-based values
  • Columns: time, score, rtt (ms), offset (s) with proper units
  • Null Handling: Demonstrates null offset values
  • Headers: CORS, JSON content-type, cache control
  • Observability: Structured logging with context, OpenTelemetry tracing

Recommended Grafana Data Source: JSON API plugin (marcusolsson-json-datasource) - ideal for REST APIs returning table format JSON

Phase 1: Core Implementation (COMPLETED)

  • Add route in server.go (fixed routing pattern from :server.:mode to :server/:mode)
  • Implement parseTimeRangeParams function for parameter validation
  • Add LogscoresTimeRange method to ClickHouse with time range filtering
  • Implement transformToGrafanaTableFormat function with monitor grouping
  • Add scoresTimeRange handler with full error handling
  • Error handling and validation (reuse existing Echo patterns)
  • Cache control headers (reuse setHistoryCacheControl)

Phase 1 Implementation Details

Key Components Built:

  • Route Pattern: /api/v2/server/scores/:server/:mode (matches existing API consistency)
  • Parameter Validation: Full validation of from/to timestamps, maxDataPoints, time ranges
  • ClickHouse Integration: LogscoresTimeRange() with time-based WHERE clauses and ASC ordering
  • Data Transformation: Grafana table format with monitor grouping and null value handling
  • Complete Handler: scoresTimeRange() with server validation, error handling, caching, and CORS

Routing Fix: Changed from :server.:mode to :server/:mode to resolve Echo router issue with IP addresses containing dots (e.g., 23.155.40.38).

Files Created/Modified in Phase 1:

  • server/grafana.go: Complete implementation with all structures and functions
    • timeRangeParams struct and parseTimeRangeParams() function
    • transformToGrafanaTableFormat() function with monitor grouping
    • scoresTimeRange() handler with full error handling
    • sanitizeMonitorName() utility function
  • server/server.go: Added route e.GET("/api/v2/server/scores/:server/:mode", srv.scoresTimeRange)
  • chdb/logscores.go: Added LogscoresTimeRange() method for time-based queries

Production Testing Results (July 25, 2025):

  • Real Data Verification: Successfully tested with server 102.64.112.164 over 12-hour time range
  • Multiple Monitor Support: Returns data for multiple monitors (defra1-210hw9t, recentmedian)
  • Data Quality Validation:
    • RTT conversion (microseconds → milliseconds): Working
    • Timestamp conversion (seconds → milliseconds): Working
    • Null value handling: Working (recentmedian has null RTT/offset as expected)
    • Monitor grouping: Working (one series per monitor)
  • API Parameter Changes: from/to successfully changed from milliseconds to seconds for user-friendliness
  • Volume Testing: Handles 100+ data points per monitor efficiently
  • Error Handling: All validation working (400 for invalid params, 404 for missing servers)
  • Performance: Sub-second response times for 12-hour ranges

Sample Working Request:

curl 'http://localhost:8030/api/v2/server/scores/102.64.112.164/json?from=1753457764&to=1753500964&monitor=*'

Phase 2: Testing & Polish

  • Unit tests for all functions
  • Integration tests
  • Manual Grafana testing with real data
  • Performance testing with large ranges (up to 50k points)
  • API documentation updates

Phase 3: Future Enhancement Ready

  • Interval parameter parsing (no-op initially)
  • Downsampling framework hooks (critical for 90-day ranges)
  • Monitoring and metrics for new endpoint

This design supports immediate Grafana integration, and the interval and maxDataPoints parameters leave room to add downsampling later without breaking changes.

Critical Notes for Production

  • Downsampling Required: With the 50k datapoint limit and ts ASC ordering, a 90-day range is cut off after its first few weeks of data
  • Table Format Validation: Phase 0 testing ensures Grafana compatibility before full implementation
  • Error Handling: Existing Echo patterns are sufficient for Grafana requirements
  • Scalability: The current design handles weeks of data well; downsampling is needed for months