
DETAILED IMPLEMENTATION PLAN: Grafana Time Range API with Future Downsampling Support

Overview

Implement a new Grafana-compatible API endpoint, /api/v2/server/scores/{server}.{mode}, that returns time series data in Grafana's table format, with time range support and extension points for future downsampling.

API Specification

Endpoint

  • URL: /api/v2/server/scores/{server}.{mode}
  • Method: GET
  • Path Parameters:
    • server: Server IP address or ID (same validation as existing API)
    • mode: Only json supported initially

Query Parameters (following Grafana conventions)

  • from: Unix timestamp in milliseconds (required)
  • to: Unix timestamp in milliseconds (required)
  • maxDataPoints: Integer, default 50000, max 50000 (for future downsampling)
  • monitor: Monitor ID, name prefix, or "*" for all (optional, same as existing)
  • interval: Future downsampling interval like "1m", "5m", "1h" (optional, not implemented initially)
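
Example request (server IP and timestamps are illustrative): /api/v2/server/scores/203.0.113.5.json?from=1753400000000&to=1753431667000&monitor=*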

Response Format

Grafana table format JSON array (more efficient than separate series):

[
  {
    "target": "monitor{name=zakim1-yfhw4a}",
    "tags": {
      "monitor_id": "126",
      "monitor_name": "zakim1-yfhw4a",
      "type": "monitor",
      "status": "active"
    },
    "columns": [
      {"text": "time", "type": "time"},
      {"text": "score", "type": "number"},
      {"text": "rtt", "type": "number", "unit": "ms"},
      {"text": "offset", "type": "number", "unit": "s"}
    ],
    "values": [
      [1753431667000, 20.0, 18.865, -0.000267],
      [1753431419000, 20.0, 18.96, -0.000390],
      [1753431151000, 20.0, 18.073, -0.000768],
      [1753430063000, 20.0, 18.209, null]
    ]
  }
]

Implementation Details

1. Server Routing (server/server.go)

Add new route after existing scores routes:

e.GET("/api/v2/server/scores/:server.:mode", srv.scoresTimeRange)

Key Implementation Clarifications

Monitor Filtering Behavior

  • monitor=*: Return ALL monitors (no monitor count limit)
  • 50k datapoint limit: Applied in database query (LIMIT clause)
  • Return whatever data we get from database to user (no post-processing truncation)

Null Value Handling Strategy

  • Score: Always include (should never be null)
  • RTT: Skip datapoints where RTT is null
  • Offset: Skip datapoints where offset is null

Time Range Validation Rules

  • Zero duration: Return 400 Bad Request
  • Future timestamps: Allow for now
  • Minimum range: 1 second
  • Maximum range: 90 days

2. New Handler Function (server/history.go)

Function Signature

func (srv *Server) scoresTimeRange(c echo.Context) error

Parameter Parsing & Validation

// Extend existing historyParameters struct for time range support
type timeRangeParams struct {
    historyParameters // embed existing struct
    from              time.Time
    to                time.Time
    maxDataPoints     int
    interval          string // for future downsampling
}

func (srv *Server) parseTimeRangeParams(ctx context.Context, c echo.Context) (timeRangeParams, error) {
    // Start with existing parameter parsing logic
    baseParams, err := srv.getHistoryParameters(ctx, c)
    if err != nil {
        return timeRangeParams{}, err
    }
    p := timeRangeParams{historyParameters: baseParams}

    // Parse and validate from/to millisecond timestamps
    fromMs, errFrom := strconv.ParseInt(c.QueryParam("from"), 10, 64)
    toMs, errTo := strconv.ParseInt(c.QueryParam("to"), 10, 64)
    if errFrom != nil || errTo != nil {
        return timeRangeParams{}, echo.NewHTTPError(http.StatusBadRequest, "invalid from/to timestamp")
    }
    p.from, p.to = time.UnixMilli(fromMs), time.UnixMilli(toMs)

    // Validate time range (max 90 days, min 1 second)
    if d := p.to.Sub(p.from); d < time.Second || d > 90*24*time.Hour {
        return timeRangeParams{}, echo.NewHTTPError(http.StatusBadRequest, "time range must be between 1 second and 90 days")
    }

    // Parse maxDataPoints (default 50000, max 50000)
    p.maxDataPoints = 50000
    if s := c.QueryParam("maxDataPoints"); s != "" {
        n, err := strconv.Atoi(s)
        if err != nil || n < 1 || n > 50000 {
            return timeRangeParams{}, echo.NewHTTPError(http.StatusBadRequest, "invalid maxDataPoints")
        }
        p.maxDataPoints = n
    }

    // interval is accepted but unused until downsampling is implemented
    p.interval = c.QueryParam("interval")
    return p, nil
}

Response Structure

type ColumnDef struct {
    Text string `json:"text"`
    Type string `json:"type"`
    Unit string `json:"unit,omitempty"`
}

type GrafanaTableSeries struct {
    Target  string            `json:"target"`
    Tags    map[string]string `json:"tags"`
    Columns []ColumnDef       `json:"columns"`
    Values  [][]interface{}   `json:"values"`
}

type GrafanaTimeSeriesResponse []GrafanaTableSeries

Cache Control

// Reuse existing setHistoryCacheControl function for consistency
// Logic based on data recency and entry count:
// - Empty or >8h old data: "s-maxage=260,max-age=360"
// - Single entry: "s-maxage=60,max-age=35" 
// - Multiple entries: "s-maxage=90,max-age=120"
setHistoryCacheControl(c, history)
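
A hedged end-to-end sketch of how scoresTimeRange could wire these pieces together. It leans on LogscoresTimeRange and transformToGrafanaTableFormat, defined in sections 3 and 4 below; the srv.ch handle, the serverID/monitorID/monitors fields on the params, and the LogScoreHistory construction are assumed names, not existing code:

func (srv *Server) scoresTimeRange(c echo.Context) error {
    ctx := c.Request().Context()

    params, err := srv.parseTimeRangeParams(ctx, c)
    if err != nil {
        return err // already an *echo.HTTPError carrying the 400 status
    }

    // srv.ch and the serverID/monitorID fields are illustrative names.
    logScores, err := srv.ch.LogscoresTimeRange(ctx,
        params.serverID, params.monitorID,
        params.from, params.to, params.maxDataPoints)
    if err != nil {
        return echo.NewHTTPError(http.StatusInternalServerError, "database error")
    }

    // Wrap rows so the existing cache-control helper can inspect them;
    // the LogScoreHistory field name is assumed here.
    history := &logscores.LogScoreHistory{LogScores: logScores}

    resp := transformToGrafanaTableFormat(history, params.monitors) // monitors list assumed on params
    setHistoryCacheControl(c, history)
    return c.JSON(http.StatusOK, resp)
}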

3. ClickHouse Data Access (chdb/logscores.go)

New Method

func (d *ClickHouse) LogscoresTimeRange(ctx context.Context, serverID, monitorID int, from, to time.Time, limit int) ([]ntpdb.LogScore, error) {
    // Build query with time range WHERE clause
    // Always order by ts ASC (Grafana convention)
    // Apply limit to prevent memory issues
    // Use same row scanning logic as existing Logscores method
}

Query Structure

SELECT id, monitor_id, server_id, ts,
       toFloat64(score), toFloat64(step), offset,
       rtt, leap, warning, error
FROM log_scores  
WHERE server_id = ?
  AND ts >= ?
  AND ts <= ?
  [AND monitor_id = ?]  -- if specific monitor requested
ORDER BY ts ASC
LIMIT ?
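
A minimal sketch of the LogscoresTimeRange body above, assuming a database/sql-style handle on the ClickHouse wrapper (d.db is an illustrative name) and LogScore field names mirroring the SELECT columns; the real implementation would reuse the scanning logic of the existing Logscores method:

func (d *ClickHouse) LogscoresTimeRange(ctx context.Context, serverID, monitorID int, from, to time.Time, limit int) ([]ntpdb.LogScore, error) {
    query := `SELECT id, monitor_id, server_id, ts,
                     toFloat64(score), toFloat64(step), offset,
                     rtt, leap, warning, error
              FROM log_scores
              WHERE server_id = ? AND ts >= ? AND ts <= ?`
    args := []interface{}{serverID, from, to}
    if monitorID != 0 { // 0 is treated as "all monitors" in this sketch
        query += " AND monitor_id = ?"
        args = append(args, monitorID)
    }
    // Always order by ts ASC (Grafana convention) and cap the row count.
    query += " ORDER BY ts ASC LIMIT ?"
    args = append(args, limit)

    rows, err := d.db.QueryContext(ctx, query, args...) // d.db: assumed *sql.DB-style handle
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    var out []ntpdb.LogScore
    for rows.Next() {
        var ls ntpdb.LogScore
        // Field names are illustrative; null handling as in Logscores.
        if err := rows.Scan(&ls.ID, &ls.MonitorID, &ls.ServerID, &ls.Ts,
            &ls.Score, &ls.Step, &ls.Offset, &ls.Rtt,
            &ls.Leap, &ls.Warning, &ls.Error); err != nil {
            return nil, err
        }
        out = append(out, ls)
    }
    return out, rows.Err()
}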

4. Data Transformation Logic (server/history.go)

Core Transformation Function

func transformToGrafanaTableFormat(history *logscores.LogScoreHistory, monitors []ntpdb.Monitor) GrafanaTimeSeriesResponse {
    // Group data by monitor_id (one series per monitor)
    // Create table format with columns: time, score, rtt, offset
    // Convert timestamps to milliseconds
    // Build proper target names and tags
    // Handle null values appropriately in table values
}

Grouping Strategy

  1. Group by Monitor: One table series per monitor
  2. Table Columns: time, score, rtt, offset (all metrics in one table)
  3. Target Naming: monitor{name={sanitized_monitor_name}}
  4. Tag Structure: Include monitor metadata (no metric type needed)
  5. Monitor Status: Query real monitor data using q.GetServerScores() like existing API
  6. Series Ordering: No guaranteed order (standard Grafana behavior)
  7. Efficiency: More efficient than separate series - less JSON overhead

Timestamp Conversion

timestampMs := logScore.Ts.UnixMilli() // milliseconds since epoch, as Grafana expects
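
Combining the grouping strategy and timestamp conversion, a hedged sketch of transformToGrafanaTableFormat; the LogScores field on LogScoreHistory and the Monitor/LogScore field names are assumptions:

func transformToGrafanaTableFormat(history *logscores.LogScoreHistory, monitors []ntpdb.Monitor) GrafanaTimeSeriesResponse {
    // Index monitors by ID for target names and tags (field names assumed).
    byID := map[int]ntpdb.Monitor{}
    for _, m := range monitors {
        byID[m.ID] = m
    }

    columns := []ColumnDef{
        {Text: "time", Type: "time"},
        {Text: "score", Type: "number"},
        {Text: "rtt", Type: "number", Unit: "ms"},
        {Text: "offset", Type: "number", Unit: "s"},
    }

    // Group data by monitor_id: one table series per monitor.
    series := map[int]*GrafanaTableSeries{}
    for _, ls := range history.LogScores {
        s, ok := series[ls.MonitorID]
        if !ok {
            m := byID[ls.MonitorID]
            s = &GrafanaTableSeries{
                Target: fmt.Sprintf("monitor{name=%s}", m.Name),
                Tags: map[string]string{
                    "monitor_id":   strconv.Itoa(m.ID),
                    "monitor_name": m.Name,
                    "type":         "monitor",
                },
                Columns: columns,
            }
            series[ls.MonitorID] = s
        }
        // Convert timestamps to milliseconds; nullable rtt/offset values are
        // emitted as JSON null (or skipped, per the null-handling rules above).
        s.Values = append(s.Values, []interface{}{
            ls.Ts.UnixMilli(), ls.Score, ls.Rtt, ls.Offset,
        })
    }

    out := make(GrafanaTimeSeriesResponse, 0, len(series))
    for _, s := range series {
        out = append(out, *s)
    }
    return out
}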

5. Error Handling

Validation Errors (400 Bad Request)

  • Invalid timestamp format
  • from >= to (including zero duration)
  • Time range too large (> 90 days)
  • Time range too small (< 1 second minimum)
  • maxDataPoints > 50000
  • Invalid mode (not "json")

Not Found Errors (404)

  • Server not found
  • Monitor not found
  • Server deleted

Server Errors (500)

  • ClickHouse connection issues
  • Database query errors
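
These map naturally onto Echo's standard error type; for example, a validation failure can be returned as:

// Echo's default error handler serializes this as a JSON error body.
return echo.NewHTTPError(http.StatusBadRequest, "time range must be between 1 second and 90 days")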

6. Future Downsampling Design

API Extension Points

  • interval parameter parsing ready
  • maxDataPoints limit already enforced
  • Response format supports downsampled data seamlessly

Downsampling Algorithm (Future Implementation)

// When datapoints > maxDataPoints:
// 1. Calculate downsample interval: (to - from) / maxDataPoints
// 2. Group data into time buckets  
// 3. Aggregate per bucket: avg for score/rtt, last for offset
// 4. Return aggregated datapoints
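
A hedged sketch of that future pass, operating on a simplified datapoint type; all names here are illustrative, not existing code:

// point is a simplified datapoint used only in this sketch.
type point struct {
    tsMs   int64
    score  float64
    rtt    float64
    offset float64
}

// downsample groups ts-ascending points into fixed-width time buckets and
// aggregates per bucket: avg for score/rtt, last value for offset.
func downsample(points []point, fromMs, toMs int64, maxDataPoints int) []point {
    if len(points) <= maxDataPoints {
        return points
    }
    // 1. Calculate downsample interval: (to - from) / maxDataPoints
    width := (toMs - fromMs) / int64(maxDataPoints)
    if width < 1 {
        width = 1
    }

    // 2. Group data into time buckets keyed by bucket start time.
    type agg struct {
        n                int
        scoreSum, rttSum float64
        lastOffset       float64
    }
    buckets := map[int64]*agg{}
    for _, p := range points {
        start := fromMs + ((p.tsMs-fromMs)/width)*width
        a := buckets[start]
        if a == nil {
            a = &agg{}
            buckets[start] = a
        }
        a.n++
        a.scoreSum += p.score
        a.rttSum += p.rtt
        a.lastOffset = p.offset // input is ts ASC, so this ends up "last"
    }

    // 3. Aggregate per bucket and emit in time order.
    starts := make([]int64, 0, len(buckets))
    for s := range buckets {
        starts = append(starts, s)
    }
    sort.Slice(starts, func(i, j int) bool { return starts[i] < starts[j] })

    out := make([]point, 0, len(starts))
    for _, s := range starts {
        a := buckets[s]
        out = append(out, point{
            tsMs:   s,
            score:  a.scoreSum / float64(a.n),
            rtt:    a.rttSum / float64(a.n),
            offset: a.lastOffset,
        })
    }
    return out
}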

Testing Strategy

Unit Tests

  • Parameter parsing and validation
  • Data transformation logic
  • Error handling scenarios
  • Timestamp conversion accuracy

Integration Tests

  • End-to-end API requests
  • ClickHouse query execution
  • Multiple monitor scenarios
  • Large time range handling

Manual Testing

  • Grafana integration testing
  • Performance with various time ranges
  • Cache behavior validation

Performance Considerations

Current Implementation

  • 50k datapoint limit applied via the database query's LIMIT clause (covers roughly a few weeks of data)
  • ClickHouse-only for better range query performance
  • Proper indexing on (server_id, ts) assumed
  • Table format more efficient than separate time series (less JSON overhead)

Future Optimizations (Critical for Production)

  • Downsampling for large ranges: Essential for 90-day queries with reasonable performance
  • Query optimization based on range size
  • Potential parallel monitor queries
  • Adaptive sampling rates based on time range duration

Documentation Updates

API.md Addition

### 7. Server Scores Time Range (v2)

**GET** `/api/v2/server/scores/{server}.{mode}`

Grafana-compatible time series endpoint for NTP server scoring data.

#### Path Parameters
- `server`: Server IP address or ID
- `mode`: Response format (`json` only)

#### Query Parameters  
- `from`: Start time as Unix timestamp in milliseconds (required)
- `to`: End time as Unix timestamp in milliseconds (required)
- `maxDataPoints`: Maximum data points to return (default: 50000, max: 50000)
- `monitor`: Monitor filter (ID, name prefix, or "*" for all)

#### Response Format
Grafana table format array with one series per monitor containing all metrics as columns.

Key Research Findings

Grafana Error Format Requirements

  • HTTP Status Codes: Standard 400/404/500 work fine
  • Response Body: JSON preferred with Content-Type: application/json
  • Structure: Simple {"error": "message", "status": code} is sufficient
  • Compatibility: Existing Echo error patterns are Grafana-compatible

Data Volume Considerations

  • 50k Datapoint Limit: Covers only roughly a few weeks of data, not sufficient for 90-day ranges
  • Downsampling Critical: Required for production use with 90-day time ranges
  • Current Approach: Acceptable for MVP, downsampling essential for full utility

Implementation Checklist

Phase 0: Grafana Table Format Validation (COMPLETED)

  • Add test endpoint /api/v2/test/grafana-table returning sample table format
  • Implement Grafana table format response structures in server/grafana.go
  • Add structured logging and OpenTelemetry tracing to test endpoint
  • Verify endpoint compiles and serves correct JSON format
  • Test endpoint response format and headers (CORS, Content-Type, Cache-Control)
  • Test with actual Grafana instance to validate table format compatibility
  • Confirm time series panels render table format correctly
  • Validate column types and units display properly

Phase 0 Implementation Details

Files Created/Modified:

  • server/grafana.go: New file containing Grafana table format structures and test endpoint
  • server/server.go: Added route e.GET("/api/v2/test/grafana-table", srv.testGrafanaTable)

Test Endpoint Features:

  • URL: http://localhost:8030/api/v2/test/grafana-table
  • Response Format: Grafana table format with realistic NTP Pool data
  • Sample Data: Two monitor series (zakim1-yfhw4a, nj2-mon01) with time-based values
  • Columns: time, score, rtt (ms), offset (s) with proper units
  • Null Handling: Demonstrates null offset values
  • Headers: CORS, JSON content-type, cache control
  • Observability: Structured logging with context, OpenTelemetry tracing
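
As a rough sketch only (the committed server/grafana.go is authoritative), the test handler follows the usual Echo pattern:

func (srv *Server) testGrafanaTable(c echo.Context) error {
    // Build the sample response (two monitor series with realistic values);
    // construction elided here.
    resp := GrafanaTimeSeriesResponse{}

    // CORS and cache headers as listed above (values illustrative).
    c.Response().Header().Set("Access-Control-Allow-Origin", "*")
    c.Response().Header().Set("Cache-Control", "s-maxage=60,max-age=35")
    return c.JSON(http.StatusOK, resp)
}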

Recommended Grafana Data Source: JSON API plugin (marcusolsson-json-datasource) - ideal for REST APIs returning table format JSON

Phase 1: Core Implementation

  • Add route in server.go
  • Implement parseTimeRangeParams function
  • Add LogscoresTimeRange method to ClickHouse
  • Implement transformToGrafanaTableFormat function
  • Add scoresTimeRange handler
  • Error handling and validation (reuse existing Echo patterns)
  • Cache control headers (reuse setHistoryCacheControl)

Phase 2: Testing & Polish

  • Unit tests for all functions
  • Integration tests
  • Manual Grafana testing with real data
  • Performance testing with large ranges (up to 50k points)
  • API documentation updates

Phase 3: Future Enhancement Ready

  • Interval parameter parsing (no-op initially)
  • Downsampling framework hooks (critical for 90-day ranges)
  • Monitoring and metrics for new endpoint

This design provides a solid foundation for immediate Grafana integration while being fully prepared for future downsampling capabilities without breaking changes.

Critical Notes for Production

  • Downsampling Required: The 50k datapoint limit means 90-day ranges will hit the cap quickly
  • Table Format Validation: Phase 0 testing ensures Grafana compatibility before full implementation
  • Error Handling: Existing Echo patterns are sufficient for Grafana requirements
  • Scalability: Current design handles weeks of data well, downsampling needed for months