Monitoring
Summary
Monitoring combines:
- runtime metrics from train/predict jobs
- drift metrics (data, target, concept proxy)
- aggregated MLflow experiment metrics
Local stack
Start: `make monitoring_up`
Stop: `make monitoring_down`
Status/logs: `make monitoring_status`, `make monitoring_logs`
Services
- Grafana: http://localhost:3000 (admin/admin)
- Prometheus: http://localhost:9090
- Pushgateway: http://localhost:9091
- MLflow exporter: http://localhost:8010/metrics
Provisioned dashboards
- Recent Train and Predict Runs
- Operational Health
- MLflow Quality: Current vs History
- NAB Analysis
- MLflow Static Overview
- Drift Artifacts
The MLflow quality dashboard compares the latest run's metrics against historical context from previous runs, including:
- point metrics (`val_precision`, `val_recall`, `val_f1`)
- changepoint metrics (`val_cp_precision`, `val_cp_recall`, `val_cp_f1`)
- NAB metrics (`val_nab_standard`, `val_nab_low_fp`, `val_nab_low_fn`)
- train/inference drift metrics
- labeled aggregations (`latest`, `previous_mean`, `rolling_mean_5`, `all_mean`, `best`)
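As a hedged sketch of how the labeled aggregations above could be derived from a chronological list of one metric's values across runs (oldest first) — the label names come from this document, while the helper function and its name are illustrative assumptions, not the project's actual code:

```python
def aggregate_metric(history: list[float]) -> dict[str, float]:
    """Return the aggregation labels shown in the quality dashboard.

    `history` holds one metric's values across runs, oldest first.
    """
    if not history:
        return {}
    previous = history[:-1]  # every run except the latest
    return {
        "latest": history[-1],
        "previous_mean": sum(previous) / len(previous) if previous else history[-1],
        "rolling_mean_5": sum(history[-5:]) / len(history[-5:]),
        "all_mean": sum(history) / len(history),
        "best": max(history),  # assumes higher is better (e.g. val_f1)
    }

print(aggregate_metric([0.71, 0.74, 0.78, 0.80]))
```

The `best` aggregation assumes a higher-is-better metric; for loss-like metrics it would use `min` instead.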
MLflow Static Overview provides non-time-based views for all existing runs and
latest-run comparisons, including Pareto-like proxies:
- `(val_f1 + val_nab_standard) / 2`
- `val_f1 / train_duration_seconds`
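Expressed as code, the two proxy formulas above look like this; the function names are illustrative, only the formulas themselves come from the dashboard:

```python
def quality_proxy(val_f1: float, val_nab_standard: float) -> float:
    # Weighs classification quality and NAB score equally.
    return (val_f1 + val_nab_standard) / 2

def efficiency_proxy(val_f1: float, train_duration_seconds: float) -> float:
    # Rewards quality achieved per second of training time.
    return val_f1 / train_duration_seconds
```

The first proxy favors runs that balance F1 and NAB performance; the second penalizes runs that buy quality with long training times.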
Drift Artifacts surfaces values from model artifact files:
- `models/<model>/drift_reference.json`
- `models/<model>/drift_report.json`
with labels for model_name, dataset_scenario, feature, and metric names.
It also includes a Top 10 Drifted Features by PSI panel (`topk(10, ...)`).
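A minimal sketch of selecting the top-k drifted features by PSI from a drift report, mirroring the dashboard's `topk(10, ...)` query. The JSON layout used here is an assumption about the shape of `models/<model>/drift_report.json`, not its documented schema:

```python
import json

def top_drifted_features(report_json: str, k: int = 10) -> list[tuple[str, float]]:
    """Return (feature, psi) pairs sorted by descending PSI."""
    report = json.loads(report_json)
    psi_by_feature = {
        feature: metrics["psi"]
        for feature, metrics in report.get("features", {}).items()
        if "psi" in metrics  # skip features without a PSI value
    }
    return sorted(psi_by_feature.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Toy report with the assumed layout:
example = json.dumps({"features": {"temp": {"psi": 0.42}, "load": {"psi": 0.07}}})
print(top_drifted_features(example))
```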
External monitoring mode
Use external services by setting:
- `MONITORING_ENABLED`
- `PROMETHEUS_PUSHGATEWAY_URL`
- `PROMETHEUS_GROUPING_ENV`
- `PROMETHEUS_GROUPING_SERVICE`
In this mode, local docker compose is optional.
Troubleshooting empty Grafana panels
If Recent Train and Predict Runs or Operational Health are empty:
- Ensure train/predict jobs were run after monitoring was enabled.
- Verify the Pushgateway URL: `PROMETHEUS_PUSHGATEWAY_URL=http://localhost:9091`
- Check that the Pushgateway exposes app metrics: http://localhost:9091/metrics should include `anomaly_pipeline_*` series.
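The last check can be scripted as below. The `anomaly_pipeline_` prefix is from this document; the function names and the fetch helper are illustrative, and `check_pushgateway` assumes the local stack is running:

```python
from urllib.request import urlopen

def has_app_metrics(metrics_text: str, prefix: str = "anomaly_pipeline_") -> bool:
    """True if any non-comment line exposes a metric with the given prefix."""
    return any(
        line.startswith(prefix)
        for line in metrics_text.splitlines()
        if line and not line.startswith("#")  # skip HELP/TYPE comments
    )

def check_pushgateway(url: str = "http://localhost:9091/metrics") -> bool:
    with urlopen(url) as resp:  # requires the Pushgateway to be reachable
        return has_app_metrics(resp.read().decode())
```

If `has_app_metrics` returns False on a live Pushgateway, the train/predict jobs likely ran before monitoring was enabled, or they pushed to a different gateway URL.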