Open Source — CLI + Web Dashboard

Evaluate AI Agent Skills
with Confidence

Multi-dimensional scoring, pipeline evaluation, and analytics for SKILL.md files. Open-source CLI + web dashboard.

S · A · B · C · D · F

Everything you need to evaluate AI skills

From multi-dimensional rubrics to full CI pipelines — md-evals gives you the tools to measure, track, and improve AI agent performance.

Multi-Dimensional Scoring

Evaluate skills across 7 quality dimensions with configurable YAML rubrics. Letter grades from S to F give instant clarity.

Correctness · Completeness · Format Adherence · Safety · Efficiency · Robustness
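A rubric for these dimensions might look like the sketch below. The key names and weights are illustrative assumptions about the YAML shape, not md-evals' documented schema:

```yaml
# rubric.yaml — illustrative sketch; keys and weights are assumptions,
# not the documented md-evals schema
dimensions:
  correctness:       { weight: 0.30 }
  completeness:      { weight: 0.20 }
  format_adherence:  { weight: 0.15 }
  safety:            { weight: 0.15 }
  efficiency:        { weight: 0.10 }
  robustness:        { weight: 0.10 }
grades:   # minimum weighted score for each letter grade
  S: 0.95
  A: 0.85
  B: 0.75
  C: 0.65
  D: 0.50
```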

Triple-Model Pipeline

Three-stage evaluation: the Auditor analyzes, the Target executes, and the Judge scores. Assign a different model to each stage to reduce evaluation bias.

Auditor · Target · Judge

Probe & Detector Plugins

Built-in probes for dimensions, edge cases, compliance, and Gherkin scenarios. Extend with your own via Python entry_points.

DimensionProbe · EdgeCaseProbe · ComplianceProbe · GherkinProbe
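A custom probe wired in through entry_points could look roughly like this. The probe contract assumed here (a `name` plus a `run` method returning findings) is a guess at the plugin interface, not documented md-evals API:

```python
# Sketch of a custom probe; the interface (name + run() returning findings)
# is an assumption about md-evals' plugin contract, not documented API.
from dataclasses import dataclass


@dataclass
class Finding:
    line: int
    message: str


@dataclass
class TodoProbe:
    """Flags leftover TODO markers in a SKILL.md file."""

    name: str = "todo"

    def run(self, skill_text: str) -> list[Finding]:
        findings = []
        for i, line in enumerate(skill_text.splitlines(), start=1):
            if "TODO" in line:
                findings.append(Finding(line=i, message="unresolved TODO"))
        return findings


# Registration would go in your package's pyproject.toml, e.g.:
# [project.entry-points."md_evals.probes"]
# todo = "my_pkg.probes:TodoProbe"
```

The entry-point group name (`md_evals.probes`) is likewise an assumption; check the project docs for the real group.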

Trust & Verification

Citation validation ensures the LLM references specific lines. Gherkin-like eval scenarios define precise Given/When/Then checks.

Citations · Given/When/Then · Line refs
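A Gherkin-like eval scenario of this kind might be written as the following sketch; the field names are illustrative assumptions, not md-evals' documented scenario format:

```yaml
# Illustrative eval scenario; keys are assumptions about the scenario format
scenario: cites-source-lines
given: a SKILL.md file under evaluation
when: the Judge scores the Correctness dimension
then:
  - every claim in the verdict cites a specific line range
  - every cited line range exists in the source file
```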

Ship & Scale

Run eval suites in CI with proper exit codes. Generate static HTML reports. Evaluate entire plugin directories at once.

CI exit codes · HTML reports · Plugin dirs
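Because a failing suite exits nonzero, gating a build is a one-line CI step. A minimal GitHub Actions sketch (the workflow itself is an assumption; only the md-evals commands come from this page):

```yaml
# .github/workflows/skill-evals.yml — illustrative sketch; the nonzero
# exit code on failure is what makes this step gate the build
name: skill-evals
on: [pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install md-evals
      - run: md-evals suite run --config suite.yaml   # nonzero exit fails the job
```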

Analytics Dashboard

Track score trends over time, monitor cost per model, and explore skills × dimensions with interactive heatmaps.

Trends · Cost tracking · Heatmap
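A score trend like "improving ↑" can be thought of as the sign of a least-squares slope over the score history. A minimal sketch of the idea (not md-evals' actual implementation):

```python
# Minimal sketch of trend classification over a score history; an
# illustration of the idea, not md-evals' actual implementation.
def trend(scores: list[float], eps: float = 0.01) -> str:
    """Classify a score series by the sign of its least-squares slope."""
    n = len(scores)
    if n < 2:
        return "flat"
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    den = sum((x - mean_x) ** 2 for x in xs)
    slope = num / den
    if slope > eps:
        return "improving"
    if slope < -eps:
        return "declining"
    return "flat"
```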

Powerful CLI, simple commands

Pre-check skills, run full pipeline evaluations, execute test suites, and track analytics — all from your terminal. Perfect for CI/CD integration.

Get Started · View Source
md-evals — terminal
$ md-evals check SKILL.md
✓ SKILL.md — Pre-check PASSED (12 checks, 0 findings)
$ md-evals run --pipeline --probe edge-case,gherkin
Pipeline: PreCheck → Auditor → Target → Judge
Overall Grade: A (0.87)
$ md-evals suite run --config suite.yaml
backend-skills: 5/5 passed ✓
$ md-evals analytics trends --skill react-19
Trend: improving ↑ (0.72 → 0.87 over 30 days)

Up and running in 60 seconds

Install, initialize, check, and evaluate — four commands to your first AI skill evaluation.

1. Install: install from PyPI
   pip install md-evals

2. Initialize: create config files
   md-evals init

3. Pre-check: validate your skill file
   md-evals check SKILL.md

4. Evaluate: run the full pipeline
   md-evals run --pipeline
quick-start
$ pip install md-evals
Successfully installed md-evals-0.1.0
$ md-evals init
✓ Created eval.yaml
✓ Created rubric.yaml
$ md-evals check SKILL.md
✓ SKILL.md — Pre-check PASSED
$ md-evals run --config eval.yaml --pipeline
Pipeline: PreCheck → Auditor → Target → Judge
████████████████████ 100%
Overall Grade: A (0.87) — PASSED