API Documentation

Everything you need to integrate EvalKit into your application.

Quickstart

Get up and running with EvalKit in three steps.

Step 1: Create an account

Sign up and get your API key from the dashboard. Your API key will look like ek_live_abc123...

Step 2: Make your first eval

Send a POST request to the eval endpoint with your LLM output and the criteria you want to evaluate against.

curl -X POST https://evalkit.dev/api/v1/eval \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "output": "The mitochondria is the powerhouse of the cell.",
    "input": "What is the mitochondria?",
    "criteria": ["accuracy", "relevance", "completeness"]
  }'
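
The same request can also be prepared from Python. The sketch below uses only the standard library and mirrors the curl command above. The helper names build_eval_request and send are illustrative assumptions, not part of an official EvalKit SDK, and nothing is sent until send is called.

```python
import json
import urllib.request

# Endpoint copied from the curl example above.
EVAL_URL = "https://evalkit.dev/api/v1/eval"


def build_eval_request(api_key, output, criteria, input_text=None):
    """Build (but do not send) the POST request from the curl example.

    Hypothetical helper for illustration; not an official EvalKit function.
    """
    body = {"output": output, "criteria": list(criteria)}
    if input_text is not None:
        body["input"] = input_text
    return urllib.request.Request(
        EVAL_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=30):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect the headers and body before any network call is made.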

Step 3: Check your results

The API returns a structured response with an overall score, a per-criterion breakdown, any detected issues, and actionable suggestions.

Response
{
  "id": "eval_abc123",
  "overall_score": 0.95,
  "criteria": {
    "accuracy": { "score": 0.97, "reasoning": "Factually correct statement" },
    "relevance": { "score": 0.94, "reasoning": "Directly answers the question" },
    "completeness": { "score": 0.93, "reasoning": "Covers the main function" }
  },
  "issues": [],
  "suggestions": [
    "Consider expanding on the role of mitochondria in ATP synthesis."
  ],
  "tokens_used": 142,
  "latency_ms": 830
}

overall_score — Weighted average across all criteria (0 to 1).

criteria — Individual score for each criterion you requested.

issues — Array of problems detected in the output.

suggestions — Actionable recommendations to improve the output.

tokens_used — Number of tokens consumed by the evaluation.

latency_ms — Time taken to process the evaluation in milliseconds.
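
As a rough illustration of reading these fields, the sketch below extracts per-criterion scores from a decoded response. The function names are hypothetical. It accepts both shapes that appear in this documentation: a bare number per criterion, or an object with a "score" key. The weights behind overall_score are not documented here, so the unweighted mean is only a sanity check under an equal-weights assumption.

```python
def criterion_scores(response):
    """Return {criterion_name: score} from a decoded eval response.

    Handles a bare number per criterion as well as an object
    of the form {"score": ..., "reasoning": ...}.
    """
    scores = {}
    for name, value in response["criteria"].items():
        raw = value["score"] if isinstance(value, dict) else value
        scores[name] = float(raw)
    return scores


def weakest_criterion(response):
    """Return the (name, score) pair with the lowest score."""
    scores = criterion_scores(response)
    name = min(scores, key=scores.get)
    return name, scores[name]


def unweighted_mean(response):
    """Equal-weights average, rounded to two decimals (an assumption:
    the actual weighting used for overall_score is not specified)."""
    scores = criterion_scores(response)
    return round(sum(scores.values()) / len(scores), 2)


# Trimmed copy of the Quickstart response above.
sample = {
    "overall_score": 0.95,
    "criteria": {
        "accuracy": {"score": 0.97, "reasoning": "Factually correct statement"},
        "relevance": {"score": 0.94, "reasoning": "Directly answers the question"},
        "completeness": {"score": 0.93, "reasoning": "Covers the main function"},
    },
}

print(weakest_criterion(sample))  # ('completeness', 0.93)
print(unweighted_mean(sample))    # 0.95
```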

API Reference

Complete reference for all EvalKit API endpoints.

POST /v1/eval

Run a single evaluation against one or more criteria.

Headers

Name           Type    Required  Description
Authorization  string  Yes       Bearer token. Format: "Bearer YOUR_API_KEY".
Content-Type   string  Yes       Must be "application/json".

Request Body

Name      Type                 Required  Description
output    string               Yes       The LLM-generated text to evaluate.
input     string               No        The original prompt or query that produced the output.
context   string               No        Reference material or grounding context for the evaluation.
criteria  string[] | object[]  Yes       Array of built-in criteria names or custom criteria objects.
model     "fast" | "thorough"  No        Evaluation model to use. "fast" is cheaper and quicker; "thorough" is more detailed. Defaults to "fast".

Example Request Body
{
  "output": "The mitochondria is the powerhouse of the cell.",
  "input": "What is the mitochondria?",
  "context": "Biology textbook, Chapter 4: Cell Structure",
  "criteria": ["accuracy", "relevance", "completeness"],
  "model": "thorough"
}
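
A client may want to catch obvious mistakes before sending a request. The sketch below assembles a request body from the fields in the table above and applies a few local checks. The function name make_eval_body is hypothetical, and the rule that output and criteria must be non-empty is an assumption: the table only states that they are required.

```python
ALLOWED_MODELS = ("fast", "thorough")  # values listed for the "model" field


def make_eval_body(output, criteria, input_text=None, context=None, model=None):
    """Assemble a /v1/eval request body, omitting optional fields left unset.

    Hypothetical client-side helper; the server remains the source of truth
    for validation.
    """
    if not isinstance(output, str) or not output:
        # Assumption: an empty string is treated like a missing field.
        raise ValueError('"output" is required and must be a non-empty string')
    if not criteria:
        raise ValueError('"criteria" is required and must not be empty')
    if model is not None and model not in ALLOWED_MODELS:
        raise ValueError(f'"model" must be one of {ALLOWED_MODELS}')

    body = {"output": output, "criteria": list(criteria)}
    if input_text is not None:
        body["input"] = input_text
    if context is not None:
        body["context"] = context
    if model is not None:
        body["model"] = model  # when omitted, the API defaults to "fast"
    return body
```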

Response

Name           Type      Required  Description
id             string    Yes       Unique evaluation ID.
overall_score  number    Yes       Weighted average score across all criteria (0 to 1).
criteria       object    Yes       Per-criterion scores as key-value pairs.
issues         string[]  Yes       List of problems detected in the output.
suggestions    string[]  Yes       Actionable recommendations to improve the output.
tokens_used    number    Yes       Number of tokens consumed by the evaluation.
latency_ms     number    Yes       Processing time in milliseconds.

Example Response
{
  "id": "eval_abc123",
  "overall_score": 0.95,
  "criteria": {
    "accuracy": 0.97,
    "relevance": 0.94,
    "completeness": 0.93
  },
  "issues": [],
  "suggestions": [
    "Consider expanding on the role of mitochondria in ATP synthesis."
  ],
  "tokens_used": 142,
  "latency_ms": 830
}

POST /v1/eval/batch

Evaluate multiple outputs in a single request. Maximum of 20 evaluations per batch.

Request Body

Name         Type                 Required  Description
evaluations  object[]             Yes       Array of evaluation objects (max 20). Each object has the same shape as a single eval request.
model        "fast" | "thorough"  No        Evaluation model applied to all items. Defaults to "fast".

Example Request Body
{
  "evaluations": [
    {
      "output": "The mitochondria is the powerhouse of the cell.",
      "input": "What is the mitochondria?",
      "criteria": ["accuracy", "relevance"]
    },
    {
      "output": "Water boils at 100 degrees Celsius at sea level.",
      "input": "At what temperature does water boil?",
      "criteria": ["accuracy", "completeness"]
    }
  ],
  "model": "fast"
}
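
Because a batch is limited to 20 evaluations, a larger workload has to be split across several requests. The sketch below shows one way to do that. Both helper names are hypothetical, and how the resulting bodies are sent is left to the caller.

```python
MAX_BATCH_SIZE = 20  # documented limit for /v1/eval/batch


def chunk_evaluations(evaluations, max_per_batch=MAX_BATCH_SIZE):
    """Split a list of evaluation objects into lists of at most max_per_batch."""
    if max_per_batch < 1:
        raise ValueError("max_per_batch must be at least 1")
    items = list(evaluations)
    return [items[i:i + max_per_batch] for i in range(0, len(items), max_per_batch)]


def make_batch_bodies(evaluations, model=None):
    """Return one batch request body per chunk of evaluations."""
    bodies = []
    for chunk in chunk_evaluations(evaluations):
        body = {"evaluations": chunk}
        if model is not None:
            body["model"] = model  # applied to every item in the batch
        bodies.append(body)
    return bodies
```

For example, 45 evaluation objects produce three request bodies holding 20, 20, and 5 items.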

Response

Returns an object with a results array. Each element has the same shape as a single eval response.

Example Response
{
  "results": [
    {
      "id": "eval_abc123",
      "overall_score": 0.95,
      "criteria": { "accuracy": 0.97, "relevance": 0.94 },
      "issues": [],
      "suggestions": [],
      "tokens_used": 98,
      "latency_ms": 620
    },
    {
      "id": "eval_def456",
      "overall_score": 0.88,
      "criteria": { "accuracy": 0.98, "completeness": 0.78 },
      "issues": [],
      "suggestions": [
        "Mention that boiling point varies with altitude and pressure."
      ],
      "tokens_used": 105,
      "latency_ms": 710
    }
  ]
}
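
One common use of a batch response is to pick out the outputs that need attention. The sketch below is a minimal example of that. The function name and the 0.9 threshold are illustrative assumptions, not values recommended by EvalKit.

```python
def low_scoring_ids(batch_response, threshold=0.9):
    """Return the IDs of results whose overall_score is below threshold."""
    return [
        result["id"]
        for result in batch_response["results"]
        if result["overall_score"] < threshold
    ]


# Trimmed copy of the batch example response above.
batch = {
    "results": [
        {"id": "eval_abc123", "overall_score": 0.95},
        {"id": "eval_def456", "overall_score": 0.88},
    ]
}

print(low_scoring_ids(batch))  # ['eval_def456']
```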

GET /v1/criteria

List all available built-in evaluation criteria. No authentication required.

Example Response
{
  "criteria": [
    { "name": "accuracy", "description": "Output is factually correct and free of errors" },
    { "name": "relevance", "description": "Output addresses the input query directly" },
    { "name": "coherence", "description": "Output is logically structured and easy to follow" },
    { "name": "safety", "description": "Output is free of harmful, biased, or inappropriate content" },
    { "name": "tone", "description": "Output matches the expected tone and register" },
    { "name": "completeness", "description": "Output covers all aspects of the input query" },
    { "name": "conciseness", "description": "Output is free of unnecessary filler or repetition" },
    { "name": "groundedness", "description": "Output is supported by the provided context" }
  ]
}
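
The list returned by this endpoint can be used to check criterion names before submitting an evaluation. The sketch below assumes the response has already been decoded into a dictionary; the function names are hypothetical.

```python
def builtin_names(criteria_response):
    """Return the set of built-in criterion names from a /v1/criteria response."""
    return {item["name"] for item in criteria_response["criteria"]}


def unknown_criteria(requested, criteria_response):
    """Return requested string criteria that are not built-in names.

    Custom criteria objects (dicts) are skipped, since they carry their
    own name and description.
    """
    known = builtin_names(criteria_response)
    return [c for c in requested if isinstance(c, str) and c not in known]
```

A misspelled name such as "acuracy" would be reported by unknown_criteria, while a custom criterion object in the same list would be ignored.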

Built-in Criteria

Name          Description
accuracy      Output is factually correct and free of errors
relevance     Output addresses the input query directly
coherence     Output is logically structured and easy to follow
safety        Output is free of harmful, biased, or inappropriate content
tone          Output matches the expected tone and register
completeness  Output covers all aspects of the input query
conciseness   Output is free of unnecessary filler or repetition
groundedness  Output is supported by the provided context

Custom Criteria

You can define custom criteria by passing an object with a name and description in the criteria array. You can mix built-in and custom criteria in the same request.

Custom Criteria Example
{
  "output": "Thanks for reaching out! We would love to help.",
  "criteria": [
    "accuracy",
    {
      "name": "brand_voice",
      "description": "Output should match a professional, friendly tone"
    }
  ]
}
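
When criteria lists are built programmatically, small helpers can keep custom objects well-formed. The sketch below is one possible approach. The helper names are hypothetical, and the check that name and description are non-empty is an assumption rather than a documented rule.

```python
def custom_criterion(name, description):
    """Build a custom criterion object with the two documented keys."""
    if not name or not description:
        # Assumption: both fields must be non-empty to be useful.
        raise ValueError("custom criteria need a name and a description")
    return {"name": name, "description": description}


def split_criteria(criteria):
    """Separate built-in names (strings) from custom criterion objects."""
    builtin, custom = [], []
    for item in criteria:
        if isinstance(item, str):
            builtin.append(item)
        elif isinstance(item, dict) and {"name", "description"} <= item.keys():
            custom.append(item)
        else:
            raise TypeError(f"unsupported criterion: {item!r}")
    return builtin, custom


criteria = [
    "accuracy",
    custom_criterion("brand_voice", "Output should match a professional, friendly tone"),
]
```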

Error Codes

Code  Meaning                Description
400   Bad Request            Invalid or missing required fields in the request body.
401   Unauthorized           Missing or invalid API key.
500   Internal Server Error  The evaluation failed due to an internal error.
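
A client can translate these status codes into actionable messages. The sketch below is a minimal mapping. The function names are hypothetical, and treating only 500 as retryable is an assumption: this documentation does not describe a retry policy.

```python
# Descriptions copied from the error code table above.
ERROR_DESCRIPTIONS = {
    400: "Bad Request: invalid or missing required fields in the request body.",
    401: "Unauthorized: missing or invalid API key.",
    500: "Internal Server Error: the evaluation failed due to an internal error.",
}


def describe_error(status):
    """Return a human-readable message for an HTTP status code."""
    return ERROR_DESCRIPTIONS.get(status, f"Unexpected HTTP status {status}")


def is_retryable(status):
    """Assumption: 400 and 401 need a fix on the caller's side, so only
    a server-side failure (500) is worth retrying."""
    return status == 500
```

With urllib, the status code is available as the code attribute of urllib.error.HTTPError, so describe_error(exc.code) can be called inside an except block.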