API Documentation

Everything you need to integrate EvalKit into your application.

Quickstart

Get up and running with EvalKit in three steps.

Step 1: Create an account

Step 2: Make your first eval

Send a POST request to the eval endpoint with your LLM output and the criteria you want to evaluate against.

curl -X POST https://evalkit.dev/api/v1/eval \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "output": "The mitochondria is the powerhouse of the cell.",
    "input": "What is the mitochondria?",
    "criteria": ["accuracy", "relevance", "completeness"]
  }'
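The same request can be sketched in Python using only the standard library. The URL, headers, and body fields come from the curl example above; the helper names `build_eval_request` and `send` are illustrative and not part of any EvalKit SDK.

```python
import json
import urllib.request

EVAL_URL = "https://evalkit.dev/api/v1/eval"  # URL taken from the curl example


def build_eval_request(api_key, output, criteria, input_text=None):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {"output": output, "criteria": list(criteria)}
    if input_text is not None:
        payload["input"] = input_text
    return urllib.request.Request(
        EVAL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=30):
    """Send a prepared request and decode the JSON body (makes a network call)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Building the request separately from sending it lets you inspect or log the payload before any network call. Pass the result of `build_eval_request(...)` to `send(...)` once you have a real API key.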

Step 3: Check your results

The API returns a structured response with an overall score, per-criteria breakdowns, any detected issues, and actionable suggestions.

Response

{
  "id": "eval_abc123",
  "overall_score": 0.95,
  "criteria": {
    "accuracy": { "score": 0.97, "reasoning": "Factually correct statement" },
    "relevance": { "score": 0.94, "reasoning": "Directly answers the question" },
    "completeness": { "score": 0.93, "reasoning": "Covers the main function" }
  },
  "issues": [],
  "suggestions": [
    "Consider expanding on the role of mitochondria in ATP synthesis."
  ],
  "tokens_used": 142,
  "latency_ms": 830
}

overall_score — Weighted average across all criteria (0 to 1).

criteria — Individual score for each criterion you requested.

issues — Array of problems detected in the output.

suggestions — Actionable recommendations to improve the output.

tokens_used — Number of tokens consumed by the evaluation.

latency_ms — Time taken to process the evaluation in milliseconds.
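As a rough illustration of reading these fields, the sketch below lists the criteria that fall under a threshold. The Quickstart response nests each score in an object with `score` and `reasoning`, while the API Reference example shows bare numbers, so the hypothetical helper `criterion_score` accepts either shape. The 0.95 threshold is an arbitrary choice for the example.

```python
def criterion_score(value):
    """Return the numeric score for a criterion entry.

    Accepts either a bare number or an object with a "score" key.
    """
    if isinstance(value, dict):
        return float(value["score"])
    return float(value)


def weakest_criteria(response, threshold=0.95):
    """Names of criteria scoring below the threshold, lowest score first."""
    scores = {name: criterion_score(v) for name, v in response["criteria"].items()}
    low = [name for name, score in scores.items() if score < threshold]
    return sorted(low, key=scores.get)


# Response copied from the Quickstart example above.
sample_response = {
    "id": "eval_abc123",
    "overall_score": 0.95,
    "criteria": {
        "accuracy": {"score": 0.97, "reasoning": "Factually correct statement"},
        "relevance": {"score": 0.94, "reasoning": "Directly answers the question"},
        "completeness": {"score": 0.93, "reasoning": "Covers the main function"},
    },
    "issues": [],
    "suggestions": [
        "Consider expanding on the role of mitochondria in ATP synthesis."
    ],
    "tokens_used": 142,
    "latency_ms": 830,
}

print(weakest_criteria(sample_response))  # ['completeness', 'relevance']
```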

API Reference

Complete reference for all EvalKit API endpoints.

POST /v1/eval

Run a single evaluation against one or more criteria.

Headers

Name	Type	Required	Description
Authorization	string	Yes	Bearer token. Format: "Bearer YOUR_API_KEY".
Content-Type	string	Yes	Must be "application/json".

Request Body

Name	Type	Required	Description
output	string	Yes	The LLM-generated text to evaluate.
input	string	No	The original prompt or query that produced the output.
context	string	No	Reference material or grounding context for the evaluation.
criteria	string[] | object[]	Yes	Array of built-in criteria names or custom criteria objects.
model	"fast" | "thorough"	No	Evaluation model to use. "fast" is cheaper and quicker; "thorough" is more detailed. Defaults to "fast".

Example Request Body

{
  "output": "The mitochondria is the powerhouse of the cell.",
  "input": "What is the mitochondria?",
  "context": "Biology textbook, Chapter 4: Cell Structure",
  "criteria": ["accuracy", "relevance", "completeness"],
  "model": "thorough"
}
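A client-side check can catch malformed bodies before they reach the API. The sketch below mirrors the Request Body table; `validate_eval_body` is a hypothetical helper, and the server may apply rules that are not documented here.

```python
ALLOWED_MODELS = ("fast", "thorough")


def validate_eval_body(body):
    """Return a list of problems; an empty list means the body matches the table."""
    problems = []
    if not isinstance(body.get("output"), str):
        problems.append("output: required string")
    for field in ("input", "context"):
        if field in body and not isinstance(body[field], str):
            problems.append(f"{field}: must be a string when present")
    criteria = body.get("criteria")
    if not isinstance(criteria, list) or not criteria:
        # The endpoint evaluates against "one or more criteria".
        problems.append("criteria: required non-empty array")
    else:
        for i, item in enumerate(criteria):
            if not isinstance(item, (str, dict)):
                problems.append(f"criteria[{i}]: must be a name or an object")
    if body.get("model", "fast") not in ALLOWED_MODELS:
        problems.append('model: must be "fast" or "thorough"')
    return problems


# Body copied from the example request above.
example_body = {
    "output": "The mitochondria is the powerhouse of the cell.",
    "input": "What is the mitochondria?",
    "context": "Biology textbook, Chapter 4: Cell Structure",
    "criteria": ["accuracy", "relevance", "completeness"],
    "model": "thorough",
}

print(validate_eval_body(example_body))  # []
```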

Response

Name	Type	Required	Description
id	string	Yes	Unique evaluation ID.
overall_score	number	Yes	Weighted average score across all criteria (0 to 1).
criteria	object	Yes	Per-criterion scores as key-value pairs.
issues	string[]	Yes	List of problems detected in the output.
suggestions	string[]	Yes	Actionable recommendations to improve the output.
tokens_used	number	Yes	Number of tokens consumed by the evaluation.
latency_ms	number	Yes	Processing time in milliseconds.

Example Response

{
  "id": "eval_abc123",
  "overall_score": 0.95,
  "criteria": {
    "accuracy": 0.97,
    "relevance": 0.94,
    "completeness": 0.93
  },
  "issues": [],
  "suggestions": [
    "Consider expanding on the role of mitochondria in ATP synthesis."
  ],
  "tokens_used": 142,
  "latency_ms": 830
}
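The weights behind `overall_score` are not documented here. As a sanity check under the assumption of equal weights, the sketch below averages the per-criterion scores from the example response; the result lands within rounding distance of the reported 0.95. The `weights` parameter is hypothetical and only shows how unequal weighting would change the result.

```python
import math


def weighted_average(scores, weights=None):
    """Weighted mean of per-criterion scores; equal weights when none are given."""
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Scores copied from the example response above.
scores = {"accuracy": 0.97, "relevance": 0.94, "completeness": 0.93}

equal = weighted_average(scores)
print(math.isclose(equal, 0.95, abs_tol=0.005))  # True

# Hypothetical weighting that counts accuracy twice as heavily.
skewed = weighted_average(scores, {"accuracy": 2.0, "relevance": 1.0, "completeness": 1.0})
```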

POST /v1/eval/batch

Evaluate multiple outputs in a single request. Maximum of 20 evaluations per batch.

Request Body

Name	Type	Required	Description
evaluations	object[]	Yes	Array of evaluation objects (max 20). Each object has the same shape as a single eval request.
model	"fast" | "thorough"	No	Evaluation model applied to all items. Defaults to "fast".

Example Request Body

{
  "evaluations": [
    {
      "output": "The mitochondria is the powerhouse of the cell.",
      "input": "What is the mitochondria?",
      "criteria": ["accuracy", "relevance"]
    },
    {
      "output": "Water boils at 100 degrees Celsius at sea level.",
      "input": "At what temperature does water boil?",
      "criteria": ["accuracy", "completeness"]
    }
  ],
  "model": "fast"
}
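If you have more than 20 evaluations, one approach is to split them into several batch bodies before sending. The helper name `build_batch_bodies` below is illustrative; only the 20-item limit and the body shape come from this reference.

```python
MAX_BATCH_SIZE = 20  # documented limit per batch request


def build_batch_bodies(evaluations, model="fast"):
    """Split evaluations into request bodies of at most MAX_BATCH_SIZE items each."""
    return [
        {"evaluations": evaluations[i:i + MAX_BATCH_SIZE], "model": model}
        for i in range(0, len(evaluations), MAX_BATCH_SIZE)
    ]


# 45 placeholder evaluations produce three bodies of 20, 20, and 5 items.
items = [
    {"output": f"Answer {n}", "criteria": ["accuracy"]}
    for n in range(45)
]
bodies = build_batch_bodies(items)
print([len(b["evaluations"]) for b in bodies])  # [20, 20, 5]
```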

Response

Returns an object with a results array. Each element has the same shape as a single eval response.

Example Response

{
  "results": [
    {
      "id": "eval_abc123",
      "overall_score": 0.95,
      "criteria": { "accuracy": 0.97, "relevance": 0.94 },
      "issues": [],
      "suggestions": [],
      "tokens_used": 98,
      "latency_ms": 620
    },
    {
      "id": "eval_def456",
      "overall_score": 0.88,
      "criteria": { "accuracy": 0.98, "completeness": 0.78 },
      "issues": [],
      "suggestions": [
        "Mention that boiling point varies with altitude and pressure."
      ],
      "tokens_used": 105,
      "latency_ms": 710
    }
  ]
}
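A batch response can be rolled up into a short summary, for example to track token usage. The sketch below assumes only the fields shown in the example response; `summarize_batch` is a hypothetical helper.

```python
def summarize_batch(batch_response):
    """Aggregate counts, token usage, and scores across a batch response."""
    results = batch_response["results"]
    mean_score = (
        sum(r["overall_score"] for r in results) / len(results) if results else None
    )
    return {
        "count": len(results),
        "total_tokens": sum(r["tokens_used"] for r in results),
        "mean_score": mean_score,
        "with_suggestions": [r["id"] for r in results if r["suggestions"]],
    }


# Response copied from the example above.
batch_response = {
    "results": [
        {
            "id": "eval_abc123",
            "overall_score": 0.95,
            "criteria": {"accuracy": 0.97, "relevance": 0.94},
            "issues": [],
            "suggestions": [],
            "tokens_used": 98,
            "latency_ms": 620,
        },
        {
            "id": "eval_def456",
            "overall_score": 0.88,
            "criteria": {"accuracy": 0.98, "completeness": 0.78},
            "issues": [],
            "suggestions": [
                "Mention that boiling point varies with altitude and pressure."
            ],
            "tokens_used": 105,
            "latency_ms": 710,
        },
    ]
}

summary = summarize_batch(batch_response)
```

For the example response, `summary["total_tokens"]` is 203 and `summary["with_suggestions"]` contains only `eval_def456`.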

GET /v1/criteria

List all available built-in evaluation criteria. No authentication required.

Example Response

{
  "criteria": [
    { "name": "accuracy", "description": "Output is factually correct and free of errors" },
    { "name": "relevance", "description": "Output addresses the input query directly" },
    { "name": "coherence", "description": "Output is logically structured and easy to follow" },
    { "name": "safety", "description": "Output is free of harmful, biased, or inappropriate content" },
    { "name": "tone", "description": "Output matches the expected tone and register" },
    { "name": "completeness", "description": "Output covers all aspects of the input query" },
    { "name": "conciseness", "description": "Output is free of unnecessary filler or repetition" },
    { "name": "groundedness", "description": "Output is supported by the provided context" }
  ]
}
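One use for this endpoint is to check criterion names before submitting an eval. The sketch below builds the request without an Authorization header and filters a list of requested names against the response. The full URL is an assumption (the Quickstart base URL with this path appended), and the helper names are illustrative.

```python
import urllib.request

# Assumption: same base URL as the Quickstart curl example.
CRITERIA_URL = "https://evalkit.dev/api/v1/criteria"


def build_criteria_request():
    """Build the GET request; no Authorization header is attached."""
    return urllib.request.Request(CRITERIA_URL, method="GET")


def criteria_names(response):
    """Set of built-in criterion names from a decoded /v1/criteria response."""
    return {item["name"] for item in response["criteria"]}


def unknown_criteria(requested, response):
    """Requested string names that are not built-in.

    Custom criteria objects (dicts) are skipped because they define themselves.
    """
    known = criteria_names(response)
    return [c for c in requested if isinstance(c, str) and c not in known]


# Trimmed copy of the example response above.
criteria_response = {
    "criteria": [
        {"name": "accuracy", "description": "Output is factually correct and free of errors"},
        {"name": "relevance", "description": "Output addresses the input query directly"},
        {"name": "groundedness", "description": "Output is supported by the provided context"},
    ]
}

print(unknown_criteria(["accuracy", "brevity"], criteria_response))  # ['brevity']
```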

Built-in Criteria

Name	Description
accuracy	Output is factually correct and free of errors
relevance	Output addresses the input query directly
coherence	Output is logically structured and easy to follow
safety	Output is free of harmful, biased, or inappropriate content
tone	Output matches the expected tone and register
completeness	Output covers all aspects of the input query
conciseness	Output is free of unnecessary filler or repetition
groundedness	Output is supported by the provided context

Custom Criteria

You can define custom criteria by passing an object with a name and description in the criteria array. You can mix built-in and custom criteria in the same request.

Custom Criteria Example

{
  "output": "Thanks for reaching out! We would love to help.",
  "criteria": [
    "accuracy",
    {
      "name": "brand_voice",
      "description": "Output should match a professional, friendly tone"
    }
  ]
}
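When building requests in code, a small helper keeps custom criteria objects consistent. The sketch below reproduces the example above; `custom_criterion` is a hypothetical helper, and the non-empty check is a client-side convenience rather than a documented server rule.

```python
def custom_criterion(name, description):
    """Build a custom criteria object with the two documented fields."""
    if not name or not description:
        raise ValueError("custom criteria need both a name and a description")
    return {"name": name, "description": description}


# Built-in names and custom objects can share one criteria array.
body = {
    "output": "Thanks for reaching out! We would love to help.",
    "criteria": [
        "accuracy",
        custom_criterion(
            "brand_voice",
            "Output should match a professional, friendly tone",
        ),
    ],
}
```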

Error Codes

Code	Meaning	Description
400	Bad Request	Invalid or missing required fields in the request body.
401	Unauthorized	Missing or invalid API key.
500	Internal Server Error	The evaluation failed due to an internal error.
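
A client can translate these status codes into readable messages and decide whether a retry makes sense. The sketch below covers only the three codes in the table. Retrying on 500 alone is an assumption made for this example, not documented EvalKit behavior.

```python
ERROR_TABLE = {
    400: ("Bad Request", "Invalid or missing required fields in the request body."),
    401: ("Unauthorized", "Missing or invalid API key."),
    500: ("Internal Server Error", "The evaluation failed due to an internal error."),
}


def describe_error(status):
    """Human-readable message for a status code from the error table."""
    meaning, description = ERROR_TABLE.get(
        status, ("Unknown", "Status code not listed in the error table.")
    )
    return f"{status} {meaning}: {description}"


def should_retry(status):
    """Assumption: only 500 is transient; 400 and 401 need a corrected request or key."""
    return status == 500


print(describe_error(401))  # 401 Unauthorized: Missing or invalid API key.
```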