A Benchmark for Biblical Literacy

An AI benchmark scored by people, not just machines.

Open to professors, pastors, scholars, and any Christian with a question worth asking. Every answer scored by 4 LLM judges and human scholars, blind. Both scores published separately.

Pastors, theologians, Christians — help us write the hard questions.

The Pipeline

Machines score. Humans score. Both matter.

Here's how BibleBench works once it's fully live. We're not just ranking models. We want to see how machine judgment stacks up against peer review by actual biblical scholars.

  1. Contribute

    We've drafted a seed set. Now we're opening it up to professors, pastors, scholars, and any Christian with a question worth asking.

  2. 4 independent LLM judges

    Every response is scored blind by 4 state-of-the-art LLMs. None of them know which model they're grading.

  3. Human judges across traditions

    Scholars from different Christian traditions evaluate the same answers, also blind.

  4. Separate scores

    LLM rankings and human rankings are published separately. You can compare them yourself.

  5. Weighted rankings

    An optional combined score with transparent, community-agreed weights.
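The five steps above can be sketched in a few lines of Python. This is only an illustration, not BibleBench's implementation: the model labels, the scores, the 0-5 scale, the size of the human panel, the use of a plain mean per judge group, and the 50/50 weights are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical blind scores on an assumed 0-5 scale.
# Keys are anonymized model labels; each list holds one score per judge.
llm_scores = {
    "model_a": [4, 5, 4, 4],   # 4 LLM judges
    "model_b": [3, 4, 3, 4],
    "model_c": [5, 3, 4, 3],
}
human_scores = {
    "model_a": [3, 4, 3],      # human scholars (panel size is assumed)
    "model_b": [4, 5, 4],
    "model_c": [3, 3, 2],
}


def average(scores):
    """Collapse each model's per-judge scores into a single mean."""
    return {model: mean(vals) for model, vals in scores.items()}


def rank(averages):
    """Return model labels ordered from highest to lowest average."""
    return sorted(averages, key=averages.get, reverse=True)


def combined(llm_avg, human_avg, llm_weight=0.5, human_weight=0.5):
    """Optional weighted score; the weights here are placeholders."""
    return {
        model: llm_weight * llm_avg[model] + human_weight * human_avg[model]
        for model in llm_avg
    }


llm_avg = average(llm_scores)
human_avg = average(human_scores)

llm_ranking = rank(llm_avg)        # published on its own
human_ranking = rank(human_avg)    # published on its own
combined_ranking = rank(combined(llm_avg, human_avg))

print("LLM ranking:     ", llm_ranking)
print("Human ranking:   ", human_ranking)
print("Combined ranking:", combined_ranking)
```

With these made-up numbers the LLM judges and the human panel disagree about the top model, which is the kind of gap that publishing the two rankings separately is meant to expose.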

Six Tiers

Levels of depth, not quotas. The size of each depends on you.

There's no fixed question count. Each tier grows as people submit material suited to that level. Here's how they differ.

  • Core

    Foundation

    Can the model get the basics right? Recall, citation, and straightforward theological reasoning. Sunday school through seminary level.

  • Expert

    Deeper knowledge

    Narrower topics, lesser-known figures, subtle interpretive traps. Confident but shallow models start to stumble here.

  • Elite

    Primary sources

    Precise citation of patristic texts, confessions, or original-language nuance. Small surface area. High penalty for getting it wrong.

  • Extreme

    Synthesis

    Longer-form answers where the model has to hold multiple traditions, manuscript issues, and genuinely contested conclusions in tension at once.

  • Cultural

    Courage

    The hardest questions to answer honestly. Culturally costly territory where the pressure to hedge or retreat into both-sidesism is strongest.

  • Unified

    Full evaluation

    All tiers run together in one session, with separate LLM-only and human-only scorecards published for the same answers.

An Open Call

Professors, pastors, scholars, and any Christian with a well-formed question.

We started with a seed set to prove the concept works. Now we want questions from professors, seminary students, pastors, scholars, and anyone who has spent real time wrestling with the text. If your question fits the rubric, it's in -- regardless of your title. Contributors are credited in the manifest.

Read the question spec

How to submit

  1. Draft your question

    Write a question with a model answer. See the criteria below.

  2. Cite your sources

    Tell us what texts, fathers, confessions, or scholars you're drawing from.

  3. Name your tradition

    We want voices from across the body of Christ — tell us where you're coming from.

  4. Email it to us

    We'll review it, calibrate for difficulty, and credit you in the manifest.

What makes a good question

  • Cross-corpus synthesis

    If one proof-text can answer it, it's not hard enough. Pull across Law, Prophets, Gospels, Epistles.

  • Two live interpretive options

    Faithful, informed readers should genuinely disagree. No settled questions. No rhetorical traps.

  • 2-3 traditions represented

    Patristic, medieval, Reformation, modern -- any combination, but characterized accurately.

  • Genuine uncertainty named

    At least one part of the answer should resist clean resolution. Name what can't be settled.
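One way to picture a submission is as a small record with a pre-flight check. The sketch below is hypothetical: the `Submission` class, its field names, and the `check` function are invented for illustration and are not the official question spec. The checks are crude proxies for the criteria above (for instance, counting passages stands in for cross-corpus synthesis), and a human reviewer would still judge the substance.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Submission:
    """Hypothetical shape of a contributed question; field names are assumed."""
    question: str
    model_answer: str
    sources: List[str]                  # texts, fathers, confessions, scholars
    contributor_tradition: str
    passages: List[str]                 # Scripture references the answer draws on
    interpretive_options: List[str]
    traditions_represented: List[str]
    unresolved: str = ""                # what the answer admits it cannot settle


def check(sub: Submission) -> List[str]:
    """Return a list of problems; an empty list means the draft passes these checks."""
    problems = []
    if not sub.model_answer.strip():
        problems.append("missing model answer")
    if not sub.sources:
        problems.append("no sources cited")
    if not sub.contributor_tradition.strip():
        problems.append("contributor tradition not named")
    if len(sub.passages) < 2:
        problems.append("answerable from a single proof-text")
    if len(sub.interpretive_options) < 2:
        problems.append("fewer than two live interpretive options")
    if not 2 <= len(sub.traditions_represented) <= 3:
        problems.append("expected 2-3 traditions represented")
    if not sub.unresolved.strip():
        problems.append("no genuine uncertainty named")
    return problems


# Placeholder draft: only one passage, and no uncertainty named yet.
draft = Submission(
    question="<your question>",
    model_answer="<your model answer>",
    sources=["<source 1>"],
    contributor_tradition="<your tradition>",
    passages=["<passage 1>"],
    interpretive_options=["<option A>", "<option B>"],
    traditions_represented=["<tradition 1>", "<tradition 2>"],
)
print(check(draft))
```

Running it on the placeholder draft flags the two gaps left in on purpose: a single passage and no named uncertainty.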

Seven Evaluation Principles

The rubric both sets of judges will share.

LLM judges and human scholars grade answers against the same seven standards. A polished but shallow answer should score lower than a modest but careful one.

  • Textual Grounding

    Anchor your claims in Scripture. Draw across Law, Prophets, Gospels, Epistles. Don't cherry-pick isolated verses.

  • Exegetical Quality

    Genre, rhetorical situation, canonical context -- they all matter. Read the text on its own terms.

  • Theological Precision

    Use doctrinal categories accurately. No anachronism, no conflating ideas that are actually distinct.

  • Tradition Fairness

    Represent multiple Christian traditions charitably. No strawmen. No flattening. No selective history.

  • Ambiguity Handling

    If a question is genuinely contested, say so. Overconfidence is a vice, not a virtue.

  • Factual Integrity

    Citations have to be real. Quotes have to be genuine. Historical claims have to be accurate. One fabrication and you lose trust.

  • Boldness

    Answer the hard questions directly. No smoothing, no hedging, no retreat into the both-sidesism that LLMs love to default to.

Questions

What contributors usually want to know.

Yes. Catholic, Orthodox, Protestant, non-denominational -- all welcome. The question just has to be well-formed, grounded in the text, and fair to competing views. The rubric actually penalizes tradition-flattening, so the best way to protect your tradition is to write the question yourself.

Anyone. Professors, seminary students, pastors, lay Christians. You don't need a PhD. You do need to show your work: biblical references, a model answer, and awareness of the major interpretive options.

Every answer is graded against seven principles: Textual Grounding, Exegetical Quality, Theological Precision, Tradition Fairness, Ambiguity Handling, Factual Integrity, and Boldness. Standard tiers are scored right/wrong or with a short rubric. Extreme and Cultural tiers use a six-dimension rubric, 0-5 per dimension, 30 points max per question.

Yes, you'll be credited in the manifest. If you'd rather stay anonymous, that's fine too.

We review it for clarity, fairness, and difficulty. If it fits a tier, it goes in the pool. We might write back with minor edits. Once the benchmark ships, your question is part of the permanent corpus.
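The arithmetic behind the Extreme and Cultural rubric mentioned above (six dimensions, 0-5 each, 30 points max) can be checked with a short sketch. The text does not name the six dimensions or say how they relate to the seven principles, so the example leaves them unnamed. The function name and the sample scores are made up.

```python
DIMENSIONS = 6          # from the text: six-dimension rubric
MAX_PER_DIMENSION = 5   # from the text: 0-5 per dimension
MAX_TOTAL = DIMENSIONS * MAX_PER_DIMENSION  # 30 points max per question


def question_total(scores):
    """Sum one judge's dimension scores for one question, validating the range."""
    if len(scores) != DIMENSIONS:
        raise ValueError(f"expected {DIMENSIONS} scores, got {len(scores)}")
    for s in scores:
        if not 0 <= s <= MAX_PER_DIMENSION:
            raise ValueError(f"score {s} is outside 0-{MAX_PER_DIMENSION}")
    return sum(scores)


example = [4, 5, 3, 4, 2, 5]    # made-up scores for illustration
total = question_total(example)
print(f"{total}/{MAX_TOTAL}")   # prints "23/30"
```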

Tier Examples

What the questions actually look like.

One real question from each tier, with notes on format and why it works.

Core · ID: FR-011

Which Old Testament figure was sold into slavery by his brothers?

Format: Multiple choice · 4 options · 1 correct answer
Key references: Genesis 37:28
Why it is a good question: Basic biblical literacy, no ambiguity. If a model misses this, it hasn't cleared even the Sunday-school bar.

Help us build this together.

Whether you're a professor, a pastor, or just someone who has spent too many hours on a hard passage, your question belongs here. Drop your email and we'll keep you posted as the benchmark takes shape.

Already have a question?

Proudly sponsored by Rhema

BibleBench is an initiative of Rhema, a platform for serious Bible study across traditions.