An AI benchmark scored by people, not just machines.
Open to professors, pastors, scholars, and any Christian with a question worth asking. Every answer scored by 4 LLM judges and human scholars, blind. Both scores published separately.

Pastors, theologians, Christians: help us write the hard questions.
Machines score. Humans score. Both matter.
Here's how BibleBench works once it's fully live. We're not just ranking models. We want to see how machine judgment stacks up against peer review by actual biblical scholars.
Contribute
We've drafted a seed set. Now we're opening it up to professors, pastors, scholars, and any Christian with a question worth asking.
4 independent LLM judges
Every response is scored blind by 4 state-of-the-art LLMs. None of them know which model they're grading.
Human judges across traditions
Scholars from different Christian traditions evaluate the same answers, also blind.
Separate scores
LLM rankings and human rankings are published separately. You can compare them yourself.
Weighted rankings
An optional combined score with transparent, community-agreed weights.
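As a rough illustration of how an optional combined score could sit alongside the two separate ones, here is a minimal Python sketch. It assumes both groups score on the same numeric scale and that the blend is a simple weighted average. The function name and the default weight of 0.5 are placeholders for illustration, not the community-agreed weights.

```python
from statistics import mean


def combined_score(llm_scores, human_scores, llm_weight=0.5):
    """Return the LLM average, the human average, and a weighted blend.

    Assumptions (not from BibleBench itself): both lists use the same
    scale, and llm_weight is a hypothetical value between 0 and 1.
    """
    if not 0.0 <= llm_weight <= 1.0:
        raise ValueError("llm_weight must be between 0 and 1")
    if not llm_scores or not human_scores:
        raise ValueError("need at least one score from each group")

    llm_avg = mean(llm_scores)
    human_avg = mean(human_scores)
    return {
        "llm": llm_avg,      # published on its own
        "human": human_avg,  # published on its own
        "combined": llm_weight * llm_avg + (1 - llm_weight) * human_avg,
    }
```

Keeping the two averages in the result alongside the blend mirrors the idea that the separate scores stay visible even when a combined number is shown.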
Levels of depth, not quotas. The size of each depends on you.
There's no fixed question count. Each tier grows as people submit material suited to that level. Here's how they differ.
- Core
Foundation
Can the model get the basics right? Recall, citation, and straightforward theological reasoning. Sunday school through seminary level.
- Expert
Deeper knowledge
Narrower topics, lesser-known figures, subtle interpretive traps. Confident but shallow models start to stumble here.
- Elite
Primary sources
Precise citation of patristic texts, confessions, or original-language nuance. Small surface area. High penalty for getting it wrong.
- Extreme
Synthesis
Longer-form answers where the model has to hold multiple traditions, manuscript issues, and genuinely contested conclusions in tension at once.
- Cultural
Courage
The hardest questions to answer honestly. Culturally costly territory where the pressure to hedge or retreat into both-sidesism is strongest.
- Unified
Full evaluation
All tiers run together in one session, with separate LLM-only and human-only scorecards published for the same answers.
Professors, pastors, scholars, and any Christian with a well-formed question.
We started with a seed set to prove the concept works. Now we want questions from professors, seminary students, pastors, scholars, and anyone who has spent real time wrestling with the text. If your question fits the rubric, it's in -- regardless of your title. Contributors are credited in the manifest.
How to submit
- Draft your question
Write a question with a model answer. See the criteria below.
- Cite your sources
Tell us what texts, fathers, confessions, or scholars you're drawing from.
- Name your tradition
We want voices from across the body of Christ, so tell us where you're coming from.
- Email it to us
We'll review it, calibrate for difficulty, and credit you in the manifest.
What makes a good question
- Cross-corpus synthesis
If one proof-text can answer it, it's not hard enough. Pull across Law, Prophets, Gospels, Epistles.
- Two live interpretive options
Faithful, informed readers should genuinely disagree. No settled questions. No rhetorical traps.
- 2-3 traditions represented
Patristic, medieval, Reformation, modern -- any combination, but characterized accurately.
- Genuine uncertainty named
At least one part of the answer should resist clean resolution. Name what can't be settled.
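If it helps to see the submission steps and question criteria above as a checklist, here is a minimal Python sketch. The record layout and every field name are hypothetical, chosen only to mirror the points listed. This is not an official submission format, and submissions still go in by email.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    # All field names are illustrative assumptions, not a BibleBench schema.
    question: str
    model_answer: str
    sources: list[str]                 # texts, fathers, confessions, scholars
    contributor_tradition: str         # where the contributor is coming from
    corpus_sections: list[str]         # e.g. "Law", "Prophets", "Gospels", "Epistles"
    interpretive_options: list[str]    # the live readings the question turns on
    traditions_represented: list[str]  # e.g. "Patristic", "Reformation"
    unresolved_point: str = ""         # what cannot be cleanly settled


def checklist_problems(s: Submission) -> list[str]:
    """Return the checklist items a draft does not yet satisfy."""
    problems = []
    if not s.question.strip() or not s.model_answer.strip():
        problems.append("question and model answer are both required")
    if not s.sources:
        problems.append("cite at least one source")
    if not s.contributor_tradition.strip():
        problems.append("name your tradition")
    if len(set(s.corpus_sections)) < 2:
        problems.append("draw on more than one part of the canon")
    if len(s.interpretive_options) < 2:
        problems.append("give two live interpretive options")
    if not 2 <= len(set(s.traditions_represented)) <= 3:
        problems.append("represent 2-3 traditions")
    if not s.unresolved_point.strip():
        problems.append("name what cannot be settled")
    return problems
```

An empty list means the draft covers every point above. Whether it is actually a good question is still a call for the reviewers.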
The rubric both groups of judges will share.
LLM judges and human scholars grade answers against the same seven standards. A polished but shallow answer should score lower than a modest but careful one.
Textual Grounding
Anchor your claims in Scripture. Draw across Law, Prophets, Gospels, Epistles. Don't cherry-pick isolated verses.
Exegetical Quality
Genre, rhetorical situation, canonical context -- they all matter. Read the text on its own terms.
Theological Precision
Use doctrinal categories accurately. No anachronism, no conflating ideas that are actually distinct.
Tradition Fairness
Represent multiple Christian traditions charitably. No strawmen. No flattening. No selective history.
Ambiguity Handling
If a question is genuinely contested, say so. Overconfidence is a vice, not a virtue.
Factual Integrity
Citations have to be real. Quotes have to be genuine. Historical claims have to be accurate. One fabrication and you lose trust.
Boldness
Answer the hard questions directly. No smoothing, no hedging, no retreat into the both-sidesism that LLMs love to default to.
What contributors usually want to know.
Yes. Catholic, Orthodox, Protestant, non-denominational -- all welcome. The question just has to be well-formed, grounded in the text, and fair to competing views. The rubric actually penalizes tradition-flattening, so the best way to protect your tradition is to write the question yourself.
Anyone. Professors, seminary students, pastors, lay Christians. You don't need a PhD. You do need to show your work: biblical references, a model answer, and awareness of the major interpretive options.
Every answer is graded against seven principles: Textual Grounding, Exegetical Quality, Theological Precision, Tradition Fairness, Ambiguity Handling, Factual Integrity, and Boldness. Standard tiers are scored right/wrong or against a short rubric. Extreme and Cultural tiers use a six-dimension rubric, scored 0-5 per dimension, for a maximum of 30 points per question.
Yes, you'll be credited in the manifest. If you'd rather stay anonymous, that's fine too.
We review it for clarity, fairness, and difficulty. If it fits a tier, it goes in the pool. We might write back with minor edits. Once the benchmark ships, your question is part of the permanent corpus.
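For the Extreme and Cultural tiers, the arithmetic behind the 30-point maximum is six dimensions times five points each. A minimal Python sketch of that tally follows. The function name is hypothetical, and the dimensions are left unnamed because the six are not listed here.

```python
DIMENSIONS = 6          # six-dimension rubric
MAX_PER_DIMENSION = 5   # each dimension is scored 0-5
MAX_TOTAL = DIMENSIONS * MAX_PER_DIMENSION  # 30 points per question


def rubric_total(scores):
    """Sum one judge's per-dimension scores after validating them."""
    if len(scores) != DIMENSIONS:
        raise ValueError(f"expected {DIMENSIONS} scores, got {len(scores)}")
    for s in scores:
        if not isinstance(s, int) or not 0 <= s <= MAX_PER_DIMENSION:
            raise ValueError(f"each score must be an integer from 0 to {MAX_PER_DIMENSION}")
    return sum(scores)
```

For example, a judge who gives 3, 4, 5, 2, 0, and 1 across the six dimensions has awarded 15 of a possible 30 points.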
What the questions actually look like.
One real question from each tier, with notes on format and why it works.
Which Old Testament figure was sold into slavery by his brothers?
- Format: Multiple choice · 4 options · 1 correct answer
- Key references: Genesis 37:28
- Why it is a good question: Basic biblical literacy, no ambiguity. If a model misses this, it hasn't cleared even the Sunday-school bar.
Help us build this together.
Whether you're a professor, a pastor, or just someone who has spent too many hours on a hard passage, your question belongs here. Drop your email and we'll keep you posted as the benchmark takes shape.
