AIb2.io - AI Research Decoded

The Curious Case of the Interview-Scoring Automaton

Task-specific labeled training data for supervised interview-scoring models is the bottleneck this paper attempts to remove, and good heavens, what a bottleneck it is: thousands of carefully scored answers, harvested and labeled by humans, before the machine may even begin its little apprenticeship.

Stockdale, Hickman, and Liu ask a very practical question in The Journal of Applied Psychology: can large language models score employment interviews without being trained from scratch for each new hiring task? In older supervised machine learning systems, the model needed examples: answer, human score, answer, human score, and onward until all parties involved wished to lie down in a darkened room. LLMs arrive with a different promise. They have already read half the internet, some of the other half, and probably the assembly instructions for a toaster in Czech. Perhaps, with the right prompt, they can act as interview raters out of the box.

The authors do not merely release the mechanical bird and declare it sings. They put it in a cage, vary the seed, weigh the feathers, and keep notes.

The Specimen Under Glass

The study tested LLM interview scoring across two datasets. One dataset involved 954 interviewees scored on Big Five personality traits. The other involved 144 interviewees scored on targeted constructs using Behaviorally Anchored Rating Scales, mercifully shortened to BARS, because psychology already has enough syllables wandering around unsupervised.

The researchers varied four design ingredients: prompt design, model selection, hyperparameters such as temperature, and the number of LLM "interviewers" combined into an ensemble. They then inspected the resulting scores like proper psychometric naturalists: reliability, test-retest consistency, convergence with human ratings, discriminant validity, criterion-related validity, group differences, and measurement bias.

That list sounds dry until you remember the setting. These are employment interviews. A score may help decide who gets a job, who gets rejected, and who spends the afternoon refreshing their inbox with the haunted dignity of a Victorian widow awaiting a telegram.

Teaching the Automaton Manners

The paper's most useful finding is that LLM raters behave better when they receive more structure. Larger, newer models, especially when prompted with detailed construct information, produced scores with psychometric properties comparable to or better than supervised ML models and single human raters in some comparisons. Ensembles also helped, which is pleasingly familiar: when one judge may be whimsical, summon a small committee of machines and average their temperaments.
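One reason the committee helps can be put in numbers. The sketch below is not taken from the paper. It averages hypothetical ratings and applies the classic Spearman-Brown formula, which predicts the reliability of a mean of k raters. That formula assumes the raters are interchangeable and make independent errors, an assumption repeated LLM runs may well violate. The function names and the example reliability of 0.50 are invented for illustration.

```python
from statistics import mean


def ensemble_score(ratings):
    """Average one candidate's ratings from several LLM raters."""
    if not ratings:
        raise ValueError("need at least one rating")
    return mean(ratings)


def spearman_brown(single_rater_reliability, k):
    """Predicted reliability of the mean of k parallel raters.

    Assumes interchangeable raters with independent errors.
    """
    r = single_rater_reliability
    return (k * r) / (1 + (k - 1) * r)


# Hypothetical 1-5 ratings from three LLM "interviewers" per candidate.
ratings_by_candidate = {
    "cand_a": [4, 3, 5],
    "cand_b": [2, 3, 2],
}
ensemble = {cid: ensemble_score(r) for cid, r in ratings_by_candidate.items()}

# Predicted reliability for ensembles of 1, 3, and 5 raters,
# starting from an assumed single-rater reliability of 0.50.
predicted = {k: spearman_brown(0.50, k) for k in (1, 3, 5)}
```

Under those assumptions, moving from one rater to three lifts the predicted reliability from 0.50 to 0.75. Whether real LLM ensembles gain that much is an empirical question, which is why the authors measured it rather than assuming it.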

The prompt mattered. Telling the model what construct to score, giving definitions, and including behavioral anchors improved the situation. That matches older interview wisdom: structured interviews generally work better than the "I liked their vibe" school of personnel selection, a school whose mascot is a clipboard with no notes on it.
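To make "more structure" concrete, here is a minimal sketch of a prompt builder that names the construct, supplies a definition, and lists behavioral anchors in score order. The wording, the function name, and the example construct are all hypothetical. The paper's actual prompts are not reproduced here.

```python
def build_scoring_prompt(construct, definition, anchors, answer):
    """Assemble a structured rating prompt.

    anchors maps an integer score to a short behavioral description.
    All phrasing here is illustrative, not the authors' wording.
    """
    if not anchors:
        raise ValueError("need at least one behavioral anchor")
    lines = [
        f"You are rating an interview answer on the construct: {construct}.",
        f"Definition: {definition}",
        "Behavioral anchors:",
    ]
    for score in sorted(anchors):
        lines.append(f"  {score} = {anchors[score]}")
    lines.append("Interview answer:")
    lines.append(answer)
    lines.append(
        f"Reply with a single integer from {min(anchors)} to {max(anchors)}."
    )
    return "\n".join(lines)


# Hypothetical construct and anchors for illustration only.
example_prompt = build_scoring_prompt(
    construct="Teamwork",
    definition="Works cooperatively with others toward shared goals.",
    anchors={
        1: "Describes working alone and dismisses others' input.",
        3: "Describes cooperating when asked.",
        5: "Describes actively coordinating and supporting teammates.",
    },
    answer="I set up a shared tracker so everyone could see open tasks.",
)
```

The design choice worth noticing is that the scale and its anchors travel with every request, so the model is never left to invent its own idea of what a 3 looks like.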

Temperature mattered too. In LLMs, temperature controls how much randomness enters the sampling of each next token. Low temperature makes the model more consistent from run to run; high temperature lets it become a tiny poet with hiring authority, which is precisely the sort of creature one should not place near an applicant tracking system. The authors found lower temperature produced modest gains in reliability and convergent correlations.
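For the mechanically curious, temperature is usually applied by dividing the model's raw scores (logits) by the temperature before turning them into probabilities. The sketch below shows that standard softmax-with-temperature calculation on three made-up logits. It illustrates the general mechanism, not any particular vendor's implementation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the max before exponentiating for numerical stability.
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate next tokens.
logits = [2.0, 1.0, 0.0]
low_temp = softmax_with_temperature(logits, 0.2)   # sharply peaked
high_temp = softmax_with_temperature(logits, 2.0)  # much flatter
```

At the low setting nearly all the probability sits on the top token, so repeated runs tend to agree. At the high setting the three options are much closer together, and the tiny poet gets more room to improvise.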

Psychometrics, Or: Does the Brass Owl Measure What It Claims?

This is where the paper earns its keep. The authors do not ask only, "Can the LLM produce a number?" A vending machine can produce a number if sufficiently provoked. They ask whether that number behaves like a defensible assessment score.

Psychometrics gives us the apparatus. Reliability asks whether the score is stable. Validity asks whether the score supports the interpretation people want to make from it. Construct validity asks whether the score reflects the intended trait rather than some lurking imposter, such as verbosity, accent, confidence, or the candidate's talent for sounding like a LinkedIn post that achieved consciousness.
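Two of these checks reduce to a correlation between columns of scores. As a rough sketch with invented numbers, the code below computes a Pearson correlation between two LLM scoring runs (a test-retest style check) and between LLM and human scores (a convergence check). The paper's own analyses are more elaborate, and the data and variable names here are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for the same six candidates.
llm_run_1 = [4, 2, 5, 3, 1, 4]
llm_run_2 = [4, 3, 5, 3, 2, 4]
human = [5, 2, 4, 3, 1, 3]

test_retest = pearson_r(llm_run_1, llm_run_2)  # stability across runs
convergence = pearson_r(llm_run_1, human)      # agreement with human raters
```

A high correlation on either check is necessary but not sufficient. An LLM could agree with itself beautifully while measuring verbosity, which is why the discriminant and criterion checks exist.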

The results are promising, but not permission to throw human judgment into the sea. The authors found adverse impact concerns: LLM scores sometimes showed larger group differences than human or supervised ML ratings. They also found some evidence of measurement bias, though often with uncertainty overlapping supervised ML models. In plainer language: the machine can be useful, but it must be audited like a suspicious accountant.
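One common way to express a group difference is the standardized mean difference, Cohen's d. The sketch below computes it with a pooled standard deviation for two made-up groups. It is offered as one illustrative audit statistic, not as the authors' exact procedure, and the scores and group labels are hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference (a minus b) using the pooled SD."""
    na, nb = len(group_a), len(group_b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least 2 scores")
    pooled_var = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    if pooled_var == 0:
        raise ValueError("d is undefined when both groups have zero variance")
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Hypothetical scores for two applicant groups under two scoring methods.
llm_scores = {"group_a": [4, 5, 4, 3], "group_b": [3, 4, 3, 2]}
human_scores = {"group_a": [4, 4, 3, 3], "group_b": [3, 4, 3, 3]}

d_llm = cohens_d(llm_scores["group_a"], llm_scores["group_b"])
d_human = cohens_d(human_scores["group_a"], human_scores["group_b"])
```

Comparing the two values for the same applicants is the point of the exercise: if the LLM's gap is noticeably wider than the human raters' gap, that is a flag to investigate before anything is deployed.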

This connects with nearby work. Zhang and colleagues found that LLMs can infer personality from asynchronous video interview responses, but validity, reliability, fairness, and rating patterns all need careful checking. Gaebler and colleagues audited LLMs in hiring-like decisions and found moderate race and gender disparities. Vaishampayan and colleagues, in a Findings of NAACL 2025 paper, compared human and LLM resume matching, another reminder that hiring AI is not a magic lantern. It is a measuring instrument, and measuring instruments require calibration, documentation, and adults in the room.

The Practical Moral

If these findings replicate and generalize, LLM interview scoring could reduce the cost and delay of evaluating open-ended responses. Smaller organizations might use structured scoring without building a bespoke supervised model. Researchers could run richer experiments. Recruiters might get more consistent first-pass evidence.

But the best practice is not "ask ChatGPT who to hire," a phrase that should be sealed in a lead box. The sensible version is narrower: use newer capable models, keep temperature low, provide construct definitions and BARS, average multiple ratings where feasible, validate against human and job-relevant criteria, and check subgroup outcomes before deployment.

The paper's real contribution is not that LLMs can score interviews. It is that LLM raters need their own evaluation design science. We cannot simply staple human-rater rules onto a transformer and call it Tuesday. The specimen is new, lively, and occasionally alarming. Catalog it carefully.

References

Stockdale, K., Hickman, L., & Liu, S. (2026). Scoring employment interviews with large language models: Evaluation design components, validity investigations, and best practice recommendations. The Journal of Applied Psychology. https://doi.org/10.1037/apl0001396

Zhang, T., Koutsoumpis, A., Oostrom, J. K., Holtrop, D., Ghassemi, S., & de Vries, R. E. (2024). Can Large Language Models Assess Personality From Asynchronous Video Interviews? IEEE Transactions on Affective Computing, 15(3), 1769-1785. https://doi.org/10.1109/TAFFC.2024.3374875

Gaebler, J. D., Goel, S., Huq, A., & Tambe, P. (2024). Auditing the Use of Language Models to Guide Hiring Decisions. arXiv:2404.03086. https://arxiv.org/abs/2404.03086

Vaishampayan, S., Leary, H., Alebachew, Y. B., Hickman, L., Stevenor, B. A., Beck, W., & Brown, C. (2025). Human and LLM-Based Resume Matching: An Observational Study. Findings of NAACL 2025. https://doi.org/10.18653/v1/2025.findings-naacl.270

Vaswani, A., et al. (2017). Attention Is All You Need. arXiv:1706.03762. https://arxiv.org/abs/1706.03762

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.