The Hiring Science Studio Vendor Audit Framework.
This is how we test whether an AI hiring vendor's claims are good enough to support real hiring decisions.
What the Vendor Audit Framework does
The framework evaluates AI hiring vendors across the evidence domains that matter in employment selection. It supports employer due diligence, vendor readiness reviews, live system audits, and post-deployment governance.
It complements - but doesn’t replace - legal review, security assessment, or vendor due diligence. It answers a narrower question: whether the tool is good enough to support fair, valid, and defensible hiring decisions.
Three ways that AI hiring tools fail in practice
Constructs that look plausible often do not survive scrutiny.
A tool that claims to measure "potential" or "cultural fit" may be measuring something much narrower - or nothing at all.
Reliability is not validity.
A tool can score candidates consistently and still score the wrong thing. The framework keeps those separate.
Vendor claims outrun vendor evidence.
Procurement processes are not built to interrogate vendor claims.
Our aim is simple: to help employers understand what the tool does, what evidence supports it, what risks remain, and what needs fixing before it earns a place in a hiring process.
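The distinction between reliability and validity can be made concrete with a small worked example. The sketch below is illustrative only: the six candidates, their scores, and the performance ratings are invented, and the `pearson` helper is a hypothetical name, not part of the framework. It shows a tool whose scores barely change between two administrations (high test-retest reliability) yet have almost no relationship to later job performance (weak criterion-related validity evidence).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented data for six candidates (assumption: for illustration only).
tool_run_1 = [52, 61, 70, 78, 85, 93]              # tool scores, first run
tool_run_2 = [54, 60, 72, 77, 86, 92]              # same candidates, second run
job_performance = [3.4, 2.1, 4.0, 2.8, 3.1, 2.9]   # later supervisor ratings

# Reliability: does the tool give the same candidates similar scores twice?
reliability = pearson(tool_run_1, tool_run_2)

# Validity (criterion-related): do the scores track the outcome that matters?
validity = pearson(tool_run_1, job_performance)

print(f"test-retest reliability: {reliability:.2f}")
print(f"criterion validity:      {validity:.2f}")
```

With these invented numbers the first correlation is close to 1 and the second is close to 0. A real audit would need far larger samples and more than one kind of validity evidence; the point here is only that the two questions are computed from different data and can give opposite answers.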
The 10 domains
The framework tests vendors across ten evidence domains. Our methodology evaluates each domain with diagnostic questions, probes, evidence tests, and red flags.
Measurement Purpose and Decision Role
What role does the tool play in the decision, and how much weight does its output carry?
Vendor Expertise and Development Process
Whether the vendor has the assessment, psychometric, and employment-selection expertise to build a valid tool - and maintain it over time.
Scoring Architecture and Rubric Quality
How candidate data is scored, weighted, calibrated, and interpreted - and whether that process is transparent enough to stand behind.
Fairness, Accessibility, and Candidate AI Use
How the tool handles subgroup impact, candidate accommodations, and the possibility that candidates are using AI.
Validation and Evidence Quality
Whether reliability, validity, and local-use evidence are strong enough to support the decisions the tool will inform.
Transparency, Documentation, and Explainability
Whether the vendor can explain the model, inputs, outputs, limits, and evidence.
Technical Reliability and AI Lifecycle
Whether model changes, provider fallback, transcription, and other system variations are version-controlled and tested.
Security, Privacy, and Data Rights
Whether candidate data is protected, minimised, retained appropriately, and contractually clear.
Construct and Job Relevance
Whether the tool measures something coherent, observable, and job-related.
Implementation, Governance, and Monitoring
Whether the employer can govern the tool after deployment, not just buy it.
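As an illustration of the kind of evidence test the fairness domain points to, the sketch below computes selection rates and impact ratios by subgroup. It uses the four-fifths heuristic, a widely used screening threshold for adverse impact, as an assumed example; it is not presented as the framework's own method, and it is a screen rather than a legal conclusion. The group names, counts, and function names are all hypothetical.

```python
def selection_rates(outcomes):
    """Map each group to its selection rate: selected / applicants."""
    return {group: selected / applicants
            for group, (selected, applicants) in outcomes.items()}


def impact_ratios(outcomes):
    """Each group's selection rate divided by the highest group's rate."""
    rates = selection_rates(outcomes)
    top_rate = max(rates.values())
    if top_rate == 0:
        raise ValueError("No group had any selections; ratios are undefined.")
    return {group: rate / top_rate for group, rate in rates.items()}


# Hypothetical counts: (candidates advanced by the tool, candidates assessed).
outcomes = {
    "group_a": (48, 120),  # selection rate 0.40
    "group_b": (27, 90),   # selection rate 0.30
}

ratios = impact_ratios(outcomes)

# Four-fifths heuristic (assumed threshold): flag ratios below 0.8 for review.
flagged = [group for group, ratio in ratios.items() if ratio < 0.8]

for group, ratio in ratios.items():
    print(f"{group}: impact ratio {ratio:.2f}")
print("flagged for review:", flagged)
```

In this invented example `group_b` has an impact ratio of 0.75 and is flagged. Small samples make ratios like this unstable, so a flag is a prompt for closer statistical and practical review, not a finding in itself.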
When a checklist isn’t enough
Three things a checklist cannot do:
Cross‑reference. Our framework reads vendors’ claims and evidence against each other, not in isolation.
Probe for hidden assumptions. We interrogate what the vendor’s answers assume and what they omit.
Stress‑test the decision. We ask: what would have to be true for this tool to be safe to deploy?
A checklist is a useful starting point, but it is only a starting point. The Vendor Audit Framework connects the dots, challenges assumptions, and subjects claims to conditional tests so that employers can be confident that their hiring decisions rest on evidence, not on appearances.
How it differs from generic AI reviews
Many AI frameworks ask whether a system is safe, compliant, secure, and commercially viable. We ask that too. But we also ask the hiring-specific question: is this tool fit to influence a hiring decision?
That means going beyond general AI governance and testing issues such as:
Is the construct psychologically coherent, or is it a label hiding something vaguer?
Is the scoring rubric specific enough for reliable assessment?
Does the validity evidence apply to the role, population, and decision context the employer will use?
Has the tool been tested for LLM-specific scoring errors - prompt opacity, model fallback, transcription bias, calibration drift?
Does human oversight happen at the right point in the decision, not just at the end of it?
Are prompts, models, parameters, and calibration examples version-controlled?
Do model changes or provider fallback routes trigger revalidation?
Does candidate-facing AI change the measurement instrument?
Do transcription, accent, audio quality, or communication style create an unfair disadvantage?
Are vendor claims supported by evidence, not just documentation?
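The questions above about version control and revalidation can be sketched in code. This is a minimal illustration under stated assumptions, not a description of any vendor's system: the configuration fields, model identifiers, and function names are hypothetical. The idea is to fingerprint everything that defines the scoring instrument (model, prompt, parameters, calibration examples) at validation time, then treat any later mismatch, including a silent provider fallback, as a trigger for revalidation.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Stable SHA-256 hash of a scoring configuration.

    Keys are sorted so that field order does not change the fingerprint.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_revalidation(validated_fingerprint: str, live_config: dict) -> bool:
    """True if the live configuration differs from the validated one."""
    return config_fingerprint(live_config) != validated_fingerprint


# Hypothetical configuration captured when the tool was validated.
validated_config = {
    "model": "provider-a/model-v1",
    "prompt": "Score the answer against the rubric.",
    "temperature": 0.0,
    "calibration_examples": ["ex-001", "ex-002"],
}
validated_fingerprint = config_fingerprint(validated_config)

# Hypothetical fallback: same prompt and parameters, different provider model.
fallback_config = dict(validated_config, model="provider-b/model-v2")

print(needs_revalidation(validated_fingerprint, validated_config))
print(needs_revalidation(validated_fingerprint, fallback_config))
```

The first check reports no change and the second reports that revalidation is needed. A fingerprint only detects that something changed; deciding how much revalidation a given change requires remains a judgment for the audit.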