Hiring rubrics → Eval design for hiring agents
What you screen the AI for is the system-level analog of what you screen the candidate for.
The parent rubric screens humans for commercial acumen, Purpose facilitation, and Same-Breath communication craft. The function tests for these dimensions in the interview loop and develops them over the first 12 months on the job.
When an AI hiring agent assesses candidates, the parallel question is: what do you screen the AI for? The eval suite tests the same dimensions the rubric named, plus dimensions the human rubric never had to worry about because humans naturally handle them: disparate impact, calibration drift, adversarial robustness, and coherence with the published policy spec. The eval suite is the AI-systems analog of the hiring loop. Without it, the agent ships untested against the rubric the function thinks it is enforcing.
- Build the held-out validation set before the agent ships. Calibrated humans score the validation set against the rubric. The AI's outputs get compared against this ground truth before launch and quarterly thereafter. The validation set covers the rubric dimensions, demographic diversity, and adversarial cases. No agent ships without a validation set.
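A pre-launch comparison against that ground truth can be sketched as a simple agreement gate. A minimal sketch, assuming rubric scores on a 1-5 scale; the function name, thresholds, and data are all hypothetical:

```python
# Hypothetical launch gate: the agent's scores on the held-out validation
# set must agree with calibrated-human scores within stated tolerances.

def agreement_gate(human, ai, max_mae=0.5, min_within_1=0.9):
    """human, ai: dicts mapping candidate_id -> rubric score (1-5).
    Returns (passes, mean_abs_error, within_1_point_rate)."""
    ids = sorted(human)
    errs = [abs(human[i] - ai[i]) for i in ids]
    mae = sum(errs) / len(errs)
    within = sum(e <= 1.0 for e in errs) / len(errs)
    return (mae <= max_mae and within >= min_within_1, mae, within)

# Toy validation set: four candidates scored by calibrated humans and by the agent.
human = {"c1": 4, "c2": 2, "c3": 5, "c4": 3}
ai    = {"c1": 4, "c2": 3, "c3": 5, "c4": 3}
ok, mae, within = agreement_gate(human, ai)
# ok is True here: MAE 0.25 and every score within one point
```

The same gate reruns quarterly; a version that passed at launch can still fail later as the candidate pool shifts.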
- Run the bias audit pre-deployment with veto authority. Split the validation set by demographic. AI scores must not differ systematically across groups when human scores do not. The HR Trust & Safety team owns the audit and has veto authority on launch. Bias audits run quarterly post-deployment; drift triggers retraining.
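The "must not differ systematically" check can be sketched as a per-group comparison of AI scores against the human baseline. A minimal illustration, not a production fairness test; the threshold and record format are assumptions, and a real audit would also apply a significance test rather than a raw mean gap:

```python
# Hedged sketch: flag demographic groups where the agent's mean score
# diverges from the calibrated-human mean by more than a threshold.
from collections import defaultdict

def audit_by_group(records, max_gap=0.25):
    """records: list of (group, human_score, ai_score).
    Returns {group: mean(ai - human)} for groups exceeding max_gap."""
    by_group = defaultdict(list)
    for group, h, a in records:
        by_group[group].append(a - h)
    flags = {}
    for group, diffs in by_group.items():
        gap = sum(diffs) / len(diffs)
        if abs(gap) > max_gap:
            flags[group] = gap
    return flags

records = [
    ("A", 4, 4), ("A", 3, 3), ("A", 5, 5),
    ("B", 4, 3), ("B", 3, 2), ("B", 5, 4),  # agent scores group B lower
]
flags = audit_by_group(records)
# group "B" is flagged with a gap of -1.0; group "A" is not
```

Note the design choice: the audit compares AI-vs-human gaps, not raw AI scores across groups, so a difference that humans also produce does not trip the veto on its own.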
- Adversarial-test the agent before it sees real candidates. Include candidates using LLM-polished cover letters, candidates with credentials that look strong on paper but fail in interview, candidates with non-traditional backgrounds. The agent must handle each robustly. The adversarial set grows as new gaming patterns appear in the field.
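The adversarial set behaves like a regression suite: each case pairs an input with the behavior the agent must exhibit. A sketch under invented names and a deliberately naive toy agent, to show the harness shape rather than any real scoring model:

```python
# Sketch of an adversarial regression suite (case names and agent invented).

def run_adversarial_suite(agent, cases):
    """cases: list of (name, candidate, check) where check(score) -> bool.
    Returns names of failing cases; an empty list means the agent passes."""
    failures = []
    for name, candidate, check in cases:
        if not check(agent(candidate)):
            failures.append(name)
    return failures

# Toy agent that naively rewards cover-letter length -- exactly the kind
# of behavior an LLM-polished application would exploit.
naive_agent = lambda c: min(5, 1 + len(c["cover_letter"]) // 100)

cases = [
    ("llm_polished_letter",
     {"cover_letter": "x" * 1000, "evidence": "weak"},
     lambda score: score <= 3),  # polish alone must not earn a top score
]
failures = run_adversarial_suite(naive_agent, cases)
# the naive agent fails the llm_polished_letter case
```

New gaming patterns found in the field become new tuples in `cases`, so the suite only grows.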
- Treat agent re-training as a launch, not a patch. Each new training run goes through the same pre-deployment evaluation as the original. No silent updates. Versioning, changelogs, and a deprecation path for the previous version. The function reports agent versions to leadership the same way it reports hiring rubric versions.
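The "launch, not patch" rule reduces to a promotion gate: a retrained version cannot ship until it clears the same checks as the original and carries a changelog entry. A minimal sketch; the check names and function are hypothetical:

```python
# Hedged sketch of a release gate for retrained agent versions.
REQUIRED_CHECKS = {"validation_set", "bias_audit", "adversarial_suite"}

def can_promote(version, passed_checks, changelog):
    """A version promotes only if every required check passed and a
    changelog entry exists -- i.e., no silent updates."""
    return bool(changelog.strip()) and REQUIRED_CHECKS <= set(passed_checks)

can_promote("v2.1", {"validation_set", "bias_audit", "adversarial_suite"},
            "Retrained on Q3 data")          # promotes
can_promote("v2.2", {"validation_set"}, "")  # blocked: missing checks, no changelog
```

Versioning and a deprecation path sit outside this sketch, but the gate is the enforcement point that makes them non-optional.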
- ×Treating the AI like a black box and trusting its outputs because the model is well-trained. The model's outputs are evidence about the model, not evidence about the candidate.
- ×Skipping the bias audit because there are not enough samples per demographic. "There are not enough samples" is a reason to slow the deployment, not to skip the audit.
- ×Deploying without a validation set because building one is expensive. The first lawsuit is more expensive.