Companion paper · The AI-systems layer

The AI-systems layer of the adaptive People function.

How the function operates the agents that mediate its work. Six instruments, six AI analogs, and the operating disciplines required at each.

Catalyst question
Open Question 4
From the parent paper
Author
Rahul Jindal
Drawing on 20 years of driving transformations
What this paper is

The AI Adaptive People Function paper named the operating instruments the function needs. The Roles and Skills companion named who runs them. This paper names what the function does at the AI-systems layer: the layer that emerges when the agents themselves mediate the work the human instruments used to handle.

This is the layer most People functions are about to discover the hard way, because it does not show up in the normal operating cadence until the agent fails in a high-stakes case. By the time you discover the eval suite was missing, the disparate-impact lawsuit is filed. By the time you discover the calibration-drift monitoring was missing, three quarters of comp recommendations have been miscalibrated. By the time you discover the comms gate did not include AI-drafted memos, the RIF announcement landed without a focus-forward paragraph in any region.

Each of the six Foundation interventions from the parent paper has an AI-systems analog. Each analog is a separate operational discipline. Each requires a specific role to own it. Each fails silently if not built. This paper sketches the six analogs at a level of specificity a CHRO can take to a workforce-planning meeting, with the same operator-grade structure the parent paper used: rationale, named moves, failure modes.

The paper assumes the parent paper's framing (EMI, Purpose, Same-Breath, capability supply chain) and the Roles and Skills companion's (eight skill primitives, twelve role evolutions, the reskilling limit). Read both first if you have not.

Why this layer is different

Three properties of the AI-systems layer

Three properties of the AI-systems layer make it different from the human instruments it parallels.

First, it fails silently. A human instrument that breaks down has a face attached to the failure: a manager who is not having Purpose conversations, a calibrator who is rating inconsistently. An AI instrument that drifts produces outputs that look like the instrument is working until you measure carefully. The function that is not measuring will not see the failure until it lands as a lawsuit, a regulatory action, or an attrition spike that someone else points to first.

Second, it requires a different operating cadence. Human instruments run on quarterly review cycles; AI instruments run continuously and need continuous monitoring. The function that runs AI instruments on quarterly review cycles is running them at a hundredth of the resolution they require. The dashboards, the alerts, the on-call rotations, the post-mortems: this is operations engineering applied to the People function, not the other way around.

Third, it requires a partnership the function does not yet have. The People function will not build agents alone. It will partner with engineering, ML, and Trust & Safety. The capability to write requirements, read evals, push back on launches, and own the policy spec lives in the HRBP-PgM-engineer triangle, and that triangle does not exist in most functions today. Standing it up is the first investment.

The six analogs

One AI-systems analog per Foundation intervention

For each of the six Foundation interventions in the parent paper, the AI-systems analog: what the parent did at the human-instrument layer, what the function does when agents mediate the work, the operating moves, and the failure modes that surface only when the agent fails in a high-stakes case.

01
AI-systems analog

Hiring rubrics → Eval design for hiring agents

What you screen the AI for is the AI version of what you screen the candidate for.

Parent instrument

The parent rubric screens humans for commercial acumen, Purpose facilitation, and Same-Breath communication craft. The function tests for these dimensions in the loop and develops them in the first 12 months on the job.

AI-systems layer

When an AI hiring agent assesses candidates, the parallel question is: what do you screen the AI for? The eval suite tests the same dimensions the rubric named, plus dimensions the human rubric never had to worry about because humans naturally handle them: disparate impact, calibration drift, adversarial robustness, and coherence with the published policy spec. The eval suite is the AI-systems analog of the hiring loop. Without it, the agent ships untested against the rubric the function thinks it is enforcing.

Operational moves
  • Build the held-out validation set before the agent ships. Calibrated humans score the validation set against the rubric. The AI's outputs get compared against this ground truth before launch and quarterly thereafter. The validation set covers the rubric dimensions, demographic diversity, and adversarial cases. No agent ships without a validation set.
  • Run the bias audit pre-deployment with veto authority. Split the validation set by demographic. AI scores must not differ systematically across groups when human scores do not. The HR Trust & Safety team owns the audit and has veto authority on launch. Bias audits run quarterly post-deployment; drift triggers retraining. A minimal sketch of this check follows the list.
  • Adversarial-test the agent before it sees real candidates. Include candidates using LLM-polished cover letters, candidates with credentials that look strong on paper but fail in interview, candidates with non-traditional backgrounds. The agent must handle each robustly. The adversarial set grows as new gaming patterns appear in the field.
  • Treat agent re-training as a launch, not a patch. Each new training run goes through the same pre-deployment evaluation as the original. No silent updates. Versioning, changelogs, and a deprecation path for the previous version. The function reports agent versions to leadership the same way it reports hiring rubric versions.
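A minimal sketch, in Python, of the validation-set comparison and disparate-impact check described in the first two moves. The field names, the 1-5 rubric scale, and the tolerance are illustrative assumptions, not the parent paper's spec; the production version runs on hundreds of calibrated cases per group.

  # Minimal sketch: compare agent scores to calibrated human scores on the
  # held-out validation set, overall and split by demographic group.
  # Field names, the 1-5 scale, and the tolerance are illustrative.
  from statistics import mean

  validation_set = [
      {"human": 4, "agent": 4, "group": "A"},
      {"human": 2, "agent": 3, "group": "A"},
      {"human": 5, "agent": 4, "group": "B"},
      {"human": 3, "agent": 2, "group": "B"},
  ]

  def mean_gap(records):
      """Average (agent - human) gap; positive means the agent scores high."""
      return mean(r["agent"] - r["human"] for r in records)

  def disparate_impact_check(records, max_group_spread=0.5):
      """Flag when the agent-vs-human gap differs across groups by more than
      the tolerance agreed with HR Trust & Safety before launch."""
      groups = {g: [r for r in records if r["group"] == g]
                for g in {r["group"] for r in records}}
      gaps = {g: mean_gap(rs) for g, rs in groups.items()}
      spread = max(gaps.values()) - min(gaps.values())
      return {"per_group_gap": gaps, "spread": spread,
              "flag": spread > max_group_spread}

  print("overall gap:", mean_gap(validation_set))
  print(disparate_impact_check(validation_set))

The check itself is a few lines; the expensive part is the calibrated human scores it compares against, which is why the validation set is built before the agent ships, not after.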
Failure modes
  • ×Treating the AI like a black box and trusting its outputs because the model is well-trained. That the model is well-trained is evidence about the model, not about the candidate in front of it.
  • ×Skipping the bias audit because there are not enough samples per demographic. "There are not enough samples" is a reason to slow the deployment, not to skip the audit.
  • ×Deploying without a validation set because building one is expensive. The first lawsuit is more expensive.
02
AI-systems analog

Manager performance management → Agent performance management

Agents that mediate human interactions need the same performance discipline managers do.

Parent instrument

The parent rubric rates managers on Same-Breath communication quality. Quarterly reviews, sample memos, calibrated artifact ratings, redo cycles.

AI-systems layer

AI agents that mediate manager-employee interactions need their own performance rubric. The dimensions transfer; the operationalization is different. Did the agent surface what the team should focus on next when retiring a workflow? Did the agent escalate ambiguity to a human, or guess? Did the agent's response respect the published policy spec? The agent is, in effect, a manager in the machine, and the function's performance discipline has to extend to it. Tolerating drift in agent performance because retraining is expensive is the AI-systems version of letting a manager skip the Same-Breath check because coaching is expensive.

Operational moves
  • Each agent surface gets a quarterly performance review. Same cadence as managers. Same rubric structure: rationale, named criteria, scored binary or on a tight scale, calibrated across reviewers. The performance rubric is published, versioned, and updated when policy changes. Reviewers come from the HR Trust & Safety team and the function partner who owns the agent surface.
  • Agents that do not meet bar get retrained or deprecated. Not improved gradually. The deprecation path is published before the agent ships, and the function uses it. An agent that has been below bar for two quarters and has not improved despite intervention gets pulled. The deprecation is the discipline that keeps the bar real. A minimal sketch of this rule follows the list.
  • Agent performance reports to leadership alongside manager performance. The CPO's quarterly readout includes agent performance metrics alongside manager performance metrics. Same prominence, same accountability. Leadership starts asking about agent performance the same way they ask about manager performance, and the function gets the air cover to enforce the bar.
  • Track output quality, not output volume. Volume metrics (cases handled per hour, comms drafted per week) are vanity. The performance review measures whether the outputs were correct, well-reasoned, and policy-coherent. A high-volume agent producing low-quality outputs is a failure mode, not a success.
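A minimal sketch of the two-quarters-below-bar rule from the second move. The review-history fields and the 3.0 bar are illustrative assumptions; the published rubric sets the real bar.

  # Minimal sketch of the deprecation rule: below bar for two consecutive
  # quarters, intervention already tried, no improvement -> pull the agent.
  # Field names and the bar value are illustrative.
  def deprecation_decision(quarterly_reviews, bar=3.0):
      """quarterly_reviews: dicts ordered oldest -> newest, each with a rubric
      'score' and whether a retraining 'intervention' happened that quarter."""
      if len(quarterly_reviews) < 2:
          return "keep"  # not enough history to apply the rule
      prev, last = quarterly_reviews[-2], quarterly_reviews[-1]
      below_two_quarters = prev["score"] < bar and last["score"] < bar
      intervened = prev["intervention"] or last["intervention"]
      improved = last["score"] > prev["score"]
      if below_two_quarters and intervened and not improved:
          return "deprecate"
      if last["score"] < bar:
          return "retrain"
      return "keep"

  history = [
      {"quarter": "Q1", "score": 2.6, "intervention": True},
      {"quarter": "Q2", "score": 2.4, "intervention": True},
  ]
  print(deprecation_decision(history))  # -> deprecate

The rule is deliberately mechanical; the discretion lives in the review that produces the score, not in whether the rule gets applied.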
Failure modes
  • ×"It is just a tool." The reflexive refusal to apply manager-grade performance discipline to agents. The function would not let a manager run unmeasured for a year; it should not let an agent.
  • ×Tolerating drift because retraining is expensive. The cost of drift compounds quietly; the cost of retraining is visible. The function that optimizes for the visible cost ships drift.
  • ×Mistaking output volume for output quality. Volume looks great on the dashboard until the quality review surfaces what the volume contained.
03
AI-systems analog

Capability building → Operating the AI stack

The function cannot operate what it cannot read.

Parent instrument

The parent paper's six-module curriculum builds HRBP capability: commercial fluency, Purpose facilitation, Same-Breath communication, EMI diagnostic administration, transition-curve conversations, listening telemetry interpretation.

AI-systems layer

The AI-systems layer requires a seventh and an eighth module. Module seven is operating the AI stack: eval design literacy, error-budget reasoning, policy-as-code authoring, bias audit interpretation, agent failure-mode recognition. Module eight is cross-functional collaboration with engineering and ML: writing requirements, reading evals, pushing back on launches, owning the policy spec. These are not optional. The function that does not build the seventh module ships agents it cannot read; the function that does not build the eighth ships agents it cannot push back on.
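Policy-as-code is the least familiar of the Module 7 skills for most HRBPs, so a minimal sketch of what authoring one rule looks like. The leave-of-absence constraint and the field names below are hypothetical illustrations, not a real policy spec.

  # Minimal sketch of policy-as-code: one published policy constraint written
  # as a rule the agent's outputs are checked against. The LOA rule and the
  # field names are hypothetical.
  def loa_policy_rule(case):
      """The agent may auto-approve a leave request only when the statutory
      minimum is met and there is no open accommodation case."""
      violations = []
      if case["approved_days"] < case["statutory_minimum_days"]:
          violations.append("approved leave below statutory minimum")
      if case["auto_approved"] and case["open_accommodation_case"]:
          violations.append("auto-approval with an open accommodation case")
      return violations

  case = {"approved_days": 10, "statutory_minimum_days": 12,
          "auto_approved": True, "open_accommodation_case": False}
  print(loa_policy_rule(case))  # -> ["approved leave below statutory minimum"]

The exit test for the HRBP is not writing elegant code; it is being able to read a rule like this, argue with it, and version it the way the written policy is versioned.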

Operational moves
  • Add Module 7 to the curriculum immediately. Operating the AI stack. Six weeks. Hands-on rotation through the HR Trust & Safety function as part of the module. Practical exit test: each cohort member ships one agent eval suite during the module. Senior leaders complete the module first; the recursion principle from the parent paper holds.
  • Add Module 8 after Phase 1 of the rebuild. Cross-functional collaboration with engineering and ML. Pair with rotations through the partner functions. Practical exit test: each cohort member co-authors an agent requirements document with an engineering counterpart and gets it through a launch review. Module 8 cannot precede Module 7.
  • Hire a small senior cohort externally to seed the capability. The function will not build Module 7 fluency from scratch in a single curriculum cycle. Hire 5-10 senior HR Agent PMs and Trust & Safety leads from outside, run them through Module 7 first as the calibrators, then have them teach the rest of the function. Without external seeding, the curriculum is the blind teaching the blind.
  • Make agent operation part of the operating cadence, not the curriculum cadence. The seventh-module muscle atrophies between cohorts unless the function uses it daily. Every quarterly business review includes agent-performance metrics; every launch review includes evals; every comms-gate sample is reviewed by Module-7-graduate HRBPs. The operating cadence pulls the muscle.
Failure modes
  • ×"We will hire for that capability" rather than build it. Produces a function dependent on a few external hires who become single points of failure. When they leave, the capability leaves with them.
  • ×Treating the curriculum as separate from operating cadence. People learn the vocabulary, not the practice. Six months later they cannot tell you what an error budget for a hiring agent should be.
  • ×Module 7 without a practical exit test. The cohort feels educated; the function gains nothing operational.
04
AI-systems analog

Listening telemetry → Agent-instrumented immune-system signals

Stop polling humans every quarter. Instrument the agents themselves.

Parent instrument

The parent paper added four pulse items that run quarterly: Purpose articulation, Same-Breath when retiring workflows, manager Purpose conversations, and named-next-thing within 30 days.

AI-systems layer

When agents mediate the work, the same signals can be surfaced from agent telemetry continuously. The pulse becomes a real-time dashboard, not a quarterly survey. The dashboard layers underneath the human pulse, not in place of it (the agent telemetry has different blind spots), but the resolution shift is real. The function moves from quarterly snapshots to continuous monitoring, and the operating cadence speeds up to match.

Operational moves
  • Instrument each agent surface with telemetry mapped to immune-system signals. The AI assistant employees use for role conversations is the place to detect Purpose-vs-task framing. The AI that drafts retirement comms is the place to measure Same-Breath first-time accept rate. The AI that supports manager-employee interactions is the place to detect whether conversations surfaced Purpose or stayed at task. The redeployment-desk agent owns the data on whether affected ICs landed a named role.
  • Build a real-time dashboard with action thresholds. Same logic as the parent paper's quarterly thresholds, applied to a continuous signal. When Item 1 drops below the threshold for a team, a Phase 0 audit is triggered automatically. When Item 2 drops, Same-Breath coaching is scheduled with the manager. The dashboard is operational, not informational. A minimal sketch of the threshold logic follows the list.
  • Triangulate against the human pulse quarterly. The agent telemetry will diverge from the human pulse in some teams. Sometimes the agent is right (the pulse is gamed); sometimes the pulse is right (the agent is missing context). The triangulation produces the actual signal. Run the comparison every quarter and publish the gaps.
  • Address privacy and consent up front, not after. Instrumenting employee-facing AI surfaces real privacy questions. The function has to be explicit about what is collected, what it is used for, what is anonymized, what is identifiable. Get this right before the dashboard ships. The cost of getting it right is a few months of legal and ER work; the cost of getting it wrong is the dashboard being shut down by the works council in month four.
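A minimal sketch of the action-threshold logic from the second move. The signal names, threshold values, and actions are illustrative assumptions, not the parent paper's thresholds.

  # Minimal sketch of action thresholds on continuous pulse signals.
  # Signal names, threshold values, and actions are illustrative.
  THRESHOLDS = {
      "purpose_articulation": {"min": 0.70, "action": "trigger Phase 0 audit"},
      "same_breath_accept":   {"min": 0.80, "action": "schedule Same-Breath coaching"},
  }

  def check_team(team, signals):
      """signals: rolling rate per signal for one team, in the range 0.0-1.0."""
      alerts = []
      for name, rule in THRESHOLDS.items():
          if signals.get(name, 1.0) < rule["min"]:
              alerts.append(f"{team}: {name}={signals[name]:.2f} -> {rule['action']}")
      return alerts

  print(check_team("team-42", {"purpose_articulation": 0.62,
                               "same_breath_accept": 0.85}))

A threshold is operational only when each alert maps to a named intervention and a named owner; a threshold without an action is informational, which is the failure mode named below.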
Failure modes
  • ×Replacing the human pulse entirely. The agent telemetry has different blind spots. Triangulation, not replacement.
  • ×Building the dashboard without acting on it. The dashboard runs, the numbers are reported, no intervention follows. The signal erodes; the team stops believing the dashboard.
  • ×Privacy issues blowing up the rollout in month four. Fix this in month one.
05
AI-systems analog

Comms governance → AI-drafted comms gate

AI drafts increasingly mediate transformation comms. The gate has to mediate the AI.

Parent instrument

The parent paper defined a Same-Breath gate: every transformation memo passes through a one-criterion review (focus-forward in the same paragraph as retirement) within 24 hours. The People function takes editorial ownership.

AI-systems layer

When AI drafts the memos, the gate becomes an automated check plus sampled human review. The classifier is itself versioned, evaluated, and drift-monitored. The highest-stakes comms (executive announcements, RIF, compensation policy changes) require human authorship; AI suggests, does not draft. The function that lets AI draft RIF announcements without a human author has not yet understood what the gate is for.

Operational moves
  • Ship the classifier before the AI drafting agent. No memo gets drafted by AI without the gate already in place and operating. The classifier checks the same criterion the human gate checks: focus-forward in the same paragraph as retirement. The classifier ships with a held-out evaluation set scored by calibrated humans, and the accuracy bar is published. A minimal sketch of the gate criterion follows the list.
  • Sample human review of AI-drafted comms quarterly. 5% of AI-drafted comms get human eyes every quarter. The sample is stratified to include the highest-stakes outputs. The review feeds two outputs: classifier accuracy assessment, and AI agent training data. Failures in the sample are flagged back to the agent's next training cycle.
  • Mandate human authorship for the highest-stakes comms. Executive announcements, reductions in force, compensation policy changes, harassment investigation outcomes, leadership transitions. AI may suggest; AI does not draft. The author is named. The author is accountable. The gate still applies.
  • Track AI-drafted comms separately from human-drafted in the gate dashboard. The function will want to see whether AI-drafted comms are passing the gate at the same rate as human-drafted, drifting, or improving. Separate tracking is the only way to answer that question. Aggregating loses the signal.
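A minimal sketch of the gate criterion as an automated check. The cue phrases below are toy stand-ins; the production classifier is a trained and evaluated model with a published accuracy bar, as the first move describes.

  # Minimal sketch of the Same-Breath gate applied to an AI-drafted memo:
  # every paragraph that announces a retirement must also name what the team
  # focuses on next. Cue lists are toy stand-ins for the real classifier.
  RETIREMENT_CUES = ("retiring", "sunsetting", "winding down", "deprecating")
  FOCUS_FORWARD_CUES = ("will focus on", "next priority", "moving to", "now owns")

  def same_breath_gate(memo_text):
      """Return paragraphs that mention a retirement but carry no
      focus-forward statement. An empty list means the memo passes."""
      failures = []
      for para in memo_text.split("\n\n"):
          lowered = para.lower()
          mentions_retirement = any(cue in lowered for cue in RETIREMENT_CUES)
          names_next = any(cue in lowered for cue in FOCUS_FORWARD_CUES)
          if mentions_retirement and not names_next:
              failures.append(para[:60])
      return failures

  draft = ("We are retiring the manual intake workflow at the end of Q3.\n\n"
           "The team will focus on exception handling and vendor escalations.")
  print(same_breath_gate(draft))  # flags the first paragraph

The structure of the gate, a paragraph-level check with flagged failures routed to a human editor, stays the same when the cue lists are replaced by the evaluated classifier.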
Failure modes
  • ×Letting AI draft RIF or compensation comms without a human author. This will produce a memo that is policy-coherent, well-structured, and emotionally illiterate. The recovery cost is months.
  • ×Classifier drift without retraining. The gate becomes ceremonial. First-time accept rates rise because the classifier got loose, not because the comms got better. The function celebrates a metric that has stopped meaning what it used to mean.
  • ×Treating "auto-approved by classifier" as identical to "human-approved". The classifier is a filter, not a substitute. The 5% sample is the discipline that keeps it honest.
06
AI-systems analog

Talent calibration → AI-augmented calibration with bias auditing

AI scoring is a tool the calibrator uses, not a replacement for the calibrator.

Parent instrument

The parent paper added three signals to the calibration rubric: Purpose articulation under pressure, Same-Breath communication under stress, voluntary self-retirement of an owned workflow. Reviewers calibrate together; high-potential lists shift; comp catches up over two cycles.

AI-systems layer

When AI scores any of these signals, the function needs four operational disciplines that did not exist when humans scored everything. Pre-deployment bias audit on the AI scorer. Continuous drift monitoring on calibration outputs. Veto authority for the HR Trust & Safety team on what the scorer ships. A human override pathway with an audit trail; reviewers can override AI scores, the override pattern is data, and the override-rate trend is itself a signal of growing drift.

Operational moves
  • Pilot AI scoring on the signal with the cleanest ground truth. Same-Breath communication quality, scored from artifacts (memos, transcripts), is the cleanest because the artifact is the artifact. Purpose articulation under pressure is harder; self-retirement of a workflow is hardest. Start where the signal is cleanest, validate the AI scorer, and only extend to harder signals once the operating discipline is built.
  • Run quarterly comparisons between AI scores and calibrated human scores. The gap between AI and calibrated human scores is the drift signal. A constant 5% gap in either direction is tolerable once it is understood and corrected for; a growing gap is the signal to retrain. The comparison is operational, not academic: reviewers see the gaps, and the function acts on them. A minimal sketch of these drift signals follows the list.
  • Publish disparate-impact metrics to leadership. The AI scorer's outputs split by demographic. The calibration is a public artifact within the leadership team. If demographic differences appear that human scores did not produce, the function knows. If the AI scorer is producing more equitable outputs than human scoring did, the function also knows; that is information the function should have.
  • Treat the human override rate as a leading indicator. Reviewer override rate going up over time without retraining is the signal of growing drift. Override rate spiking on a specific demographic is the signal of growing bias. The override rate is a free, continuously available signal; the function that does not watch it loses its earliest warning.
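A minimal sketch of the two drift signals from the second and fourth moves: the AI-versus-human gap per quarter and the override-rate trend. Field names and thresholds are illustrative assumptions.

  # Minimal sketch of two calibration-drift signals: the quarterly gap between
  # AI and calibrated human scores, and the reviewer override-rate trend.
  # Thresholds and field names are illustrative.
  from statistics import mean

  def quarterly_gap(pairs):
      """pairs: (human_score, ai_score) for the quarter's calibrated sample."""
      return mean(ai - human for human, ai in pairs)

  def drift_flags(gap_by_quarter, override_rate_by_quarter,
                  gap_growth=0.03, override_growth=0.05):
      """A constant gap is tolerable; a growing gap or a rising override rate
      is the retraining signal."""
      flags = []
      if gap_by_quarter[-1] - gap_by_quarter[0] > gap_growth:
          flags.append("AI-vs-human gap is growing: schedule a retraining review")
      if override_rate_by_quarter[-1] - override_rate_by_quarter[0] > override_growth:
          flags.append("override rate is rising: earliest warning of drift")
      return flags

  q3_sample = [(4, 4.2), (3, 3.4), (5, 4.9)]
  print("Q3 gap:", round(quarterly_gap(q3_sample), 2))
  print(drift_flags(gap_by_quarter=[0.05, 0.06, 0.11],
                    override_rate_by_quarter=[0.08, 0.10, 0.15]))

Splitting the same two signals by demographic group produces the disparate-impact and override-spike views the third and fourth moves call for.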
Failure modes
  • ×AI scoring without veto authority for HR Trust & Safety. Produces a calibration system that runs without a brake. The first incident is the brake.
  • ×Treating the AI as more objective than humans. It inherits biases from the training data, plus correlations the training data did not flag. "Objective" is what the AI looks like; "biased in different ways" is what it actually is.
  • ×Override rate going up over time without anyone noticing. Override is a manual workaround; if the workaround becomes the dominant mode, the AI scorer has stopped working and the function did not see it.
Cross-cutting themes

Four themes that run through all six analogs

The HR-eng-T&S triangle is the new unit of operation

All six AI-systems analogs require a partnership between HR (owns policy and operating discipline), engineering (builds the agent and the eval infrastructure), and Trust & Safety (audits bias, owns the veto, runs the red team). No single function can ship the analog alone. The first investment is standing up the triangle as a real operating unit, with shared OKRs, shared launch reviews, and shared on-call.

The operating cadence speeds up by an order of magnitude

Human instruments run on quarterly review cycles. AI instruments run continuously. The dashboards, the alerts, the on-call rotations, the post-mortems: this is operations engineering applied to the People function. The function that runs AI instruments on quarterly cycles is running them at a hundredth of the resolution they require.

Veto authority on launches is the load-bearing institutional design

The thing that distinguishes a function that has built the AI-systems layer from one that has not is whether HR Trust & Safety has veto authority on agent launches. Not advisory input; veto. The function that ships agents over T&S objections has built the layer in name only. The first time T&S vetoes a launch and leadership backs the veto is when the layer becomes real.

Failure-mode rehearsal is part of the operating discipline

Each agent surface should have a documented failure-mode catalog: what the agent has failed at in the past, what it might fail at in the future, what the recovery looks like. The catalog is updated quarterly. New failures get added; recovered failures stay in the catalog. The function rehearses recovery. The catalog is what separates the function that responds to its first incident in three days from the one that responds in three weeks.
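What one catalog entry might look like, sketched as structured data. The fields and the example failure are illustrative; the schema matters less than the habit of keeping the entry versioned and rehearsed.

  # Minimal sketch of one failure-mode catalog entry for an agent surface.
  # The fields and the example failure are illustrative, not a mandated schema.
  catalog_entry = {
      "agent": "loa-routing-agent",
      "failure": "misroutes statutory leave cases in a newly added jurisdiction",
      "first_observed": "quarter the failure was first seen",
      "detection_signal": "jurisdiction-level escalation rate above threshold",
      "recovery": [
          "freeze auto-routing for the affected jurisdiction",
          "route queued cases to the human LOA desk",
          "retrain against a jurisdiction-specific validation set before re-enable",
      ],
      "last_rehearsed": "quarter the recovery was last walked through",
      "status": "recovered, retained in the catalog",
  }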

A 24-month sequence

For a CHRO building the AI-systems layer

The sequence runs in parallel with the role-rebuild sequence in the Roles and Skills companion. The triangle stands up first; the easiest agent surface ships next; the discipline cadences come online over months 9-15; the harder agents come online in months 15-24 with the full discipline already in place.

Phase 0 · Months 0-3
Stand up the triangle, ship Module 7

Hire the senior HR Agent PM, the HR Trust & Safety lead, and the HR Policy Engineer (all three already named in the Roles and Skills companion). Stand up the HR-eng-T&S triangle as a real operating unit. Ship Module 7 of the curriculum to senior leaders first. No agent ships in this phase.

Phase 1 · Months 3-9
Eval suite for the first agent surface, comms gate, agent telemetry

Build the eval suite for the first agent (likely the Leave-of-Absence agent or routine case routing, where the political cost is lowest). Ship the AI comms-gate classifier before the AI drafting agent. Instrument the first agent surface with immune-system telemetry. The dashboard goes live with action thresholds.

Phase 2 · Months 9-15
Performance management and bias audit cadence

Quarterly agent performance reviews running. Quarterly bias audits running. Veto authority for HR Trust & Safety established and tested (the first veto is the test). Module 8 ships to the function. Cross-functional partnership with engineering and ML moves from artisanal to systemic.

Phase 3 · Months 15-24
AI-augmented calibration and the harder agents

Pilot AI scoring on Same-Breath communication quality. Quarterly comparisons against calibrated human scores. Disparate-impact metrics published to leadership. The harder agent surfaces (comp recommendation, perf signal, hi-po identification) come online with the full discipline already in place. The function operates the AI stack as routine, not heroically.

Open questions

What this paper does not yet answer

Five questions where the framework runs out of confident ground. Naming them so they are visible.

What is the right cadence for retraining agents in production?

Quarterly aligns with human review cycles but may be too slow for agents that drift faster than that. Monthly burns budget on retraining the agents that are not drifting. The right answer is probably per-agent, per-signal, with the cadence set by observed drift rate. The literature is thin and the practice is artisanal.

How does the function handle agents that span jurisdictions with different policy regimes?

An LOA agent operating in 50 jurisdictions has 50 sets of policy constraints. The eval suite has to handle each. The performance review has to detect jurisdiction-specific drift. The bias audit has to work across regulatory regimes that define protected classes differently. This is a paper of its own.

What is the right ratio of internal to external hires for the HR Agent PM and HR Trust & Safety roles?

The Roles and Skills companion noted T&S itself does not have a settled answer. The HR-specific answer is even less settled. The pattern from T&S is roughly 60-70% external, 30-40% internal mobility from product-adjacent roles, but whether that holds for HR is open.

How does the function instrument AI-mediated conversations without breaking trust?

Continuous telemetry from employee-facing AI is the highest-resolution signal the function will ever have. It is also the most invasive. The functions that run this layer well will be the ones that get the consent design right, not the ones that get the telemetry right. The consent-design problem deserves its own paper.

When does the function deprecate an agent rather than retrain?

An agent that has been below performance bar for two quarters and has not improved despite intervention should be pulled. The framework is in this paper. The hard part is the political reality: the engineering team that built the agent does not want it deprecated, and the function partner who pushed for the agent does not want to admit it was the wrong investment. The deprecation discipline is institutional, not technical, and the literature is thin.

Building the AI-systems layer?

The framework is operator-grade today. If you are designing the rebuild and the framing here lands, the next step is a conversation.

Catalyst question: Open Question 4 from AI Adaptive People Function. Author: Rahul Jindal. Companion to AI Adaptive People Function (Layer 5 of the Adaptive Org transformation stack).