"Standards are not bureaucracy. They are accumulated wisdom that prevents repeating avoidable mistakes."
Principal Engineer Who Wrote Too Many Post-Mortems
Purpose of Internal Standards
As organizations scale AI product development, repeated decisions emerge. How should teams evaluate AI quality? What documentation is required before launch? How should AI failures be classified and responded to? Without shared standards, each team invents its own approaches, leading to inconsistency, knowledge loss, and preventable failures.
Internal standards capture organizational learning and make it reusable. They reduce decision fatigue, enable cross-team collaboration, and provide a foundation for governance.
Evaluation Standards
Chapter 21 introduced evaluation frameworks. This section covers organizational standards for evaluation practice.
Baseline Evaluation Standards
Define minimum evaluation requirements for AI features:
Minimum Eval Requirements
Unit evals: Every AI feature must have automated tests that verify core behaviors
Integration evals: Every AI feature must have tests that verify behavior in realistic workflows
User-facing evals: Every AI feature must have human evaluation of sample outputs
Production monitoring: Every production AI feature must have metrics that detect degradation
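The four requirements above can be tracked as a simple per-feature record. This is a minimal sketch (the class and field names are assumptions, not an organizational standard) showing how a team might surface which baseline requirements a feature still lacks:

```python
from dataclasses import dataclass

@dataclass
class EvalCoverage:
    """Hypothetical record of the four baseline eval requirements for one AI feature."""
    has_unit_evals: bool
    has_integration_evals: bool
    has_human_eval: bool
    has_production_monitoring: bool

    def missing_requirements(self) -> list[str]:
        """Return the names of any unmet baseline requirements."""
        checks = {
            "unit evals": self.has_unit_evals,
            "integration evals": self.has_integration_evals,
            "user-facing evals": self.has_human_eval,
            "production monitoring": self.has_production_monitoring,
        }
        return [name for name, met in checks.items() if not met]

coverage = EvalCoverage(True, True, False, True)
print(coverage.missing_requirements())  # ['user-facing evals']
```

A record like this makes the standard auditable: a dashboard can list every feature with a non-empty result.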
Metric Standards
Establish standard metrics for common AI task types. Standard metrics enable cross-team comparison and shared dashboards. For classification, use accuracy, precision, recall, and F1 by category. For generation, use task completion rate, quality ratings, and human preference. For extraction, use entity-level F1 and relationship accuracy. For recommendation, use click-through rate, conversion rate, and engagement metrics.
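A shared metric registry is one way to make these standards concrete. The sketch below is illustrative (the registry and metric identifiers are assumptions); the point is that looking up an unknown task type fails loudly rather than letting a team silently skip the standard:

```python
# Hypothetical shared registry of standard metrics per AI task type.
STANDARD_METRICS = {
    "classification": ["accuracy", "precision", "recall", "f1_by_category"],
    "generation": ["task_completion_rate", "quality_rating", "human_preference"],
    "extraction": ["entity_level_f1", "relationship_accuracy"],
    "recommendation": ["click_through_rate", "conversion_rate", "engagement"],
}

def required_metrics(task_type: str) -> list[str]:
    """Look up the standard metric set for a task type.

    Raises ValueError for unknown types so gaps in the standard
    surface immediately instead of being worked around.
    """
    if task_type not in STANDARD_METRICS:
        raise ValueError(f"No metric standard defined for task type: {task_type}")
    return STANDARD_METRICS[task_type]
```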
Documentation Standards
AI features require documentation that traditional software does not:
Model and System Cards
Model cards document AI system characteristics for internal and external stakeholders:
A model card covers intended use (what the system is designed to do), known limitations (where the system struggles or may fail), training data (what data the system learned from), performance characteristics (how well the system performs on standard benchmarks), and failure modes (known failure patterns and their frequency).
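The five sections above map naturally onto a structured template. This is a minimal sketch (field names are assumptions) of a model card record with a completeness check suitable for a pre-launch documentation gate:

```python
from dataclasses import dataclass, fields

@dataclass
class ModelCard:
    """Hypothetical model card template covering the five required sections."""
    intended_use: str
    known_limitations: str
    training_data: str
    performance_characteristics: str
    failure_modes: str

    def incomplete_sections(self) -> list[str]:
        """Names of sections left empty, for a pre-launch documentation check."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]
```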
AI Feature Launch Checklist
Require completion of standard checklist before AI feature launch. The checklist includes defining and automating the eval framework, ensuring eval results meet minimum thresholds, completing human evaluation, identifying and documenting failure modes, defining a rollback plan, training the support team on common issues, establishing monitoring dashboards, and completing model card or system documentation.
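Embedding the checklist in the deployment pipeline, as in the DataForge example below, can be as simple as a gate that refuses to pass until every item is signed off. A minimal sketch (item identifiers are assumptions, not DataForge's actual checklist):

```python
# Hypothetical launch checklist items, mirroring the requirements above.
LAUNCH_CHECKLIST = [
    "eval_framework_automated",
    "eval_thresholds_met",
    "human_evaluation_complete",
    "failure_modes_documented",
    "rollback_plan_defined",
    "support_team_trained",
    "monitoring_dashboards_live",
    "model_card_complete",
]

def can_deploy(completed: set[str]) -> tuple[bool, list[str]]:
    """CI/CD gate sketch: allow deployment only when every item is signed off.

    Returns (allowed, missing_items) so the pipeline can report
    exactly what still blocks the launch.
    """
    missing = [item for item in LAUNCH_CHECKLIST if item not in completed]
    return (len(missing) == 0, missing)
```

Returning the missing items, rather than a bare boolean, keeps compliance the path of least resistance: the failing pipeline tells the team exactly what to finish.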
Practical Example: DataForge Launch Checklist Implementation
Who: DataForge engineering leadership standardizing AI feature launches
Situation: Teams were launching AI features with inconsistent evaluation, leading to quality issues that reached customers
Problem: No common standard for what "good enough" meant or what verification was required
Solution: Created mandatory launch checklist with security, legal, and engineering requirements. Integrated into CI/CD pipeline so features cannot deploy without checklist approval.
Result: Customer-facing AI incidents dropped 70% in first year. Cross-team collaboration improved as teams shared evaluation approaches.
Lesson: Standards embedded in tooling get followed. Make compliance the path of least resistance.
Governance Frameworks
Governance ensures AI products meet organizational standards for quality, safety, and ethics. Effective governance is proportionate to risk and enables rather than blocks development.
Tiered Governance by Risk
Apply governance controls proportionate to AI impact. Low-risk AI follows standard development practices, automated testing, and monitoring. Medium-risk AI requires additional human review, enhanced documentation, and staged rollout. High-risk AI requires a mandatory review board, extensive testing, limited rollout, and ongoing monitoring.
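The tier-to-controls mapping above can be encoded directly, which keeps governance requirements in version control alongside the features they govern. A sketch under assumed names, with unknown tiers defaulting to the strictest controls:

```python
# Hypothetical mapping from risk tier to required governance controls.
GOVERNANCE_CONTROLS = {
    "low": ["standard development practices", "automated testing", "monitoring"],
    "medium": ["additional human review", "enhanced documentation", "staged rollout"],
    "high": ["mandatory review board", "extensive testing", "limited rollout",
             "ongoing monitoring"],
}

def controls_for(risk_tier: str) -> list[str]:
    """Return the required controls for a tier.

    Unknown or unclassified tiers fall back to the high-risk controls,
    so an unlabeled feature is never under-governed by accident.
    """
    return GOVERNANCE_CONTROLS.get(risk_tier, GOVERNANCE_CONTROLS["high"])
```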
AI Risk Classification
Classify AI risk along two dimensions. Decision consequence asks what the impact is if the AI makes a wrong decision. Reversibility asks whether wrong decisions can be easily reversed or corrected.
Risk Matrix
High consequence + Low reversibility: Highest risk. Requires extensive validation, human oversight, and monitoring. Examples: medical diagnosis, financial decisions, legal advice.
High consequence + High reversibility: High risk but manageable. Requires validation and monitoring but can iterate faster. Examples: content recommendations, search ranking.
Low consequence: Lower risk. Standard practices apply. Examples: auto-complete, spam detection, image tagging.
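The matrix above reduces to a small classification function. This sketch assumes each dimension has been judged as simply "high" or "low" (a simplification of real-world assessments):

```python
def risk_tier(consequence: str, reversibility: str) -> str:
    """Classify AI risk from the two dimensions in the risk matrix.

    consequence: 'high' or 'low' impact if the AI decides wrongly.
    reversibility: 'high' or 'low' ease of correcting a wrong decision.
    """
    if consequence == "low":
        return "lower"    # standard practices apply (auto-complete, spam detection)
    if reversibility == "low":
        return "highest"  # e.g. medical diagnosis, financial decisions, legal advice
    return "high"         # e.g. content recommendations, search ranking
```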
AI Review Boards
Chapter 28 introduced AI review boards as organizational structures. This section covers the standards and processes they apply.
Review Board Charter
Define the scope and authority of review boards clearly. Scope determines which AI features require review and what triggers mandatory review. Authority determines whether the board can block launches and what escalation path exists. Membership determines who serves on the board and what expertise is required. Process determines how review works and what the timeline is.
Review Criteria
Standardize what review boards evaluate. Safety asks whether this AI could cause harm if it fails. Fairness asks whether this AI could perform differently across user groups. Privacy asks whether this AI handles data appropriately. Transparency asks whether users can understand when they are interacting with AI. Accuracy asks whether evaluation demonstrates acceptable quality.
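The five criteria can be standardized as a shared rubric that every review works through. A minimal sketch (the rubric structure and function name are assumptions) that reports which criteria a submission has not yet satisfied:

```python
# Hypothetical standardized rubric: criterion -> the question the board asks.
REVIEW_CRITERIA = {
    "safety": "Could this AI cause harm if it fails?",
    "fairness": "Could this AI perform differently across user groups?",
    "privacy": "Does this AI handle data appropriately?",
    "transparency": "Can users understand when they are interacting with AI?",
    "accuracy": "Does evaluation demonstrate acceptable quality?",
}

def unsatisfied_criteria(findings: dict[str, bool]) -> list[str]:
    """Criteria the submission has not yet satisfied.

    Criteria absent from the findings count as unsatisfied, so an
    incomplete review cannot pass by omission.
    """
    return [c for c in REVIEW_CRITERIA if not findings.get(c, False)]
```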