# AI Dev Roadmap

## Purpose

This document defines how TalkEdit can evolve toward highly autonomous AI-driven implementation and debugging.
Goal: AI can execute most engineering work end-to-end with minimal human feedback while preserving safety, quality, and product intent.

## Scope

- Frontend: React + TypeScript + Vite
- Desktop host: Tauri
- Backend: FastAPI + Python services
- Media pipeline: FFmpeg, transcription, audio processing

## Autonomy Target

- Near-term target: 80-90% autonomous execution for well-scoped work.
- Mid-term target: 90-95% for low/medium-risk features with CI gates.
- 100% no-feedback autonomy is not realistic for ambiguous product decisions, legal/security tradeoffs, or high-risk migrations.
## Core Principles

1. Specs are executable and machine-readable.
2. Tests are the primary source of truth for completion.
3. Every failure is diagnosable from logs/artifacts.
4. AI has bounded permissions and policy guardrails.
5. AI updates docs and memory as part of the done criteria.

## Execution Status (2026-04-15)

### Completed

1. Added roadmap companion docs:
   - `docs/spec-template.md`
   - `docs/ai-policy.md`
   - `docs/runbooks/error-codes.md`
2. Added operational scripts:
   - `scripts/validate-all.sh`
   - `scripts/collect-diagnostics.sh`
3. Ran the Step 1 validation script (`./scripts/validate-all.sh`).
4. Ran the Step 2 diagnostics script (`./scripts/collect-diagnostics.sh`).
5. Captured the diagnostics archive:
   - `.diagnostics/diag_20260415_163239.tar.gz`
6. Renamed the roadmap file to `AI_dev_plan.md`.

### Current Blockers

1. The frontend lint baseline is not green yet.
2. Remaining lint issues are mostly pre-existing unused variables and hook dependency warnings across app components.

### Next Actions

1. Triage existing lint findings into:
   - safe autofix
   - manual low-risk cleanup
   - intentional warnings to suppress, with justification
2. Reach a green `./scripts/validate-all.sh` in local dev.
3. Add a CI workflow to enforce `validate-all` on pull requests.
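
The triage in step 1 can be mechanized by bucketing findings on rule id. A minimal sketch, assuming ESLint-style findings; the specific rule names (`prefer-const`, `react-hooks/exhaustive-deps`, etc.) are illustrative assumptions, not the project's actual lint baseline:

```python
# Hypothetical rule-id buckets; adjust to the real lint output.
SAFE_AUTOFIX = {"prefer-const", "no-extra-semi"}
SUPPRESS_WITH_REASON = {"react-hooks/exhaustive-deps"}


def triage(findings: list[dict]) -> dict[str, list[dict]]:
    """Bucket lint findings into the three remediation paths."""
    buckets = {"autofix": [], "manual": [], "suppress": []}
    for f in findings:
        if f["rule"] in SAFE_AUTOFIX:
            buckets["autofix"].append(f)
        elif f["rule"] in SUPPRESS_WITH_REASON:
            buckets["suppress"].append(f)
        else:
            buckets["manual"].append(f)
    return buckets
```

Anything landing in `manual` stays human-reviewed until the baseline is green.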
## Roadmap Phases

## Phase 0: Foundation (1-2 weeks)

### Deliverables

1. Deterministic dev and test environment.
2. Baseline lint/type/test commands working in CI and locally.
3. Standardized log format across frontend, backend, and Tauri host.

### Tasks

1. Stabilize toolchain commands:
   - frontend lint/typecheck/test
   - backend lint/typecheck/test
   - workspace e2e smoke command
2. Add a single script for local validation, for example `npm run validate:all`.
3. Introduce structured logging fields:
   - timestamp
   - request/job id
   - subsystem (frontend/backend/host)
   - error code
4. Add reproducible media fixtures for tests under a dedicated test-fixtures path.
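
On the backend side, the structured fields from task 3 can ride on the standard `logging` module. A stdlib-only sketch; the exact field names (`ts`, `job_id`, `subsystem`, `error_code`) are assumptions to be settled when the log format is standardized:

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created)),
            "job_id": getattr(record, "job_id", None),
            "subsystem": getattr(record, "subsystem", "backend"),
            "error_code": getattr(record, "error_code", None),
            "level": record.levelname,
            "msg": record.getMessage(),
        }
        return json.dumps(payload)


# Usage: attach the formatter and pass structured fields via `extra`.
logger = logging.getLogger("talkedit")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.warning("decode failed", extra={"job_id": "j-123", "error_code": "TE-MEDIA-001"})
```

The frontend and Tauri host would emit the same field set in their own logging layers so diagnostics tooling can merge all three streams.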
### Exit Criteria

- A fresh clone can run validation with one command.
- CI produces deterministic pass/fail on clean branches.
- Failures include enough context to reproduce without manual guessing.

## Phase 1: Spec + Test Contracts (2-4 weeks)

### Deliverables

1. Feature spec template used for all new work.
2. API and schema contracts versioned and validated.
3. Regression harness for previous bugs.

### Tasks

1. Create `docs/spec-template.md` with required sections:
   - user story
   - acceptance criteria
   - non-goals
   - edge cases
   - rollback behavior
2. Add contract tests for backend routers:
   - transcribe
   - export
   - captions
   - audio
3. Add project schema validation tests for `shared/project-schema.json` and project load/save behavior.
4. For each resolved bug, add a regression test before closing the issue.
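
The schema validation tests in task 3 can start from a small stdlib-only checker before adopting a full JSON Schema validator. This is a sketch only: the required keys (`version`, `media`, `zones`) are assumptions, and the real contract lives in `shared/project-schema.json`:

```python
import json

# Hypothetical required keys; the authoritative set is shared/project-schema.json.
REQUIRED_KEYS = {"version", "media", "zones"}


def validate_project(raw: str) -> list[str]:
    """Return human-readable problems with a project file; empty means valid."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(doc, dict):
        return ["top level must be an object"]
    problems = [f"missing required key: {k}" for k in sorted(REQUIRED_KEYS - doc.keys())]
    if "version" in doc and not isinstance(doc["version"], int):
        problems.append("version must be an integer")
    return problems
```

Returning a problem list rather than raising keeps the function usable both in tests and in AI-readable failure artifacts.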
### Exit Criteria

- New feature PRs must include a spec and tests.
- Breaking contract changes are detected automatically in CI.

## Phase 2: Observability and Self-Debugging (2-3 weeks)

### Deliverables

1. Unified diagnostics bundle command.
2. AI-readable failure artifacts from CI and local runs.
3. Error taxonomy and runbook mapping.

### Tasks

1. Implement a diagnostics command to collect:
   - frontend logs
   - backend logs
   - Tauri logs
   - failing test outputs
   - environment metadata
2. Define error codes for common failure classes:
   - media decode
   - FFmpeg pipeline
   - transcription model
   - project load/save
   - network/IPC bridge
3. Add a runbook table mapping error codes to probable causes and first fixes.
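
Tasks 2 and 3 pair naturally: one enum per failure class, one runbook entry per enum member. A sketch under assumed code prefixes (`TE-MEDIA`, `TE-FFMPEG`, ...); the real values belong in `docs/runbooks/error-codes.md`:

```python
from enum import Enum


class ErrorClass(Enum):
    # Hypothetical code prefixes; authoritative list: docs/runbooks/error-codes.md.
    MEDIA_DECODE = "TE-MEDIA"
    FFMPEG_PIPELINE = "TE-FFMPEG"
    TRANSCRIPTION = "TE-ASR"
    PROJECT_IO = "TE-PROJ"
    IPC_BRIDGE = "TE-IPC"


RUNBOOK = {
    ErrorClass.MEDIA_DECODE: "Check container/codec support; try re-muxing the fixture.",
    ErrorClass.FFMPEG_PIPELINE: "Re-run the failing FFmpeg command from the log verbatim.",
    ErrorClass.TRANSCRIPTION: "Verify model files exist and match the pinned version.",
    ErrorClass.PROJECT_IO: "Validate the project file against shared/project-schema.json.",
    ErrorClass.IPC_BRIDGE: "Confirm backend port and health before blaming the frontend.",
}


def first_fix(code: str) -> str:
    """Map a raw error code like 'TE-FFMPEG-003' to its runbook first fix."""
    for cls in ErrorClass:
        if code.startswith(cls.value):
            return RUNBOOK[cls]
    return "Unknown class: escalate and add a new runbook entry."
```

The unknown-class fallback matters: every unmapped failure is a signal that the taxonomy needs a new entry.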
### Exit Criteria

- The agent can identify a likely root cause from artifacts without asking for manual logs.
- 80%+ of recurring failures map to known error classes.

## Phase 3: Controlled Autonomous Implementation (3-5 weeks)

### Deliverables

1. Policy file defining what AI can edit/run without approval.
2. Autonomous task loop for implement -> validate -> fix -> revalidate.
3. Automatic PR summary with risk and assumptions.

### Tasks

1. Add a policy file (for example `docs/ai-policy.md`) covering:
   - allowed directories for autonomous edits
   - blocked files requiring approval
   - blocked commands
2. Add a task template for AI execution:
   - parse the feature spec
   - locate impacted modules
   - implement the smallest viable changes
   - run the validation suite
   - retry up to N fix cycles
   - produce a summary + residual risks
3. Require AI to update:
   - copilot instructions
   - changelog/roadmap note
   - regression tests when bugfixing
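
The validate/fix/revalidate loop from the task template can be reduced to a bounded retry skeleton. A sketch, with `validate` and `attempt_fix` as injected callables; in practice `validate` would shell out to `./scripts/validate-all.sh`:

```python
from typing import Callable


def run_with_retries(
    validate: Callable[[], bool],
    attempt_fix: Callable[[int], None],
    max_fix_cycles: int = 3,
) -> tuple[bool, int]:
    """Run validation; on failure, fix and revalidate up to a hard bound.

    Returns (passed, fix_cycles_used). A False result is the signal
    to stop and escalate to a human rather than keep thrashing.
    """
    if validate():
        return True, 0
    for cycle in range(1, max_fix_cycles + 1):
        attempt_fix(cycle)
        if validate():
            return True, cycle
    return False, max_fix_cycles
```

Keeping the bound explicit (and small) is what makes the loop safe to run unattended.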
### Exit Criteria

- Low-risk feature tasks complete end-to-end without human intervention.
- CI gate pass rate for autonomous PRs remains above an agreed threshold (for example 95%).

## Phase 4: High-Autonomy with Human Escalation (ongoing)

### Deliverables

1. Explicit escalation triggers for ambiguity and risk.
2. Broader autonomous scope with mandatory gates.
3. Drift monitoring for quality, velocity, and regressions.

### Tasks

1. Define escalation triggers:
   - user-visible behavior changes without a clear spec
   - API/schema breakage
   - security-sensitive modifications
   - destructive migrations
2. Add quality dashboards:
   - flaky tests
   - escaped defects
   - mean time to recovery
   - autonomous task success rate
3. Monthly calibration:
   - adjust autonomy scope
   - update policies
   - prune stale runbooks and memories
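
The escalation triggers in task 1 become enforceable once a change set is described in a structured form. A sketch only: the `ChangeSummary` shape and the security-sensitive path prefixes are assumptions, not an existing project type:

```python
from dataclasses import dataclass, field


@dataclass
class ChangeSummary:
    # Hypothetical shape for describing a proposed change set.
    touched_paths: list[str] = field(default_factory=list)
    changes_user_visible_behavior: bool = False
    has_spec: bool = True
    breaks_contract: bool = False
    destructive_migration: bool = False


# Assumed security-sensitive prefixes; align with the real repo layout.
SECURITY_SENSITIVE = ("src-tauri/", "backend/app/auth")


def escalation_reasons(change: ChangeSummary) -> list[str]:
    """Return every trigger that fires; empty means no escalation needed."""
    reasons = []
    if change.changes_user_visible_behavior and not change.has_spec:
        reasons.append("user-visible change without a spec")
    if change.breaks_contract:
        reasons.append("API/schema breakage")
    if any(p.startswith(SECURITY_SENSITIVE) for p in change.touched_paths):
        reasons.append("security-sensitive paths touched")
    if change.destructive_migration:
        reasons.append("destructive migration")
    return reasons
```

Returning all firing reasons (not just the first) gives the human reviewer the full risk picture in one pass.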
### Exit Criteria

- Autonomous throughput increases while the defect rate stays stable or improves.
- Human review focuses on strategy and product decisions, not routine implementation/debugging.

## Required Engineering Systems

## 1. Spec System

Minimum implementation:

1. `docs/spec-template.md`
2. `docs/specs/` folder with one file per feature
3. CI check that new feature PRs include a spec reference
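
The CI check in item 3 can be as small as a pattern match over the PR description. A sketch assuming specs are referenced by path (`docs/specs/<name>.md`); the naming convention is an assumption:

```python
import re

# Assumes spec files are referenced by path, e.g. docs/specs/zone-editing.md.
SPEC_REF = re.compile(r"docs/specs/[\w\-]+\.md")


def has_spec_reference(pr_body: str) -> bool:
    """True if the PR description references at least one spec file."""
    return bool(SPEC_REF.search(pr_body))
```

A CI job would fetch the PR body, call this, and fail the check with a pointer to `docs/spec-template.md` when it returns False.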
## 2. Test System

Minimum implementation:

1. Frontend unit tests for stores/components/hook logic.
2. Backend unit + integration tests for routers/services.
3. E2E smoke tests for the core workflow:
   - open media
   - transcribe
   - edit zones
   - export
4. Regression tests required for every bugfix.

## 3. Environment System

Minimum implementation:

1. Locked dependencies and pinned runtimes.
2. Single bootstrap script.
3. Fixture media files for deterministic test runs.

## 4. Observability System

Minimum implementation:

1. Structured logs.
2. Standard error codes.
3. Diagnostics bundle command.
4. CI artifact retention for failed runs.
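
The diagnostics bundle command (item 3) maps to a small collector: gather logs plus environment metadata into one archive, mirroring what `scripts/collect-diagnostics.sh` does. A stdlib sketch; the directory layout and archive naming are assumptions:

```python
import json
import platform
import tarfile
import time
from pathlib import Path


def collect_diagnostics(log_dirs: list[Path], out_dir: Path) -> Path:
    """Bundle *.log files plus environment metadata into one tar.gz."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    bundle = out_dir / f"diag_{stamp}.tar.gz"
    meta = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "captured_at": stamp,
    }
    meta_file = out_dir / "env.json"
    meta_file.write_text(json.dumps(meta, indent=2))
    with tarfile.open(bundle, "w:gz") as tar:
        tar.add(meta_file, arcname="env.json")
        for d in log_dirs:
            for f in sorted(d.glob("*.log")):
                tar.add(f, arcname=f"logs/{d.name}/{f.name}")
    return bundle
```

Stable archive-internal paths (`env.json`, `logs/<subsystem>/...`) are what make the bundle AI-readable without per-run guessing.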
## 5. Governance System

Minimum implementation:

1. Protected branch + required checks.
2. Secret and dependency scanning.
3. Policy-based approval requirements for high-risk changes.
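
Item 3's policy can be evaluated mechanically per touched path. A sketch only: the allow-list and blocked entries below are placeholders, and the authoritative lists belong in `docs/ai-policy.md`:

```python
# Assumed policy; the real lists belong in docs/ai-policy.md.
AUTONOMOUS_DIRS = ("frontend/src", "backend/app", "docs")
APPROVAL_REQUIRED = ("shared/project-schema.json", ".github/workflows")


def needs_approval(path: str) -> bool:
    """True if a change to `path` must be approved by a human."""
    # Blocked entries win even when nested inside an allowed directory.
    if any(path == b or path.startswith(b + "/") for b in APPROVAL_REQUIRED):
        return True
    # Anything outside the allow-list defaults to requiring approval.
    return not any(path == d or path.startswith(d + "/") for d in AUTONOMOUS_DIRS)
```

Defaulting to "requires approval" keeps unlisted paths safe by construction.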
## Suggested Repository Additions

1. `AI_dev_plan.md` (this file)
2. `docs/spec-template.md`
3. `docs/ai-policy.md`
4. `docs/runbooks/error-codes.md`
5. `docs/runbooks/debug-playbooks.md`
6. `scripts/validate-all.sh`
7. `scripts/collect-diagnostics.sh`

## Definition of Done for Autonomous Tasks

A task is complete only if all of the following pass:

1. Feature spec acceptance criteria satisfied.
2. Relevant tests added/updated and passing.
3. No lint/type errors in the changed scope.
4. Docs and instructions updated if behavior changed.
5. Risk summary and assumptions recorded.

## Escalation Rules (Must Ask Human)

AI must stop and ask when:

1. Requirement ambiguity changes user-visible behavior.
2. Multiple valid product decisions exist without a clear preference.
3. Security/privacy/compliance implications are uncertain.
4. Data loss or a destructive migration is possible.
5. CI remains failing after bounded auto-fix attempts.

## Metrics to Track

1. Autonomous task success rate.
2. Reopen rate of AI-completed tasks.
3. Regression rate per release.
4. Flaky test percentage.
5. Mean time to diagnose and resolve failures.
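
Several of these metrics can be derived from one stream of per-task records. A sketch assuming each record carries `succeeded`, `reopened`, and optionally `hours_to_resolve`; the record shape is an assumption, not an existing data model:

```python
from statistics import mean


def autonomy_metrics(tasks: list[dict]) -> dict:
    """Summarize success rate, reopen rate, and mean time to resolve.

    Assumed record shape: succeeded (bool), reopened (bool),
    hours_to_resolve (float, present only where a failure was diagnosed).
    """
    total = len(tasks)
    succeeded = [t for t in tasks if t["succeeded"]]
    resolve_times = [t["hours_to_resolve"] for t in tasks if "hours_to_resolve" in t]
    return {
        "success_rate": len(succeeded) / total if total else 0.0,
        "reopen_rate": sum(t["reopened"] for t in succeeded) / len(succeeded) if succeeded else 0.0,
        "mean_hours_to_resolve": mean(resolve_times) if resolve_times else None,
    }
```

Computing reopen rate only over completed tasks keeps it answering "how often does 'done' turn out to be wrong?"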
## 30-Day Execution Plan

Week 1:

1. Baseline scripts and deterministic environment.
2. Restore lint/test commands to green status.
3. Add structured logging and IDs.

Week 2:

1. Spec template and mandatory spec policy.
2. Contract tests for core backend routes.
3. First diagnostics bundle version.

Week 3:

1. AI policy and bounded autonomous edit/run loop.
2. Regression-test-first bugfix workflow.
3. CI artifact enrichment and runbook mapping.

Week 4:

1. Pilot autonomous feature tasks in low-risk areas.
2. Measure success/failure patterns.
3. Expand scope only if quality gates hold.

## Notes for TalkEdit

1. Keep router files thin and service logic isolated to improve AI edit precision.
2. Preserve compatibility in desktop bridge contracts to avoid frontend breakage.
3. Treat export/transcription pipeline changes as high-risk and always require regression tests.
4. Keep Linux WebKit startup and media URL consistency as explicit regression targets.