# TalkEdit — Feature Roadmap

Features are grouped by priority. Check off items as they are implemented.

---

## 🔴 High Priority — Core editing gaps

- [x] [#001] **Cut / Mute sections** — select a time range and choose to cut (remove entirely) or mute (silence audio while video continues). Cut sections show as red overlays, mute sections as transparent blue overlays on the timeline over the transcript text and audio waveform. Backend: `ffmpeg -af volume=0` for mute, time-based cutting for removal.

- [x] [#002] **Silence / pause trimmer (in progress)** — detect pauses using min duration (ms) + amplitude threshold (dB), then apply detected pauses as cut ranges. Initial endpoint: `/audio/detect-silence`; UI includes filter controls and an "Apply As Cuts" action.

- [x] [#003] **Operation-level undo for batch actions** — explicit undo entry for actions like "Apply Silence Trim" so one shortcut/click reverts the whole operation, while still allowing normal fine-grained undo/redo steps.

- [x] [#004] **Grouped silence-trim zones (editable batch)** — when pauses are applied, tag them as a batch (`trim_group_id`) so the user can: (1) delete all zones from that auto-trim pass at once, and (2) still select/resize/delete individual zones independently.

- [x] [#005] **Edit silence-trim group settings after apply** — allow reopening a trim group and changing its detection settings (min pause ms, threshold dB, pre/post buffers), then reapplying updates to that group without affecting unrelated edits.

- [x] [#006] **Volume / gain control** — per-selection or global audio gain slider. Every editor has this. Descript users constantly complain it's missing. Backend: `ffmpeg -af volume=XdB` (the decibel suffix is written `dB`).

- [x] [#007] **Speed adjustment (4th zone type)** — add speed zones as the fourth editable timeline/transcript zone type (after cut, mute, gain), allowing slow/fast playback per range or globally. Backend: `ffmpeg -filter:v setpts` + `atempo`. Common use case: slightly speed up boring sections.

- [x] [#008] **Cut preview** — before committing a delete, play what the audio will sound like with that section removed (pre-listen across the edit point). Pure frontend using Web Audio API — splice the AudioBuffer and play the join.

- [x] [#009] **Timeline shows output length** — deleted regions should visually collapse (or show as narrow gaps) so the user sees the *output* duration, not just the source duration.
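
The silence-trim items above (#002, #004, #009) can be sketched together as follows. This is a minimal illustration, not the project's implementation: the `Zone` class, the function names, the 10 ms RMS window, and the default values are all assumptions. Only the min-duration and dB-threshold parameters and the `trim_group_id` tag come from the roadmap.

```python
import math
import uuid
from dataclasses import dataclass


@dataclass
class Zone:
    # Hypothetical zone record; only trim_group_id is named in the roadmap.
    start: float        # seconds
    end: float          # seconds
    kind: str           # "cut" here; mute/gain/speed are the other zone types
    trim_group_id: str  # shared by every zone created in one detection pass


def detect_silence(samples, sample_rate, min_ms=500, threshold_db=-40.0, window_ms=10):
    """Return (start, end) ranges in seconds where windowed RMS stays below
    threshold_db for at least min_ms. Samples are floats in [-1.0, 1.0]."""
    window = max(1, int(sample_rate * window_ms / 1000))
    ranges, run_start = [], None
    for i in range(0, len(samples), window):
        chunk = samples[i:i + window]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        level_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
        t = i / sample_rate
        if level_db < threshold_db:
            if run_start is None:
                run_start = t  # a quiet run begins
        elif run_start is not None:
            if (t - run_start) * 1000 >= min_ms:
                ranges.append((run_start, t))
            run_start = None
    end_t = len(samples) / sample_rate
    if run_start is not None and (end_t - run_start) * 1000 >= min_ms:
        ranges.append((run_start, end_t))  # quiet run reaching end of file
    return ranges


def apply_as_cuts(ranges):
    """Tag every detected pause with one shared group id (one auto-trim pass)."""
    group_id = uuid.uuid4().hex
    return [Zone(s, e, "cut", group_id) for s, e in ranges]


def output_duration(source_duration, zones):
    """Source duration minus the union of all cut ranges (overlaps merged)."""
    cuts = sorted((z.start, z.end) for z in zones if z.kind == "cut")
    removed, cur_s, cur_e = 0.0, None, None
    for s, e in cuts:
        if cur_e is None or s > cur_e:
            if cur_e is not None:
                removed += cur_e - cur_s
            cur_s, cur_e = s, e
        else:
            cur_e = max(cur_e, e)
    if cur_e is not None:
        removed += cur_e - cur_s
    return source_duration - removed
```

Deleting a whole auto-trim pass then amounts to filtering out zones by `trim_group_id`, while individual zones stay editable because each one is still its own record.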

---

## 🟡 Medium Priority — Widely expected features

- [x] [#010] **Transcript search (Ctrl+F)** — find words/phrases in the transcript and highlight matches. Pure frontend. Critical for long-form content. Jump between matches with Enter.

- [x] [#011] **Mark In / Out + delete (I / O keys)** — keyboard shortcuts to mark a time range on the timeline, then delete it. Faster than click-dragging words. Store the in/out points in state, `Delete` removes them.

- [ ] [#012] **Low-confidence word highlighting** — WhisperX already returns a per-word confidence value (the `score` field in its aligned output). Words below a threshold (e.g. < 0.6) should be visually underlined or tinted so the user knows where to double-check.

- [ ] [#013] **Re-transcribe selection** — if Whisper gets a section wrong, let the user select a word range and re-run transcription on just that segment (optionally with a different model or language).

- [ ] [#014] **Optional VibeVoice-ASR-HF transcription backend (future)** — evaluate as an alternate transcription mode for long-form, speaker-attributed transcripts. Keep WhisperX as the default for word-level timestamp editing.

- [ ] [#015] **Word text correction** — allow editing the transcript text of a word without affecting its timing. Whisper gets homophones/proper nouns wrong constantly. Pure frontend state change; no backend needed.

- [ ] [#016] **Named timeline markers** — drop named marker pins on the waveform (like Resolve markers). Store as `{ id, time, label, color }` in the project. Rendered as colored triangles on the timeline canvas.

- [ ] [#017] **Chapters** — group markers into named chapter ranges. Useful for podcasts and lectures. Exportable as YouTube chapter timestamps in the description.

- [ ] [#041] **Customizable hotkeys / keymap editor (left-hand focused)** — allow users to view, remap, and reset keyboard shortcuts (transport, edit, save/export, zone tools), with a default preset optimized for left-hand reach (Q/W/E/R/A/S/D/F/Z/X/C/V + modifiers). Include conflict detection, an alternate standard preset, and one-click "restore defaults".
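
For the chapters item (#017), exporting markers as YouTube-style timestamp lines might look like the sketch below. The marker shape `{ id, time, label, color }` comes from #016. The function names are hypothetical, and `time` is assumed to be in seconds.

```python
def format_timestamp(seconds):
    """Format seconds as M:SS, or H:MM:SS once the time reaches one hour."""
    total = int(seconds)  # drop fractional seconds
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}" if h else f"{m}:{s:02d}"


def youtube_chapters(markers):
    """Build one 'timestamp label' line per marker, ordered by time."""
    ordered = sorted(markers, key=lambda m: m["time"])
    return "\n".join(f"{format_timestamp(m['time'])} {m['label']}" for m in ordered)


markers = [
    {"id": "b", "time": 75.4, "label": "Setup", "color": "#00f"},
    {"id": "a", "time": 0, "label": "Intro", "color": "#f00"},
    {"id": "c", "time": 3725, "label": "Wrap-up", "color": "#0f0"},
]
print(youtube_chapters(markers))
```

The result is plain text that can be pasted into a video description. If cut zones exist, marker times would first need remapping from source time to output time.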

---

## 🟢 Lower Priority — Differentiating / power features

- [ ] [#018] **Audio normalization / loudness targeting** — single "Normalize" button that targets a LUFS level (-14 for YouTube and Spotify, -16 for Apple Podcasts). Backend: `ffmpeg -af loudnorm`. Very high value for podcasters, ~2–3 hours of work.

- [ ] [#019] **Background music track** — a second audio track for background music with volume ducking. Major gap in Descript that TalkEdit could own. Backend: `ffmpeg` amix + `asendcmd` for auto-ducking.

- [ ] [#020] **Video zoom / punch-in** — scale and position the video (crop, zoom, pan). Used constantly on talking-head videos for emphasis. Backend: `ffmpeg -vf crop/scale/zoompan`.

- [ ] [#021] **Multi-clip / append** — load a second video and append it to the timeline. Even without a full multi-track timeline, "append clip" is a heavily used workflow.

- [ ] [#022] **Clip thumbnail strip** — video frame thumbnails along the timeline so users can navigate visually, not only by waveform. Backend: `ffmpeg` thumbnail extraction at regular intervals.

- [ ] [#023] **Batch silence removal** — full-file scan + remove all pauses above threshold in one click. Distinct from the manual trimmer above; this is a "fix the whole file" operation.

- [ ] [#024] **Export to transcript text / SRT only** — some users just want a clean `.txt` or `.srt` of the edited transcript without rendering video.
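
The transcript-only export (#024) could be approached roughly as below: drop words that fall inside cut ranges, shift the remaining timestamps onto the output timeline, and group words into SRT cues. The word fields (`text`, `start`, `end`), the function names, and the fixed words-per-cue grouping are assumptions made for illustration.

```python
def remap_time(t, cuts):
    """Map source time t to output time by subtracting cut time before t.
    cuts must be sorted, non-overlapping (start, end) tuples in seconds."""
    removed = 0.0
    for s, e in cuts:
        if t >= e:
            removed += e - s   # whole cut lies before t
        elif t > s:
            removed += t - s   # t falls inside this cut
    return t - removed


def srt_timestamp(seconds):
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def export_srt(words, cuts, max_words=7):
    """Build SRT text from the words that survive the cuts."""
    kept = [w for w in words if not any(s <= w["start"] < e for s, e in cuts)]
    cues = []
    for n, i in enumerate(range(0, len(kept), max_words), start=1):
        group = kept[i:i + max_words]
        start = remap_time(group[0]["start"], cuts)
        end = remap_time(group[-1]["end"], cuts)
        text = " ".join(w["text"] for w in group)
        cues.append(f"{n}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(cues)
```

A plain `.txt` export is the same pipeline without the timing lines: join the `text` of the kept words.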

---

## 💡 TalkEdit competitive advantages to lean into

These aren't features to build — they're things to make more visible in the UI and README:

- **100% offline / no account required** — CapCut requires login and sends data to servers. Descript is cloud-first. With TalkEdit, your media never leaves the machine.
- **Local AI models** — Ollama support means no API costs and no data leaving the device.
- **Word-level precision** — editing by deleting words (not dragging razor cuts) is faster for talking-head content than any timeline-based editor.
- **Works on long files** — virtualized transcript + chunked waveform handles 1hr+ content that bogs down CapCut.

---

## ✅ Already Implemented

- [#025] Word-level transcript editing (select, drag, shift-click, delete)
- [#026] Ctrl+click word → seek timeline to that position
- [#027] Waveform timeline with zoom (Ctrl+scroll), scroll, drag-to-scrub playhead
- [#028] Auto-scroll waveform when playhead goes off-screen
- [#029] AI filler word detection and removal (Ollama / OpenAI / Claude)
- [#030] AI clip suggestions for social media
- [#031] Noise reduction (DeepFilterNet or FFmpeg ANLMDN)
- [#032] Export: fast stream-copy or full reencode (MP4/MOV/WebM, 720p/1080p/4K)
- [#033] Captions: SRT, VTT, ASS burn-in with font/color/position options
- [#034] Speaker diarization
- [#035] Project save / load (.aive JSON format)
- [#036] Undo / redo (100-level history via Zundo)
- [#037] Multi-format input (MP4, MKV, MOV, AVI, WebM, M4A)
- [#038] Keyboard shortcuts (Space, J/K/L, arrows, Ctrl+Z / Ctrl+Shift+Z, Ctrl+S, Ctrl+E)
- [#039] Settings panel: AI provider config (Ollama, OpenAI, Claude)
- [#040] Cut/mute range creation on timeline with draggable zone edits and Delete-to-remove