TalkEdit/FEATURES.md
2026-04-03 10:25:48 -06:00

TalkEdit — Feature Roadmap

Features are grouped by priority. Check off items as they are implemented.


🔴 High Priority — Core editing gaps

  • Silence / pause trimmer — detect and auto-remove pauses longer than X ms. One backend endpoint (/audio/remove-silence) + a button in the UI. Saves enormous time in podcast/interview editing.
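
    A minimal sketch of the command the /audio/remove-silence endpoint could run, using ffmpeg's silenceremove filter. The function name and the default threshold/duration values are illustrative, not part of the roadmap:

    ```python
    def silence_removal_args(src: str, dst: str,
                             min_pause_ms: int = 500,
                             threshold_db: int = -40) -> list[str]:
        """Build an ffmpeg command that strips pauses longer than min_pause_ms.

        stop_periods=-1 tells silenceremove to cut every qualifying
        silence in the file, not just the first one.
        """
        af = (f"silenceremove=stop_periods=-1"
              f":stop_duration={min_pause_ms / 1000}"
              f":stop_threshold={threshold_db}dB")
        return ["ffmpeg", "-y", "-i", src, "-af", af, dst]
    ```

    The same builder could back the batch silence removal item below by simply pointing it at the whole file.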

  • Volume / gain control — per-selection or global audio gain slider. Every editor has this, and Descript users regularly complain it's missing. Backend: ffmpeg -af volume=XdB.

  • Speed adjustment — slow down or speed up a selection or the whole clip. Backend: ffmpeg setpts (video) + atempo (audio), applied together so the two streams stay in sync. Common use case: slightly speed up boring sections.
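
    One wrinkle worth planning for: portable ffmpeg builds only accept atempo factors in [0.5, 2.0], so larger speed changes have to be expressed as a chain of filters. A sketch of that decomposition (the video side would pair it with setpts=PTS/speed):

    ```python
    def atempo_chain(speed: float) -> str:
        """Decompose a speed factor into chained atempo filters.

        e.g. 4x playback becomes "atempo=2.0,atempo=2.0" because a
        single atempo instance is limited to the [0.5, 2.0] range
        on older ffmpeg builds.
        """
        if speed <= 0:
            raise ValueError("speed must be positive")
        parts = []
        while speed > 2.0:
            parts.append(2.0)
            speed /= 2.0
        while speed < 0.5:
            parts.append(0.5)
            speed /= 0.5
        parts.append(round(speed, 6))
        return ",".join(f"atempo={p}" for p in parts)
    ```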

  • Cut preview — before committing a delete, play what the audio will sound like with that section removed (pre-listen across the edit point). Pure frontend using Web Audio API — splice the AudioBuffer and play the join.

  • Timeline shows output length — deleted regions should visually collapse (or show as narrow gaps) so the user sees the output duration, not just the source duration.
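
    Computing the collapsed output duration is a small interval-merge problem, since overlapping word deletions must not be subtracted twice. A sketch (the function name and tuple shape are assumptions about the project state, not existing code):

    ```python
    def output_duration(source_s: float,
                        deleted: list[tuple[float, float]]) -> float:
        """Duration of the rendered output after removing deleted regions.

        Overlapping or nested deletions are merged first so shared
        time is only subtracted once.
        """
        removed = 0.0
        cur_start = cur_end = None
        for start, end in sorted(deleted):
            if cur_end is None or start > cur_end:
                if cur_end is not None:
                    removed += cur_end - cur_start  # close previous run
                cur_start, cur_end = start, end
            else:
                cur_end = max(cur_end, end)  # extend overlapping run
        if cur_end is not None:
            removed += cur_end - cur_start
        return source_s - removed
    ```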


🟡 Medium Priority — Widely expected features

  • Transcript search (Ctrl+F) — find words/phrases in the transcript and highlight matches. Pure frontend. Critical for long-form content. Jump between matches with Enter.

  • Mark In / Out + delete (I / O keys) — keyboard shortcuts to mark a time range on the timeline, then delete it. Faster than click-dragging words. Store the in/out points in state, Delete removes them.

  • Low-confidence word highlighting — WhisperX already returns confidence per word. Words below a threshold (e.g. < 0.6) should be visually underlined or tinted so the user knows where to double-check.

  • Re-transcribe selection — if Whisper gets a section wrong, let the user select a word range and re-run transcription on just that segment (optionally with a different model or language).

  • Word text correction — allow editing the transcript text of a word without affecting its timing. Whisper gets homophones/proper nouns wrong constantly. Pure frontend state change; no backend needed.

  • Named timeline markers — drop named marker pins on the waveform (like Resolve markers). Store as { id, time, label, color } in the project. Rendered as colored triangles on the timeline canvas.

  • Chapters — group markers into named chapter ranges. Useful for podcasts and lectures. Exportable as YouTube chapter timestamps in the description.
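
    The YouTube export side of chapters is just timestamp formatting; YouTube expects the description to start at 00:00 and uses H:MM:SS once a chapter begins past the one-hour mark. A sketch of that serializer (function name and input shape are illustrative):

    ```python
    def youtube_chapters(chapters: list[tuple[float, str]]) -> str:
        """Format (start_seconds, label) chapters as YouTube description lines."""
        def stamp(t: float) -> str:
            s = int(t)
            h, rem = divmod(s, 3600)
            m, sec = divmod(rem, 60)
            return f"{h}:{m:02d}:{sec:02d}" if h else f"{m:02d}:{sec:02d}"
        return "\n".join(f"{stamp(t)} {label}"
                         for t, label in sorted(chapters))
    ```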


🟢 Lower Priority — Differentiating / power features

  • Audio normalization / loudness targeting — single "Normalize" button that targets a LUFS level (-14 for YouTube, -16 for Spotify). Backend: ffmpeg -af loudnorm. Very high value for podcasters, roughly 2–3 hours of work.
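
    A single-pass sketch of the loudnorm invocation; the TP/LRA defaults here are illustrative. Note loudnorm is more accurate in two-pass mode (a first run with print_format=json to measure, then a second run feeding the measured values back), which would cost an extra ffmpeg pass:

    ```python
    def loudnorm_args(src: str, dst: str,
                      target_lufs: float = -14.0) -> list[str]:
        """Single-pass ffmpeg loudness normalization toward a LUFS target."""
        af = f"loudnorm=I={target_lufs}:TP=-1.5:LRA=11"
        return ["ffmpeg", "-y", "-i", src, "-af", af, dst]
    ```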

  • Background music track — a second audio track for background music with volume ducking. Major gap in Descript that TalkEdit could own. Backend: ffmpeg amix for mixing; sidechaincompress (or timed asendcmd volume commands) for auto-ducking under speech.

  • Video zoom / punch-in — scale and position the video (crop, zoom, pan). Used constantly on talking-head videos for emphasis. Backend: ffmpeg -vf crop/scale/zoompan.

  • Multi-clip / append — load a second video and append it to the timeline. Even without a full multi-track timeline, "append clip" is a heavily used workflow.

  • Clip thumbnail strip — video frame thumbnails along the timeline so users can navigate visually, not only by waveform. Backend: ffmpeg thumbnail extraction at regular intervals.

  • Batch silence removal — full-file scan + remove all pauses above threshold in one click. Distinct from the manual trimmer above; this is a "fix the whole file" operation.

  • Export to transcript text / SRT only — some users just want a clean .txt or .srt of the edited transcript without rendering video.
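
    Since word/segment timings already exist in project state, SRT-only export is pure serialization. One detail that trips people up: SRT timestamps use a comma before the milliseconds (HH:MM:SS,mmm). A sketch, with the input shape assumed as (start_s, end_s, text) segments:

    ```python
    def to_srt(segments: list[tuple[float, float, str]]) -> str:
        """Serialize (start_s, end_s, text) segments as an SRT document."""
        def stamp(t: float) -> str:
            ms = int(round(t * 1000))
            h, rem = divmod(ms, 3_600_000)
            m, rem = divmod(rem, 60_000)
            s, ms = divmod(rem, 1000)
            return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
        blocks = [f"{i}\n{stamp(a)} --> {stamp(b)}\n{text}"
                  for i, (a, b, text) in enumerate(segments, 1)]
        return "\n\n".join(blocks) + "\n"
    ```

    Plain .txt export is the same walk with the timestamps dropped.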


💡 TalkEdit competitive advantages to lean into

These aren't features to build — they're things to make more visible in the UI and README:

  • 100% offline / no account required — CapCut requires login and sends data to servers. Descript is cloud-first. TalkEdit never leaves the machine.
  • Local AI models — Ollama support means no API costs and no data leaving the device.
  • Word-level precision — editing by deleting words (not dragging razor cuts) is faster for talking-head content than any timeline-based editor.
  • Works on long files — virtualized transcript + chunked waveform handles 1hr+ content that bogs down CapCut.

Already Implemented

  • Word-level transcript editing (select, drag, shift-click, delete)
  • Ctrl+click word → seek timeline to that position
  • Waveform timeline with zoom (Ctrl+scroll), scroll, drag-to-scrub playhead
  • Auto-scroll waveform when playhead goes off-screen
  • AI filler word detection and removal (Ollama / OpenAI / Claude)
  • AI clip suggestions for social media
  • Noise reduction (DeepFilterNet or FFmpeg ANLMDN)
  • Export: fast stream-copy or full reencode (MP4/MOV/WebM, 720p/1080p/4K)
  • Captions: SRT, VTT, ASS burn-in with font/color/position options
  • Speaker diarization
  • Project save / load (.aive JSON format)
  • Undo / redo (100-level history via Zundo)
  • Multi-format input (MP4, MKV, MOV, AVI, WebM, M4A)
  • Keyboard shortcuts (Space, J/K/L, arrows, Ctrl+Z/Shift+Z, Ctrl+S, Ctrl+E)
  • Settings panel: AI provider config (Ollama, OpenAI, Claude)