plans and features

2026-05-06 02:29:10 -06:00
parent 91217f6db0
commit 88cd9a21d0
2 changed files with 172 additions and 66 deletions
--- a/plan.md
+++ b/plan.md
@ -1,81 +1,158 @@
-# Plan for Building TalkEdit (Whisper.cpp + Tauri)
+# TalkEdit — Launch Plan

-Based on your original idea summary and our discussions, here's a detailed plan to build a standalone, local audio/video editor app. We'll modify CutScript as the base, migrate to **Tauri 2.0** (Rust backend + React frontend) for tiny, dependency-free installers, and use **Whisper.cpp** for fast, accurate transcription. This keeps the scope minimal, focuses on text-based editing for spoken content, and targets podcasters/YouTubers.
+## Niche: "Descript for long-form content"

-## 1. Overview
- **Goal**: Create an offline Descript alternative with word-level editing, transcription, and export. Users download one file (~10–20MB), install, and run—no Python, FFmpeg, or external deps.
- **Why This Stack**: Tauri bundles everything into a native app; Whisper.cpp (C++ lib) integrates seamlessly with Rust for CPU-efficient transcription. Faster than rebuilding from scratch.
- **Target Users**: Creators editing podcasts/videos; free core + Pro upgrades.
- **Key Differentiators**: Fully local, text-based editing like Google Docs, smart cuts with fades.
+TalkEdit's defensible position: **works on hour+ files without degrading**, fully offline, one-time payment. No competitor owns this — Descript chokes on long content, CapCut limits mobile uploads, and both require accounts.

-## 2. Tech Stack
- **Frontend**: React + Vite + Tailwind CSS + shadcn/ui (from CutScript; minimal changes).
- **Backend**: Tauri 2.0 (Rust) – handles file I/O, FFmpeg calls, Whisper.cpp integration.
- **Transcription**: Whisper.cpp (via Rust bindings like `whisper-cpp-sys` or `whisper-rs`).
- **Audio/Video Processing**: FFmpeg (bundled or called via Rust wrappers like `ffmpeg-next`).
- **State Management**: Zustand (from CutScript).
- **Packaging**: Tauri's `tauri build` for cross-platform installers.
- **AI Features**: Local models only (no APIs); optional Ollama for fillers.
+---

-## 3. Step-by-Step Development Plan
-1. **Set Up Tauri in CutScript** (1–2 weeks):
-   - Install `tauri-cli` globally.
-   - In CutScript root: `npx tauri init` (choose Rust backend, link to existing React frontend).
-   - Migrate Electron main.js to Tauri's `src/main.rs` (handle window, file dialogs).
-   - Update `tauri.conf.json` for app metadata, bundle settings.
+## Phase 1: Polish (pre-launch, do this first)

-2. **Integrate Whisper.cpp in Rust** (2–3 weeks):
-   - Add `whisper-cpp` as a dependency in `Cargo.toml`.
-   - Create a Rust module for transcription: Load models, process audio, return word-level timestamps.
-   - Replace Python backend calls with Tauri commands (e.g., `invoke` from frontend to Rust for transcription).
-   - Handle model downloads on first run (store in app data dir).
+### Reliability & error handling
+- [ ] Handle backend crashes gracefully — if Python backend dies, show a reconnect banner, don't leave UI in broken state
+- [ ] Transcription failure recovery — show which model/download step failed, suggest alternatives
+- [ ] Export failure reporting — surface FFmpeg stderr to user in a readable way
+- [ ] File locking / concurrent access — prevent exporting while transcription is running and vice versa

-3. **Migrate Audio/Video Logic** (2 weeks):
-   - Port FFmpeg calls to Rust (use `ffmpeg-next` for cutting/export).
-   - Implement segment calculation: From edited transcript, build keep_segments with padding/fades.
-   - Add audio cleaning (noise reduction via bundled tools or Rust libs).
+### UX roughness
+- [ ] Drag-and-drop file import onto the welcome screen
+- [ ] Loading spinners for every async action with descriptive messages ("Downloading model...", "Analyzing silence...")
+- [ ] Undo/redo visual feedback — toast notification "Undo: removed cut"
+- [ ] `?` keyboard shortcut opens a proper cheat sheet modal
+- [ ] Save indicator — dot or "unsaved" badge next to project name
+- [ ] Disabled state for all buttons during export/transcription to prevent double-clicks
+- [ ] Empty states for every panel ("Add your first marker", "No silence detected", etc.)

-4. **Frontend Polish** (1–2 weeks):
-   - Update UI for Tauri (file dialogs via `tauri-plugin-dialog`).
-   - Refine transcript editor: Better timestamp syncing, manual adjustments.
-   - Add export options (MP4 with subs, audio-only).
+### Trial & licensing UX
+- [ ] Wire up the "Activate" link to a real payment page (not placeholder `talked.it`)
+- [ ] Show days remaining in the welcome screen bar (done)
+- [ ] After trial expires, clearly explain what still works (export, loading) vs what's locked (editing, AI)
+- [ ] License key field should handle paste + validate format client-side before sending to Rust

-5. **Testing & Packaging** (1 week):
-   - Test on Windows/macOS/Linux; ensure Whisper runs offline.
-   - Bundle with `tauri build`; verify no external deps.
-   - Add auto-updater for Pro features.
+### Performance
+- [ ] Lazy-load the waveform for very long files (>2hr) — don't fetch entire WAV at once
+- [ ] Virtualize the waveform canvas rendering (only draw visible portion), not just transcript
+- [ ] Debounce project auto-save

-6. **Launch & Iterate** (Ongoing):
-   - Open-source core on GitHub.
-   - Market on Product Hunt, Reddit; gather feedback.
+---

-## 4. MVP Features (Minimal but Useful)
-Focus on what creators need for spoken content:
- **Drag-and-drop import**: Audio/video files; auto-extract audio.
- **One-click transcription**: Whisper.cpp with model choice (Fast - less accurate: tiny/base; Slow - more accurate: small/medium/large).
- **Text-based editing**: Scrollable transcript; click word → jump to video; select/delete words → auto-cut audio with 150ms fades.
- **Smart cleanup**: Remove fillers ("um", pauses >0.8s) via local AI.
- **Preview & Export**: Synced preview; export MP4/audio with optional SRT subs.
- **Undo/Redo**: Full edit history.
+## Phase 2: Standout features (own the niche)

-No multi-track, voice cloning, or collaboration—keep it simple.
+### Long-form content (win here)
+- [ ] **Chapter-based navigation** — markers auto-sorted, click to jump, usable on 3hr files (partially done)
+- [ ] **Per-segment re-transcription** without losing surrounding context (done)
+- [ ] **Append multiple clips** into one timeline (done)
+- [ ] **Project stitching** — load multiple `.aive` projects, combine into one export
+- [ ] **Session memory** — re-open last project on launch, auto-restore cursor position and scroll
+- [ ] **Smart chunking** for transcription — for files >2hr, transcribe in overlapping chunks and stitch seamlessly

-## 4. Notes
- Consider adding Parakeet TDT as a transcription option in the future for users who want alternatives to Whisper.
+### Export differentiation
+- [ ] **YouTube chapters** — auto-generate from markers, copy as timestamps (done)
+- [ ] **Export chapter markers** — embed in MP4/MKV metadata for chapter skip in players
+- [ ] **Batch export** — export multiple projects or multiple cuts from one project in sequence
+- [ ] **Export transcript format presets** — SRT, VTT, TXT, TXT with timestamps, markdown (done)

-## 5. Monetization Model
- **Free Forever**: Core editing/transcription (unlimited local use).
- **Pro License** ($29–49 one-time): Batch processing, high-quality voices (if adding TTS), custom presets, priority support.
- **Optional Add-Ons**: Cloud credits for long videos (rarely needed).
+### AI features (local-first moat)

-## 6. Timeline & Milestones
- **Weeks 1–4**: Tauri setup + Whisper integration.
- **Weeks 5–6**: Audio logic migration + frontend tweaks.
- **Weeks 7–8**: Testing, packaging, launch prep.
- **Total**: 6–10 weeks to MVP (solo dev + AI).
+All powered by the bundled Qwen3 LLM. No API keys, no cloud calls. Features are grouped by how much they contribute to the core workflow.

-## 7. Risks & Tips
- **Risks**: Whisper.cpp compilation issues; Rust learning curve if new to it.
- **Tips**: Start with small models (base ~70MB); test timestamp accuracy early. Use Tauri's docs for migration. If stuck, fall back to bundling Python for Whisper (but avoid for true standalone).
- **Resources**: Tauri docs, Whisper.cpp GitHub, Rust audio crates.</content>
-<parameter name="filePath">/home/dillon/_code/audio_editor/plan.md
+#### Content creation (biggest value)
+- [ ] **Smart Shorts finder** — scan the full transcript for self-contained, engaging segments 10–90s long. Ranks by: narrative completeness (has a beginning/end), energy cues from transcript sentiment, and topic boundaries. Results shown as a list of suggested cut ranges with preview. One-click export as a separate short video or copy timecodes.
+- [ ] **Sound bite / quotable moment finder** — find punchy, standalone sentences that work as clips for social media. Ranks by quotability. Different from Shorts finder: these are <15s, single-sentence, high-impact lines.
+- [ ] **Hook analyzer** — score the first 30 seconds of the video for engagement. Suggests cuts or rephrases to make the intro stronger. Shows a "hook score" 1–10.
+- [ ] **Title + description generator** — suggest 5 titles and a YouTube description from the transcript + markers. One-click copy.
+
+#### Editing acceleration 
+- [ ] **AI auto-chapters** — detect topic shifts from transcript → create timeline markers. Uses topic segmentation from LLM, not just silence gaps.
+- [ ] **AI Smart Clean** — one-pass: filler removal + silence trim + normalize (done)
+- [ ] **AI sentence rephrase** — right-click word → rephrase with AI → replace in transcript (also done, uses backend)
+- [ ] **AI pacing analysis** — flag segments where the speaker talks too fast or too slow for sustained periods. Suggests speed range adjustments or cuts.
+- [ ] **AI dead-air finder** — finds moments where nothing interesting is said for >5s (rambling, off-topic, false starts). Different from silence trimmer — this is content-based, not audio-level-based.
+- [ ] **AI readabilty scan** — flag sentences that are too long, complex, or jargon-heavy for spoken word. Suggests simpler alternatives.
+
+#### Metadata & distribution
+- [ ] **AI show notes** — generate title, description, key moments, and timestamps from transcript + markers
+- [ ] **AI keyword/tag extraction** — pull out 5–10 topic tags from the transcript. Useful for YouTube SEO or categorization
+- [ ] **AI question finder** — detect all questions asked in the video (speaker or guest). Useful for Q&A videos, AMAs, interviews — jump to each question instantly
+- [ ] **AI thumbnail text suggestion** — suggest short overlay text for video thumbnails based on the most compelling line in the video
+- [ ] **AI call-to-action finder** — detect where the speaker asks for likes/subscribes/comments. Lets you trim or reposition CTAs
+
+#### Accessibility & compliance
+- [ ] **AI content flagging** — flag profanity, sensitive topics, or copyrighted references in the transcript. Color-coded by category. Useful before publishing to restricted platforms
+- [ ] **AI language leveling** — rewrite transcript segments at a target reading grade level (e.g., "simplify to 8th-grade level"). Useful for educational content or broad audiences
+
+### Bundled local LLM (killer friction-killer)
+The biggest UX gap: users must set up Ollama or paste an API key to use AI features. Bundle two small models — download on first AI use, just like Whisper models.
+
+**Three-tier AI provider choice** (set once, persisted):
+
+| Option | Hardware needed | Download | Setup |
+|--------|----------------|----------|-------|
+| Qwen3 4B (recommended) | 8GB+ free RAM | 2.5 GB | None — auto-download |
+| Qwen3 1.7B (lightweight) | 4GB+ free RAM | 1.0 GB | None — auto-download |
+| Ollama (bring your own) | Any | None | User starts Ollama themselves |
+
+Default to Qwen3 4B. If the machine can't meet the RAM threshold (checked via Tauri at runtime), fall back to 1.7B or prompt to set up Ollama.
+
+- [ ] **Integrate llama.cpp Rust bindings** (`llama-cpp-rs` or `candle`) — replace Python `ai_provider.py` calls with native inference for bundled models
+- [ ] **Auto-download Qwen3 4B or 1.7B** on first AI action based on hardware check (GGUF Q4_K_M format, ~2.5GB / ~1GB). Same UX as Whisper download: progress bar, resume on interrupt
+- [ ] **Model selector** in Settings: "Qwen3 4B (fast, no setup)" vs "Qwen3 1.7B (lightweight)" vs Ollama vs OpenAI vs Claude. Default to Qwen3 4B
+- [ ] **Hardware detection on first AI use** — check total system RAM, recommend 4B if ≥8GB free, 1.7B if less. Skip download entirely if machine can't run either (fall back to Ollama/API prompt)
+- [ ] **GPU acceleration** — llama.cpp supports CUDA/Metal/Vulkan. Detect at runtime and enable if available
+- [ ] **For lightweight AI tasks** (filler detection, chapter titles, summarization) the bundled model handles them directly. Only task requiring heavier reasoning (rephrase, smart speed) get the full model
+- [ ] **Remove the Python backend dependency for AI** — once bundled LLM + Whisper.cpp handle everything, no need to ship Python for AI features. One less runtime dependency
+- [ ] GGUF model files are cached in app data dir, same as Whisper models
+
+**Why this wins:**
+- New users open the app, click "Detect filler words" and it just works. No API signup. No Docker. No "install Ollama" README steps.
+- Descript charges $24/mo and still requires internet. CapCut's AI features are cloud-only. TalkEdit gives you local AI with zero setup.
+- The same download-on-demand pattern already works for Whisper models — users understand it.
+- Two size options means it works on everything from a 16GB gaming laptop to an 8GB office machine.
+
+---
+
+## Phase 3: Launch prep
+
+### Marketing assets
+- [ ] Compare page: "TalkEdit vs Descript" — focus on offline, no subscription, long file support
+- [ ] Demo video showing a 2hr podcast cleaned in 3 clicks
+- [ ] Tagline: *"The offline video editor that doesn't slow down on long files"*
+- [ ] Landing page at talked.it with clear pricing (one-time license)
+
+### Distribution
+- [ ] Product Hunt launch
+- [ ] Post in r/podcasting, r/VideoEditing, r/selfhosted
+- [ ] GitHub release with binaries for Windows/macOS/Linux
+- [ ] Offer free licenses to podcasters with public feedback in exchange for testimonials
+
+### Pricing
+- [ ] **Free trial**: 30 days, full features (already done)
+- [ ] **Pro**: $39 one-time — permanent license, includes all current + future features
+- [ ] **Business**: $79 one-time — same but with priority support + volume licensing
+- [ ] No subscriptions, no recurring charges. One purchase = owned forever.
+
+---
+
+## Phase 4: Post-launch
+
+### Retention
+- [ ] In-app changelog on update
+- [ ] Email list for major releases (optional, no account required)
+- [ ] Community templates/sharing for export presets and filler lists
+
+### Growth features
+- [ ] **Sample video download** — "Try without your own media" button downloads a test file + pre-made transcript
+- [ ] **Built-in free music library** — 5-10 CC0 loops shipped with the app
+- [ ] **Export presets** — community-contributed, loaded from a JSON file
+
+---
+
+## Non-goals (explicitly defer)
+
+- Cloud sync / collaboration
+- Voice cloning / TTS
+- Full multi-track NLE timeline (transitions, picture-in-picture, etc.)
+- Mobile app
+- Subscription model
+- Training/fine-tuning models in-app
+- Image/video generation models (Stable Diffusion, etc.) — text-only LLM is sufficient for transcript tasks