Knowledge Pipeline (Project Clone) — Detailed Report
Goal and Motivation
This project builds an automated Chinese video content extraction pipeline, converting all video content from two financial KOLs on Xiaohongshu (XHS) — Jeremy (fundamental analysis / value investing) and Macro Robin (macroeconomics / geopolitics) — into a structured, searchable text database.
The ultimate goal is to extract spoken knowledge from these videos (covering Wall Street investment methodology, Buffett-style value investing, macroeconomic analysis, etc.) into text for subsequent RAG knowledge base construction, semantic search, and content analysis applications.
Technical Architecture
| Layer | Technology | Purpose | |-------|-----------|---------| | Scraper | Playwright, browser-cookie3, Chrome DevTools JS injection | XHS post scraping and video URL extraction | | Video Download | requests + CDN direct connect | Download .mp4 from XHS CDN | | Audio Extraction | FFmpeg 8.1 (Homebrew) | MP4 to mono 16kHz PCM WAV | | Speech-to-Text | MLX Whisper (mlx-community/whisper-base-mlx) | Apple Silicon GPU-accelerated Chinese speech recognition | | Subtitle OCR | Apple Vision Framework (VNRecognizeTextRequest via pyobjc) | Neural Engine hardware-accelerated Chinese OCR | | Text Merging | SequenceMatcher (difflib) | Whisper + OCR intelligent fusion | | Data Storage | SQLite3 (WAL mode) | Structured storage, 4 tables | | Data Export | CSV (UTF-8 BOM) | Directly openable in Excel/Sheets | | Runtime | Python 3.14, M1 Pro (10-core/16GB) | macOS native |
Key Design Decision: Replaced faster-whisper (CPU-only on Mac) with MLX Whisper and PaddleOCR (requiring separate Python 3.12 virtual environment) with Apple Vision. Processing time compressed from estimated 6-9 hours to approximately 1 hour.
Data Processing Pipeline
Complete Flow (5 Steps)
Step 1: FFmpeg Audio Extraction — MP4 to WAV (16kHz mono), parallel processing with 4 workers
Step 2: MLX Whisper Speech Recognition — Apple GPU accelerated at ~50x real-time speed. Output: timestamped text segments as JSON
Step 3: Apple Vision OCR Subtitle Extraction — 2 FPS frame extraction, Neural Engine OCR with deduplication. Jeremy: bottom 30% crop (subtitles at bottom). Robin: full-frame scan (text in center). Output: frame-second aligned text with confidence scores
Step 4: Intelligent Merging — 3-second time tolerance alignment. SequenceMatcher similarity > 0.3 prefers OCR text (OCR corrects Whisper's technical terminology errors). Output: merged segments with whisper_text, ocr_text, and final_text fields
Step 5: CSV Export — Segment-level (one row per speech segment with all three text versions) and full-text level (one row per video with complete transcript)
Database Schema (4 Tables)
posts— Video metadata (title, type, status flags)transcript_segments— Whisper speech recognition resultsocr_segments— OCR subtitle recognition resultsmerged_transcript— Final merged text
Progress and Data
Processing Status
| Item | Status | |------|--------| | Jeremy scraping | Complete (28/281 post metadata, account-limited) | | Jeremy video download | Complete (158 MP4s + 121 images) | | Jeremy pipeline | All complete (158/158) | | Robin video copy | Complete (75 MP4s) | | Robin pipeline | All complete (75/75) | | Vector DB / RAG | Not yet started |
Key Metrics
| Metric | Jeremy | Robin | Total | |--------|--------|-------|-------| | Videos | 158 | 75 | 233 | | Total Duration | 12.3 hours | 6.0 hours | 18.3 hours | | Raw Video Size | 15 GB | 526 MB | ~15.5 GB | | Whisper Segments | 22,114 | 9,114 | 31,228 | | Whisper Words | 223,000 | 116,000 | 339,000 | | OCR Segments | 23,050 | 18,930 | 41,980 | | OCR Words | 230,000 | 657,000 | 887,000 | | Final Merged Text | 215,000 | 90,000 | 305,000 | | Estimated Tokens | ~320,000 | ~135,000 | ~455,000 | | CSV Rows | 22,115 | 9,115 | 31,230 | | SQLite DB Size | 7.5 MB | 5.0 MB | 12.5 MB | | Extraction Results Size | 1.3 GB | 680 MB | ~2 GB |
Technical Challenges
Challenge 1: XHS Anti-Scraping (Account Abnormal Block)
Using scrape_all_posts_browser.py to call XHS's /api/sns/web/v1/user_posted API triggered an "Account abnormal" error, blocking API pagination and returning 404 on individual post pages.
Attempted: API scripts from Jeremy's homepage, headless Playwright, JS fetch with xsec_token, programmatic .click() triggering.
Final Solution: Chrome browser automation directly on XHS pages. Scroll-collected 281 post IDs, then physically clicked each post card via MCP browser tools, extracting complete data from __INITIAL_STATE__. Successfully extracted 7 posts' complete data before the user completed all video downloads manually.
Challenge 2: Playwright Cannot Penetrate VPN
Playwright's headless Chromium could not connect to XHS (ERR_CERT_AUTHORITY_INVALID, ERR_TIMED_OUT) because the user's VPN only applied to system Chrome, not Playwright's standalone browser.
Solution: Abandoned Playwright, used Chrome MCP extension to operate directly in the user's Chrome browser.
Challenge 3: Apple Vision OCR CGImage Initialization
Using VNImageRequestHandler.initWithCGImage_options_() returned 0 results even with clear text in images.
Diagnosis: CGImageForProposedRect_context_hints_ returns CGImage in a format not fully compatible with Vision framework.
Solution: Switched to VNImageRequestHandler.initWithURL_options_() passing file URL directly. OCR immediately worked with confidence scores reaching 1.0.
Challenge 4: FFmpeg Percent Character in Filenames
3 videos had % or + in filenames (e.g., content about "Buffett earning 10%"). FFmpeg misinterpreted % in output paths as format specifiers.
Solution: Sanitized temporary frame directory names in ocr_vision.py (% to _pct_, + to _plus_).
Challenge 5: INSERT OR REPLACE Overwriting Status Flags
Pipeline steps all called insert_post() to register videos, but INSERT OR REPLACE reset previously set transcribed, ocr_done flags to default 0.
Solution: Changed to INSERT OR IGNORE, manually repaired corrupted DB state flags, re-executed merge and export steps.
Challenge 6: Different Subtitle Positions Between KOLs
Jeremy's subtitles appear in the bottom 30% of the screen; Robin's text appears in the center. Bottom-crop optimization for Jeremy caused Robin's OCR to return 0 segments.
Solution: Stopped in-progress pipeline, modified Robin's extract_frames() to full-frame scan (removed crop parameter), deleted blank OCR results, reset DB flags, re-executed OCR.
Most Valuable Technical Achievements
Apple Silicon Hardware-Accelerated Pipeline
The project's core innovation: replacing generic cross-platform solutions with macOS-specific hardware acceleration:
- MLX Whisper (GPU) replacing faster-whisper (CPU): Speech recognition from ~1x real-time to ~50x real-time
- Apple Vision (Neural Engine) replacing PaddleOCR (CPU): No extra Python environment, direct system API calls
Result: 158 videos (12.3 hours of content) processed in ~1.5 hours instead of estimated 6-9 hours — a 4-6x speedup.
Whisper + OCR Intelligent Merging
Not a simple either/or selection but time-aligned fusion (3-second tolerance) with SequenceMatcher similarity matching, letting OCR correct Whisper's technical terminology errors:
| Whisper (Speech) | OCR (Subtitle) | Final Result | |-----------------|----------------|--------------| | "gu dong xing" (homophone error) | Shareholder Letter | Shareholder Letter | | "bo ke xi er ha si wei" (transliteration error) | Berkshire Hathaway | Berkshire Hathaway |
This is particularly valuable for financial domain terminology where Whisper's homophone errors are effectively corrected by OCR.
Replicable Pipeline Architecture
The entire pipeline/ module is designed to be portable — just copy the folder, modify config naming, and deploy a complete processing pipeline for a new KOL in minutes. Validated in practice: expanding from Jeremy to Robin took only 5 minutes setup + 1 hour automated processing.
Report Date: April 13, 2026