返回列表
report2026-04-136 分鐘閱讀

Knowledge Pipeline (Project Clone) — Detailed Report

Goal and Motivation

This project builds an automated Chinese video content extraction pipeline, converting all video content from two financial KOLs on Xiaohongshu (XHS) — Jeremy (fundamental analysis / value investing) and Macro Robin (macroeconomics / geopolitics) — into a structured, searchable text database.

The ultimate goal is to extract spoken knowledge from these videos (covering Wall Street investment methodology, Buffett-style value investing, macroeconomic analysis, etc.) into text for subsequent RAG knowledge base construction, semantic search, and content analysis applications.


Technical Architecture

| Layer | Technology | Purpose | |-------|-----------|---------| | Scraper | Playwright, browser-cookie3, Chrome DevTools JS injection | XHS post scraping and video URL extraction | | Video Download | requests + CDN direct connect | Download .mp4 from XHS CDN | | Audio Extraction | FFmpeg 8.1 (Homebrew) | MP4 to mono 16kHz PCM WAV | | Speech-to-Text | MLX Whisper (mlx-community/whisper-base-mlx) | Apple Silicon GPU-accelerated Chinese speech recognition | | Subtitle OCR | Apple Vision Framework (VNRecognizeTextRequest via pyobjc) | Neural Engine hardware-accelerated Chinese OCR | | Text Merging | SequenceMatcher (difflib) | Whisper + OCR intelligent fusion | | Data Storage | SQLite3 (WAL mode) | Structured storage, 4 tables | | Data Export | CSV (UTF-8 BOM) | Directly openable in Excel/Sheets | | Runtime | Python 3.14, M1 Pro (10-core/16GB) | macOS native |

Key Design Decision: Replaced faster-whisper (CPU-only on Mac) with MLX Whisper and PaddleOCR (requiring separate Python 3.12 virtual environment) with Apple Vision. Processing time compressed from estimated 6-9 hours to approximately 1 hour.


Data Processing Pipeline

Complete Flow (5 Steps)

Step 1: FFmpeg Audio Extraction — MP4 to WAV (16kHz mono), parallel processing with 4 workers

Step 2: MLX Whisper Speech Recognition — Apple GPU accelerated at ~50x real-time speed. Output: timestamped text segments as JSON

Step 3: Apple Vision OCR Subtitle Extraction — 2 FPS frame extraction, Neural Engine OCR with deduplication. Jeremy: bottom 30% crop (subtitles at bottom). Robin: full-frame scan (text in center). Output: frame-second aligned text with confidence scores

Step 4: Intelligent Merging — 3-second time tolerance alignment. SequenceMatcher similarity > 0.3 prefers OCR text (OCR corrects Whisper's technical terminology errors). Output: merged segments with whisper_text, ocr_text, and final_text fields

Step 5: CSV Export — Segment-level (one row per speech segment with all three text versions) and full-text level (one row per video with complete transcript)

Database Schema (4 Tables)

  • posts — Video metadata (title, type, status flags)
  • transcript_segments — Whisper speech recognition results
  • ocr_segments — OCR subtitle recognition results
  • merged_transcript — Final merged text

Progress and Data

Processing Status

| Item | Status | |------|--------| | Jeremy scraping | Complete (28/281 post metadata, account-limited) | | Jeremy video download | Complete (158 MP4s + 121 images) | | Jeremy pipeline | All complete (158/158) | | Robin video copy | Complete (75 MP4s) | | Robin pipeline | All complete (75/75) | | Vector DB / RAG | Not yet started |

Key Metrics

| Metric | Jeremy | Robin | Total | |--------|--------|-------|-------| | Videos | 158 | 75 | 233 | | Total Duration | 12.3 hours | 6.0 hours | 18.3 hours | | Raw Video Size | 15 GB | 526 MB | ~15.5 GB | | Whisper Segments | 22,114 | 9,114 | 31,228 | | Whisper Words | 223,000 | 116,000 | 339,000 | | OCR Segments | 23,050 | 18,930 | 41,980 | | OCR Words | 230,000 | 657,000 | 887,000 | | Final Merged Text | 215,000 | 90,000 | 305,000 | | Estimated Tokens | ~320,000 | ~135,000 | ~455,000 | | CSV Rows | 22,115 | 9,115 | 31,230 | | SQLite DB Size | 7.5 MB | 5.0 MB | 12.5 MB | | Extraction Results Size | 1.3 GB | 680 MB | ~2 GB |


Technical Challenges

Challenge 1: XHS Anti-Scraping (Account Abnormal Block)

Using scrape_all_posts_browser.py to call XHS's /api/sns/web/v1/user_posted API triggered an "Account abnormal" error, blocking API pagination and returning 404 on individual post pages.

Attempted: API scripts from Jeremy's homepage, headless Playwright, JS fetch with xsec_token, programmatic .click() triggering.

Final Solution: Chrome browser automation directly on XHS pages. Scroll-collected 281 post IDs, then physically clicked each post card via MCP browser tools, extracting complete data from __INITIAL_STATE__. Successfully extracted 7 posts' complete data before the user completed all video downloads manually.

Challenge 2: Playwright Cannot Penetrate VPN

Playwright's headless Chromium could not connect to XHS (ERR_CERT_AUTHORITY_INVALID, ERR_TIMED_OUT) because the user's VPN only applied to system Chrome, not Playwright's standalone browser.

Solution: Abandoned Playwright, used Chrome MCP extension to operate directly in the user's Chrome browser.

Challenge 3: Apple Vision OCR CGImage Initialization

Using VNImageRequestHandler.initWithCGImage_options_() returned 0 results even with clear text in images.

Diagnosis: CGImageForProposedRect_context_hints_ returns CGImage in a format not fully compatible with Vision framework.

Solution: Switched to VNImageRequestHandler.initWithURL_options_() passing file URL directly. OCR immediately worked with confidence scores reaching 1.0.

Challenge 4: FFmpeg Percent Character in Filenames

3 videos had % or + in filenames (e.g., content about "Buffett earning 10%"). FFmpeg misinterpreted % in output paths as format specifiers.

Solution: Sanitized temporary frame directory names in ocr_vision.py (% to _pct_, + to _plus_).

Challenge 5: INSERT OR REPLACE Overwriting Status Flags

Pipeline steps all called insert_post() to register videos, but INSERT OR REPLACE reset previously set transcribed, ocr_done flags to default 0.

Solution: Changed to INSERT OR IGNORE, manually repaired corrupted DB state flags, re-executed merge and export steps.

Challenge 6: Different Subtitle Positions Between KOLs

Jeremy's subtitles appear in the bottom 30% of the screen; Robin's text appears in the center. Bottom-crop optimization for Jeremy caused Robin's OCR to return 0 segments.

Solution: Stopped in-progress pipeline, modified Robin's extract_frames() to full-frame scan (removed crop parameter), deleted blank OCR results, reset DB flags, re-executed OCR.


Most Valuable Technical Achievements

Apple Silicon Hardware-Accelerated Pipeline

The project's core innovation: replacing generic cross-platform solutions with macOS-specific hardware acceleration:

  • MLX Whisper (GPU) replacing faster-whisper (CPU): Speech recognition from ~1x real-time to ~50x real-time
  • Apple Vision (Neural Engine) replacing PaddleOCR (CPU): No extra Python environment, direct system API calls

Result: 158 videos (12.3 hours of content) processed in ~1.5 hours instead of estimated 6-9 hours — a 4-6x speedup.

Whisper + OCR Intelligent Merging

Not a simple either/or selection but time-aligned fusion (3-second tolerance) with SequenceMatcher similarity matching, letting OCR correct Whisper's technical terminology errors:

| Whisper (Speech) | OCR (Subtitle) | Final Result | |-----------------|----------------|--------------| | "gu dong xing" (homophone error) | Shareholder Letter | Shareholder Letter | | "bo ke xi er ha si wei" (transliteration error) | Berkshire Hathaway | Berkshire Hathaway |

This is particularly valuable for financial domain terminology where Whisper's homophone errors are effectively corrected by OCR.

Replicable Pipeline Architecture

The entire pipeline/ module is designed to be portable — just copy the folder, modify config naming, and deploy a complete processing pipeline for a new KOL in minutes. Validated in practice: expanding from Jeremy to Robin took only 5 minutes setup + 1 hour automated processing.


Report Date: April 13, 2026