[AI] Complete Whisper Speech-to-Text Tutorial: Chinese Translation, SRT Subtitles, Automatic Chunking, and MPS Acceleration

This tutorial walks you step by step through building a Whisper transcription pipeline with multi-language support, subtitle output, and automatic chunking, including MPS acceleration on Apple M1/M2 machines. It is aimed at macOS users.


🛠️ Prerequisites

1. Install Python (3.10 or later recommended)

pyenv is recommended for managing multiple Python versions:

brew install pyenv
pyenv install 3.10.13
pyenv global 3.10.13

Or confirm that your system already has a suitable version:

python3 --version

2. Create a virtual environment and install packages

python3 -m venv whisper-env
source whisper-env/bin/activate
pip install -U pip setuptools wheel
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install transformers
pip install ffmpeg-python

On macOS M1/M2, MPS acceleration (via Metal) is enabled automatically.
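
To confirm the packages installed correctly and that PyTorch can see the MPS backend, a minimal sanity check is:

# Quick sanity check: print installed versions and MPS availability
import torch
import torchaudio
import transformers

print(torch.__version__, torchaudio.__version__, transformers.__version__)
print("MPS available:", torch.backends.mps.is_available())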

3. Install ffmpeg

brew install ffmpeg
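
The script below also calls ffprobe (bundled with ffmpeg) to read the audio duration, so it is worth confirming both commands are on your PATH:

ffmpeg -version
ffprobe -version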

▶️ How to Run

Save the complete script (from the Full Source Code section below) as run_whisper.py, then run it from a terminal:

python run_whisper.py

The script walks you through choosing a processing mode and an audio file, and writes two outputs when it finishes:

  • xxx_transcribed.txt (plain-text transcript)
  • xxx.srt (subtitle file)
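
For reference, the generated .srt follows the standard SubRip layout (entry index, time range, text). With the default 90-second chunks and 5-second overlap, entries look roughly like this (illustrative text):

1
00:00:00,000 --> 00:01:30,000
First transcribed segment...

2
00:01:25,000 --> 00:02:55,000
Second transcribed segment...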

📄 Full Source Code

The complete Python script is below:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import timedelta
from difflib import SequenceMatcher

import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# ✅ Device selection: MPS or CPU
device = "mps" if torch.backends.mps.is_available() else "cpu"
device_torch = torch.device(device)
print(f"📟 Device: {device.upper()}")

# 🌍 Processing-mode menu
print("🌐 Select a processing mode:")
print("1. Auto-detect language")
print("2. English")
print("3. Chinese")
print("4. Translate to English")
mode = input("Enter an option (1/2/3/4): ").strip()
if mode == "2":
    task = "transcribe"
    language = "en"
elif mode == "3":
    task = "transcribe"
    language = "zh"
elif mode == "4":
    task = "translate"
    language = "en"
else:
    task = "transcribe"
    language = None  # let Whisper auto-detect the language

# 🔍 Pick an audio file from the current directory
audio_files = [f for f in os.listdir() if f.lower().endswith((".m4a", ".mp3", ".wav"))]
if not audio_files:
    print("❌ No audio files found")
    raise SystemExit(1)
print("🎧 Available audio files:")
for i, name in enumerate(audio_files, 1):
    print(f"{i}. {name}")
idx = int(input("Enter the number of the file to process: ")) - 1
audio_path = audio_files[idx]
output_base = os.path.splitext(audio_path)[0]

# ✅ Chunking parameters (seconds)
chunk_length = 90
overlap = 5

# 🔪 Split the audio into overlapping chunks with ffmpeg
print("🔪 Extracting overlapping chunks with ffmpeg...")
os.makedirs("chunks", exist_ok=True)
duration_cmd = subprocess.run(
    ["ffprobe", "-v", "error", "-show_entries", "format=duration",
     "-of", "default=noprint_wrappers=1:nokey=1", audio_path],
    stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True
)
total_duration = float(duration_cmd.stdout.strip())
segments = []
start = 0
i = 0
while start < total_duration:
    segment_path = f"chunks/chunk_{i:03d}.wav"
    segments.append((segment_path, start))
    subprocess.run([
        "ffmpeg", "-y", "-i", audio_path, "-ss", str(start), "-t", str(chunk_length),
        "-ac", "1", "-ar", "16000", segment_path  # mono, 16 kHz: Whisper's input format
    ], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    start += chunk_length - overlap
    i += 1

# ✅ Load the Whisper processor + model
model_id = "openai/whisper-large-v3-turbo"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device_torch)

# ✅ Transcribe a single chunk
def process_segment(i, segment_path, start_offset):
    waveform, sr = torchaudio.load(segment_path)
    inputs = processor(waveform[0].numpy(), sampling_rate=sr, return_tensors="pt")
    input_features = inputs.input_features.to(device_torch)
    attention_mask = inputs.get("attention_mask")
    if attention_mask is not None:
        attention_mask = attention_mask.to(device_torch)
    # task/language are generation-time options, so they go to generate(), not the processor
    generated_ids = model.generate(
        input_features,
        attention_mask=attention_mask,
        task=task,
        language=language,  # None lets Whisper auto-detect
    )
    text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
    lang = language if language else "auto"
    return (i, start_offset, start_offset + chunk_length, text, lang)

# Strip text that the overlap duplicated at the start of the next chunk
def remove_overlap(prev, curr):
    matcher = SequenceMatcher(None, prev, curr)
    match = matcher.find_longest_match(0, len(prev), 0, len(curr))
    if match.size > 10 and match.b > 0:
        return curr[match.b + match.size:].strip()
    return curr

# 🧠 Transcribe chunks in parallel
print(f"🧠 Starting transcription, {len(segments)} chunks in total")
start_time = time.time()
results = []
with ThreadPoolExecutor(max_workers=2 if device == "mps" else os.cpu_count()) as executor:
    futures = [
        executor.submit(process_segment, i, segment_path, start_offset)
        for i, (segment_path, start_offset) in enumerate(segments)
    ]
    for future in as_completed(futures):
        results.append(future.result())

# Restore chunk order, then de-duplicate the overlap between neighbouring chunks
results.sort(key=lambda x: x[0])
final_segments = []
prev_text = ""
for i, start_ts, end_ts, text, lang in results:
    clean_text = remove_overlap(prev_text, text)
    if clean_text:
        final_segments.append((start_ts, end_ts, clean_text, lang))
    prev_text = text

# ✅ Write the plain-text transcript
with open(f"{output_base}_transcribed.txt", "w", encoding="utf-8") as f:
    for i, (start, end, text, lang) in enumerate(final_segments, 1):
        f.write(f"--- Segment {i} [{lang}] ---\n{text}\n\n")

# ✅ Write the SRT subtitle file
def format_srt_time(seconds):
    t = timedelta(seconds=int(seconds))
    ms = int((seconds - int(seconds)) * 1000)
    return f"{str(t).zfill(8)},{ms:03d}"  # e.g. 00:01:30,000

with open(f"{output_base}.srt", "w", encoding="utf-8") as f:
    for i, (start, end, text, _) in enumerate(final_segments, 1):
        f.write(f"{i}\n")
        f.write(f"{format_srt_time(start)} --> {format_srt_time(end)}\n")
        f.write(f"{text}\n\n")

# 🧹 Remove the temporary chunk files
for name in os.listdir("chunks"):
    os.remove(os.path.join("chunks", name))
os.rmdir("chunks")

print(f"\n✅ Done! Output:\n- {output_base}_transcribed.txt\n- {output_base}.srt")
print(f"⏱ Total time: {timedelta(seconds=int(time.time() - start_time))}")
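
Note that on the first run, from_pretrained downloads the whisper-large-v3-turbo weights from Hugging Face (a sizable download). You can optionally pre-fetch them once so the transcription run starts immediately; a minimal sketch:

# One-time pre-download of the model weights into the local Hugging Face cache
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "openai/whisper-large-v3-turbo"
WhisperProcessor.from_pretrained(model_id)
WhisperForConditionalGeneration.from_pretrained(model_id)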

You can also create a whisper.command (or whisper.sh) file to run everything with a double-click:

#!/bin/zsh
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(/opt/homebrew/bin/brew shellenv)"
eval "$(pyenv init --path)"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"

# Activate the whisper-env environment (assumes pyenv-virtualenv; if you created
# it with python3 -m venv, use: source whisper-env/bin/activate instead)
pyenv activate whisper-env
cd /Users/cindy/Documents/whisper
python run_whisper.py

echo ""
read -k 1 -s "?✅ Done. Press any key to close..."

Save this file as whisper.command and make it executable:

chmod +x whisper.command

After that, you can launch it with a double-click.

📁 Sample Output

This example used a MacBook Air M2 (8 GB RAM) to process a roughly 47-minute meeting recording and produce a transcript and subtitle file. Transcription took about 3 minutes 8 seconds, with MPS acceleration successfully enabled.

(Screenshot: Whisper transcription running in the terminal)

🛠️ FAQ

Q1. Why can't my m4a file be read?

Make sure ffmpeg is installed, or convert the file to a mono 16 kHz wav manually (the same format the script feeds Whisper):

ffmpeg -i input.m4a -ac 1 -ar 16000 output.wav

Q2. Can speaker identification be added?

This version does not yet integrate speaker diarization, but it can be extended with WhisperX or pyannote.audio.
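
As a starting point, a minimal diarization sketch with pyannote.audio might look like the following; the model requires accepting its license on Hugging Face, and the token and file name here are placeholders:

# pip install pyannote.audio
# Hypothetical sketch: requires a Hugging Face token with access to the model
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # placeholder, use your own token
)
diarization = pipeline("meeting.wav")  # placeholder file name
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")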

Q3. Can the output be converted to Traditional Chinese?

Yes. After the .txt is written, use OpenCC or a manual translation tool to convert Simplified Chinese to Traditional Chinese.
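
For example, with the opencc Python package the conversion is a few lines (the file names below are placeholders):

# pip install opencc-python-reimplemented
from opencc import OpenCC

cc = OpenCC("s2t")  # Simplified-to-Traditional conversion profile
with open("meeting_transcribed.txt", encoding="utf-8") as f:
    simplified = f.read()
with open("meeting_transcribed_tw.txt", "w", encoding="utf-8") as f:
    f.write(cc.convert(simplified))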

Q4. Missing segments after transcription? (2025/06/09 update)

During transcription, the system can kill a segment when RAM runs out; this hit my machine, which has only 8 GB. The failure raises no error at all, and I only discovered it when reviewing the recording against the transcript.

The fix is to cut the audio into shorter chunks; around 30 seconds per chunk should avoid the problem. See the snippet below.
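
In run_whisper.py, that means changing the chunking parameters near the top of the script:

# ✅ Chunking parameters: shorter chunks lower peak memory per segment
chunk_length = 30
overlap = 5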



If you found this article helpful, feel free to bookmark it or leave a comment!

Written: 2025-04-16 | Author: KingChang with ChatGPT
