This tutorial walks you step by step through building a Whisper speech-transcription system. It supports multilingual transcription, subtitle output, and automatic segmentation, and uses MPS acceleration on Apple M1/M2 devices, making it a good fit for macOS users.
🛠️ Prerequisites
1. Install a Python environment (Python 3.10 or later recommended)
Using pyenv to manage multiple Python versions is recommended:
brew install pyenv
pyenv install 3.10.13
pyenv global 3.10.13
Or confirm that your system already has a suitable version:
python3 --version
2. Create a virtual environment and install packages
python3 -m venv whisper-env
source whisper-env/bin/activate
pip install -U pip setuptools wheel
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install transformers
pip install ffmpeg-python
On macOS, M1/M2 users get MPS (Metal) acceleration automatically; the standard arm64 PyTorch wheels already include MPS support.
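To confirm that PyTorch can actually see the MPS backend before running the full script, a one-line check is enough:
python3 -c "import torch; print(torch.backends.mps.is_available())"
If this prints True, the script below will pick MPS automatically; otherwise it falls back to the CPU.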
3. Install ffmpeg
brew install ffmpeg
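You can verify that both ffmpeg and ffprobe (which the script uses to measure audio duration) are on your PATH:
ffmpeg -version
ffprobe -version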
▶️ How to run
Save the complete script as run_whisper.py, then run it in a terminal:
python run_whisper.py
The script guides you through choosing a processing mode and an audio file, and produces two outputs when it finishes:
xxx_transcribed.txt (plain-text transcript)
xxx.srt (subtitle file)
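For reference, the .srt output follows the standard SubRip layout: a numeric index, a start --> end timestamp pair, then the subtitle text. The timestamps and text below are illustrative only:
1
00:00:00,000 --> 00:01:30,000
First transcribed segment...

2
00:01:25,000 --> 00:02:55,000
Second transcribed segment...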
📄 Complete code
The complete Python script follows:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
import subprocess
import torch
import time
from datetime import timedelta
from difflib import SequenceMatcher
from concurrent.futures import ThreadPoolExecutor, as_completed
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio

# ✅ Device selection: MPS or CPU
device = "mps" if torch.backends.mps.is_available() else "cpu"
device_torch = torch.device(device)
print(f"📟 Using device: {device.upper()}")
# 🌐 Processing-mode menu
print("🌐 Choose a processing mode:")
print("1. Auto-detect language")
print("2. English")
print("3. Chinese")
print("4. Translate to English")
mode = input("Enter an option (1/2/3/4): ").strip()
if mode == "2":
    task = "transcribe"
    language = "en"
elif mode == "3":
    task = "transcribe"
    language = "zh"
elif mode == "4":
    task = "translate"
    language = None  # Whisper's translate task always outputs English
else:
    task = "transcribe"
    language = None  # use Whisper's automatic language detection
# 🔍 Pick an audio file from the current directory
audio_files = [f for f in os.listdir() if f.lower().endswith((".m4a", ".mp3", ".wav"))]
if not audio_files:
    print("❌ No audio files found")
    raise SystemExit
print("🎧 Available audio files:")
for i, name in enumerate(audio_files, 1):
    print(f"{i}. {name}")
idx = int(input("Enter the number of the file to process: ")) - 1
audio_path = audio_files[idx]
output_base = os.path.splitext(audio_path)[0]
# ✅ Chunking parameters (seconds)
chunk_length = 90
overlap = 5

# 🔪 Cut the audio into overlapping chunks
print("🔪 Extracting overlapping chunks with ffmpeg...")
os.makedirs("chunks", exist_ok=True)
duration_cmd = subprocess.run(
    ["ffprobe", "-v", "error", "-show_entries", "format=duration",
     "-of", "default=noprint_wrappers=1:nokey=1", audio_path],
    stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True
)
total_duration = float(duration_cmd.stdout.strip())
segments = []
start = 0
i = 0
while start < total_duration:
    segment_path = f"chunks/chunk_{i:03d}.wav"
    segments.append((segment_path, start))
    # Re-encode each chunk as 16 kHz mono WAV, which is what Whisper expects
    subprocess.run([
        "ffmpeg", "-y", "-i", audio_path,
        "-ss", str(start),
        "-t", str(chunk_length),
        "-ac", "1", "-ar", "16000",
        segment_path
    ], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    start += chunk_length - overlap  # consecutive chunks share `overlap` seconds
    i += 1
# ✅ Whisper processor + model
model_id = "openai/whisper-large-v3-turbo"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device_torch)
# ✅ Transcribe a single chunk
def process_segment(i, segment_path, start_offset):
    waveform, sr = torchaudio.load(segment_path)
    input_data = processor(waveform[0].numpy(), sampling_rate=sr, return_tensors="pt")
    input_features = input_data.input_features.to(device_torch)
    # language/task must be passed to generate(); the processor call would
    # silently ignore them
    gen_kwargs = {"task": task}
    if language is not None:
        gen_kwargs["language"] = language
    generated_ids = model.generate(input_features=input_features, **gen_kwargs)
    text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
    lang = language if language else "auto"
    start_ts = start_offset
    end_ts = min(start_offset + chunk_length, total_duration)
    return (i, start_ts, end_ts, text, lang)

# ✂️ Remove text from the start of b that duplicates the end of a
# (caused by the overlapping chunk boundaries)
def remove_overlap(a, b):
    matcher = SequenceMatcher(None, a, b)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    if match.size > 10 and match.b > 0:
        return b[match.b + match.size:].strip()
    return b
# 🧠 Transcribe chunks in parallel
print(f"🧠 Starting transcription, {len(segments)} chunks in total")
start_time = time.time()
results = []
with ThreadPoolExecutor(max_workers=2 if device == "mps" else os.cpu_count()) as executor:
    futures = []
    for i, (segment_path, start_offset) in enumerate(segments):
        futures.append(executor.submit(process_segment, i, segment_path, start_offset))
    for future in as_completed(futures):
        results.append(future.result())

# Sort by chunk index, then strip each chunk's overlap against the previous
# chunk's text (this must run sequentially, after all chunks are transcribed)
results.sort(key=lambda x: x[0])
final_segments = []
prev_text = ""
for i, start_ts, end_ts, text, lang in results:
    clean_text = remove_overlap(prev_text, text)
    if clean_text:
        final_segments.append((start_ts, end_ts, clean_text, lang))
    prev_text = text
# ✅ Write the TXT transcript
with open(f"{output_base}_transcribed.txt", "w", encoding="utf-8") as f:
    for i, (start, end, text, lang) in enumerate(final_segments, 1):
        f.write(f"--- Segment {i} [{lang}] ---\n{text}\n\n")
# ✅ Write the SRT subtitle file
def format_srt_time(seconds):
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    ms = int((seconds - int(seconds)) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open(f"{output_base}.srt", "w", encoding="utf-8") as f:
    for i, (start, end, text, _) in enumerate(final_segments, 1):
        f.write(f"{i}\n")
        f.write(f"{format_srt_time(start)} --> {format_srt_time(end)}\n")
        f.write(f"{text}\n\n")
# 🧹 Clean up the temporary chunks folder
for f in os.listdir("chunks"):
    os.remove(os.path.join("chunks", f))
os.rmdir("chunks")
print(f"\n✅ Done! Output:\n- {output_base}_transcribed.txt\n- {output_base}.srt")
print(f"⏱ Total time: {timedelta(seconds=int(time.time() - start_time))}")
You can also create a whisper.command (or whisper.sh) file and launch everything with a double-click:
#!/bin/zsh
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(/opt/homebrew/bin/brew shellenv)"
eval "$(pyenv init --path)"
eval "$(pyenv init -)"
cd /Users/cindy/Documents/whisper
# Activate the whisper-env virtual environment created earlier
source whisper-env/bin/activate
python run_whisper.py
echo ""
read -k 1 -s "?✅ Done, press any key to close..."
Save this file as whisper.command and make it executable:
chmod +x whisper.command
After that, double-clicking the file runs the script directly.
📁 Sample output
In this example, a MacBook Air M2 (8 GB RAM) processed a roughly 47-minute meeting recording and produced a transcript and a subtitle file. Transcription took about 3 minutes 8 seconds, with MPS acceleration successfully enabled.
🛠️ FAQ
Q1. Why can't the m4a file be read?
Make sure ffmpeg is installed, or convert the file to wav manually:
ffmpeg -i input.m4a -ac 1 -ar 16000 output.wav
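If you have several m4a files to prepare at once, a small shell loop (assuming zsh or bash, and that the files sit in the current folder) applies the same conversion to each of them:
for f in *.m4a; do ffmpeg -i "$f" -ac 1 -ar 16000 "${f%.m4a}.wav"; done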
Q2. Can speaker identification be added?
This version does not yet integrate speaker diarization, but it could be extended with WhisperX or pyannote.audio; a rough sketch follows.
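As an illustration only, a pyannote.audio integration might look like the sketch below. It assumes pyannote.audio 3.x, a Hugging Face access token, and the pyannote/speaker-diarization-3.1 pipeline; the speaker labels would still need to be matched against Whisper's segment timestamps by hand:

from pyannote.audio import Pipeline

# Load the pretrained diarization pipeline (requires accepting the model's
# terms on Hugging Face and supplying your own access token).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_xxx",  # hypothetical placeholder token
)

# Run diarization on the same 16 kHz mono WAV used for transcription.
diarization = pipeline("audio.wav")

# Each track is a (segment, track_id, speaker) triple; the time ranges can
# be intersected with the Whisper segments to label who said what.
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{segment.start:.1f}s - {segment.end:.1f}s: {speaker}")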
Q3. Can the output be converted to Traditional Chinese?
Yes. After the .txt file is produced, use OpenCC or a manual translation tool to convert Simplified Chinese to Traditional Chinese.
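For example, with the opencc Python package (a minimal sketch; depending on the package variant the config name may need to be "s2t" or "s2t.json", and the file names here are placeholders):

from opencc import OpenCC

cc = OpenCC("s2t")  # Simplified -> Traditional conversion
with open("meeting_transcribed.txt", encoding="utf-8") as f:
    text = f.read()
with open("meeting_transcribed_tw.txt", "w", encoding="utf-8") as f:
    f.write(cc.convert(text))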
Q4. Segments missing after transcription? (2025/06/09 update)
When RAM runs out during transcription, the system silently kills the affected chunk, especially on my machine with only 8 GB. No error is reported; I only discovered the gaps when reviewing the recording afterwards.
The fix is to cut the audio into shorter chunks; about 30 seconds per chunk should avoid the problem, as shown below.
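In the script above, that is a one-line change to the chunking parameters:

# Smaller chunks keep peak memory lower on 8 GB machines.
chunk_length = 30
overlap = 5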
If you found this article helpful, feel free to bookmark it or leave a comment!
Written: 2025-04-16 | Author: KingChang with ChatGPT