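I left my last company not long ago and am a free agent again. My previous job changes were almost seamless, which left little room to think, so this time I decided to slow down, think things over, and tinker with whatever I find fun. I bought a NAS and found that the IT skills from work were finally useful at home. The first use was Chinese subtitles for movies.
The first thing I did with the NAS was download 4K movies like crazy. They all ship with subtitles, but some have no Chinese track, or the translation is poor. On top of that, the software on my NAS is incomplete and downloading Chinese subtitles is a hassle, so I wanted an automated solution. After some evaluation I figured that AI such as ChatGPT or Gemini could translate the English subtitles with decent results.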
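Managing the project with Poetry

I haven't done much Python in recent years, but I had seen several projects using poetry, so I decided to use it here. It feels very good in practice, far better than pipenv, which I used before.
My pyproject.toml looks like this: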
    [tool.poetry]
    name = "upbox"
    version = "0.1.0"
    description = ""
    authors = ["rocksun <daijun@gmail.com>"]
    readme = "README.md"

    [tool.poetry.dependencies]
    python = "^3.10"
    ffmpeg-python = "^0.2.0"
    llama-index = "^0.10.25"
    llama-index-llms-gemini = "^0.1.6"
    pysubs2 = "^1.6.1"
    # yt-dlp = "^2024.4.9"
    # typer = "^0.12.3"
    # faster-whisper = "^1.0.1"

    [build-system]
    requires = ["poetry-core"]
    build-backend = "poetry.core.masonry.api"
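I won't go into how to use poetry here; you can pick that up on your own. The dependencies are a wrapper library for ffmpeg (the ffmpeg command must be on the PATH), then llama-index with its Gemini integration, and finally the subtitle library pysubs2. Whether you use llama-index makes little difference, since this article barely touches its features. I considered parsing subtitles by hand, but pysubs2 turned out to save a lot of time.

Extracting the English subtitles

Extracting embedded subtitles from a video with ffmpeg is easy. Just run: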
ffmpeg -i my_file.mkv outfile.vtt
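In practice, though, a video often contains several subtitle streams, so this command does not reliably pick the right one and the stream has to be checked first. I chose to do that through an ffmpeg library, ffmpeg-python. The code that extracts the English subtitles with it is: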
    import ffmpeg  # provided by the ffmpeg-python package


    def _guess_eng_subtitle_index(video_path):
        """Return the index of the first English subtitle stream, or -1."""
        streams = ffmpeg.probe(video_path)['streams']
        # First pass: trust the language tag.
        for index, stream in enumerate(streams):
            if (stream.get('codec_type') == 'subtitle'
                    and stream.get('tags', {}).get('language') == 'eng'):
                return index
        # Second pass: fall back to a title that mentions "english".
        for index, stream in enumerate(streams):
            if (stream.get('codec_type') == 'subtitle'
                    and 'english' in stream.get('tags', {}).get('title', '').lower()):
                return index
        return -1


    def _extract_subtitle_by_index(video_path, output_path, index):
        # Equivalent to: ffmpeg -i video_path -map 0:<index> output_path
        return ffmpeg.input(video_path).output(output_path, map='0:' + str(index)).run()


    def extract_subtitle(video_path, en_subtitle_path):
        # Probe the streams first, then extract the one that looks English.
        index = _guess_eng_subtitle_index(video_path)
        if index == -1:
            return -1
        return _extract_subtitle_by_index(video_path, en_subtitle_path, index)
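I added _guess_eng_subtitle_index to determine the index of the English subtitle stream. Most videos have reasonably well-formed subtitle tags, but some have no tags at all, so the code can only guess. I expect other cases to turn up in practice, and they will have to be handled as they appear.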
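The guessing logic can be tried out without a video file by running it on a plain list of dicts shaped like ffprobe's "streams" output. The sketch below is not the code from this project: pick_english_subtitle and sample_streams are hypothetical names, and the third rule (a single untagged subtitle stream is assumed to be the right one) is an extra heuristic that may or may not suit your files.

```python
def pick_english_subtitle(streams):
    """Return the ffprobe 'index' of the most likely English subtitle stream, or -1."""
    subtitles = [s for s in streams if s.get("codec_type") == "subtitle"]
    # Rule 1: the language tag says English.
    for s in subtitles:
        if s.get("tags", {}).get("language", "").lower() in ("eng", "en"):
            return s["index"]
    # Rule 2: the title mentions English.
    for s in subtitles:
        if "english" in s.get("tags", {}).get("title", "").lower():
            return s["index"]
    # Rule 3 (assumption): with exactly one subtitle stream, take it.
    if len(subtitles) == 1:
        return subtitles[0]["index"]
    return -1


# Hypothetical probe result, for illustration only.
sample_streams = [
    {"index": 0, "codec_type": "video"},
    {"index": 1, "codec_type": "audio", "tags": {"language": "eng"}},
    {"index": 2, "codec_type": "subtitle", "tags": {"title": "Chinese"}},
    {"index": 3, "codec_type": "subtitle", "tags": {"title": "English (SDH)"}},
]
print(pick_english_subtitle(sample_streams))
```

Keeping the rules in a function that takes the stream list makes it easy to add a new rule whenever a file with unusual tags shows up.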
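Processing the English subtitles

At first I thought I could simply throw the subtitles at Gemini and save the result, but that does not work. There are several problems:

- Many English subtitles contain lots of markup tags, which hurts translation quality.
- A whole subtitle file is too large for Gemini to handle in one go, and an overly long context tends to cause problems anyway.
- The timestamps in the subtitles are long and inflate the prompt.

So I added a subtitle class, UpSubs, to deal with these problems: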
    import re

    import pysubs2
    from pysubs2 import SSAEvent, SSAFile


    class UpSubs:
        def __init__(self, subs_path):
            self.subs = pysubs2.load(subs_path)

        def get_subtitle_text(self):
            text = ""
            for sub in self.subs:
                text += sub.text + "\n\n"
            return text

        def get_subtitle_text_with_index(self):
            # One "chunk-<i>:" header per cue, without timestamps.
            text = ""
            for i, sub in enumerate(self.subs):
                text += "chunk-" + str(i) + ":\n" + sub.text.replace("\\N", " ") + "\n\n"
            return text

        def save(self, output_path):
            self.subs.save(output_path)

        def clean(self):
            for sub in self.subs:
                # Remove XML-style tags and in-cue line breaks.
                sub.text = re.sub(r"<[^>]+>", "", sub.text)
                sub.text = sub.text.replace("\\N", " ")

        def fill(self, text):
            # Write translated "chunk-<i>:" paragraphs back into the cues.
            text = text.strip()
            # Paragraphs are separated by a blank line.
            paragraphs = re.split(r"\n\s*\n", text)
            for para in paragraphs:
                try:
                    lines = para.split("\n")
                    firstline = lines[0]
                    # "chunk-12:" -> 12
                    index = int(firstline[6:len(firstline) - 1])
                    self.subs[index].text = "\n".join(lines[1:])
                except Exception as e:
                    print(f"Error merge paragraph : \n {para} \n with exception: \n {e}")
                    raise

        def merge_dual(self, subspath):
            # Merge cue by cue with a second subtitle file; returns None
            # when the two files do not have the same number of cues.
            second_subs = pysubs2.load(subspath)
            if len(self.subs.events) != len(second_subs.events):
                return None
            merged_subs = SSAFile()
            for i, first_event in enumerate(self.subs.events):
                second_event = second_subs[i]
                if first_event.text == second_event.text:
                    text = first_event.text
                else:
                    text = first_event.text + '\n' + second_event.text
                merged_subs.append(SSAEvent(start=first_event.start, end=first_event.end, text=text))
            return merged_subs
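The clean method does a simple cleanup of the subtitles, save writes them to disk, and merge_dual merges two files into bilingual subtitles. These are all straightforward, so the rest of this section focuses on how the subtitle text is handled.
The original srt file looks like this: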
    12
    00:02:30,776 --> 00:02:34,780
    Not even the great Dragon Warrior.

    13
    00:02:43,830 --> 00:02:45,749
    Oh, where is Po?

    14
    00:02:45,749 --> 00:02:48,502
    He was supposed to be here hours ago.
After get_subtitle_text_with_index it becomes roughly:
    chunk-12:
    Not even the great Dragon Warrior.

    chunk-13:
    Oh, where is Po?

    chunk-14:
    He was supposed to be here hours ago.
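This cuts the amount of text, and with it the number of chunks. The number of each cue is still tracked, so the fill method can restore the subtitles from the translated text.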
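The round trip can be sketched without pysubs2, using a plain list of strings in place of the cues. This is an illustration rather than the UpSubs code itself: to_indexed_text, fill_from_text and CHUNK_RE are hypothetical names, and unlike fill it quietly skips paragraphs it cannot parse instead of raising.

```python
import re


def to_indexed_text(cues):
    """Turn a list of cue texts into 'chunk-<i>:' paragraphs without timestamps."""
    return "".join(f"chunk-{i}:\n{text}\n\n" for i, text in enumerate(cues))


# Header line "chunk-<i>:" followed by the cue text.
CHUNK_RE = re.compile(r"^chunk-(\d+):?\s*\n(.*)$", re.DOTALL)


def fill_from_text(cues, translated):
    """Return a copy of cues with the translated paragraphs written back by index."""
    result = list(cues)
    for para in re.split(r"\n\s*\n", translated.strip()):
        match = CHUNK_RE.match(para.strip())
        if match is None:
            continue  # not a chunk paragraph, leave the original cue alone
        index = int(match.group(1))
        if 0 <= index < len(result):
            result[index] = match.group(2).strip()
    return result


cues = ["Oh, where is Po?", "He was supposed to be here hours ago."]
indexed = to_indexed_text(cues)
# Stand-in for a model reply; no API call is made here.
reply = "chunk-0:\n哦,阿宝在哪儿?\n\nchunk-1:\n他几个小时前就该到了。"
print(fill_from_text(cues, reply))
```

Whether to skip or to fail on a malformed paragraph is a design choice: skipping leaves some cues untranslated, while failing makes a bad reply visible immediately.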
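Calling Gemini

Calling Gemini raises a few issues:

- It needs an API key.
- Access from mainland China has to go through a suitable proxy.
- The call needs some fault tolerance.
- Gemini's safety filters have to be worked around.

To deal with these I wrote a dedicated complete function: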
    import traceback

    from llama_index.llms.gemini import Gemini


    def complete(prompt, max_tokens=32760):
        prompt = prompt.strip()
        if not prompt:
            return ""
        # Ask Gemini not to block any of the harm categories.
        safety_settings = [
            {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
            {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
            {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
            {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
        ]
        retries = 3
        for _ in range(retries):
            try:
                return Gemini(max_tokens=max_tokens, safety_settings=safety_settings,
                              temperature=0.01).complete(prompt).text
            except Exception:
                print(f"Error completing prompt: {prompt}\n with error:")
                traceback.print_exc()
        # Every attempt failed.
        return ""
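safety_settings matters a lot: movie subtitles often contain strong or sensitive language, so Gemini has to be told to tolerate as much as possible. The documentation reportedly says that only paid accounts can use BLOCK_NONE, but with the configuration above I did not run into many problems translating movies. Blocks did happen occasionally, and they went away on retry.
Then there are the 3 retries. Calls fail now and then, and retrying solves some of those failures.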
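The retry idea can be pulled out into a small helper and exercised with a fake flaky function. This is a sketch under assumptions, not what complete does: call_with_retries and flaky are hypothetical names, the exponential delay between attempts is an addition, and the helper raises after the last failure instead of returning an empty string.

```python
import time


def call_with_retries(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() up to `retries` times, waiting base_delay * 2**attempt between tries."""
    last_error = None
    for attempt in range(retries):
        try:
            return fn()
        except Exception as e:
            last_error = e
            if attempt < retries - 1:
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"gave up after {retries} attempts") from last_error


calls = {"n": 0}


def flaky():
    """Fail twice, then succeed, to imitate an occasionally failing API."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(call_with_retries(flaky, base_delay=0.0))
```

Raising on final failure means a chunk that could not be translated stops the run instead of being silently dropped. Returning an empty string, as complete does, is the other option when partial output is acceptable.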
Finally, an API key can be obtained from Google AI Studio. Then add a .env file to the project:
With that in place, the program can read the API key and the proxy settings.
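The contents of the .env file are not reproduced here, so the following is only a sketch of how such a file could be read with the standard library. The variable names GOOGLE_API_KEY and HTTPS_PROXY, the placeholder values, and the load_env_file helper are all assumptions for illustration. Whether a given client honors the proxy variable depends on the library, and a package such as python-dotenv would normally do this job.

```python
import os
import tempfile


def load_env_file(path=".env", environ=os.environ):
    """Copy KEY=VALUE lines from a .env-style file into environ; existing keys win."""
    loaded = {}
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            # Skip blanks, comments and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key = key.strip()
            value = value.strip().strip('"').strip("'")
            loaded[key] = value
            environ.setdefault(key, value)
    return loaded


# Demo with a temporary file and a plain dict instead of the real environment.
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    with open(env_path, "w", encoding="utf-8") as f:
        f.write("# placeholder values, not real credentials\n")
        f.write('GOOGLE_API_KEY="your-api-key"\n')
        f.write("HTTPS_PROXY=http://127.0.0.1:7890\n")
    fake_environ = {}
    print(load_env_file(env_path, environ=fake_environ))
```

Keeping the key in .env rather than in the source keeps it out of version control, as long as .env itself is ignored.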
The call flow

Start with the outermost function, tran_subtitles:
    import os


    def tran_subtitles(fixed_subtitle, zh_subtitle=None, cncf=False, chunk_size=3000):
        # Strip two extensions, e.g. "movie.en.vtt" -> "movie.en" -> "movie".
        subtitle_base = os.path.splitext(fixed_subtitle)[0]
        video_base = os.path.splitext(subtitle_base)[0]
        if zh_subtitle is None:
            zh_subtitle = video_base + ".zh-fixed.vtt"
        # Skip files that were already translated.
        if os.path.exists(zh_subtitle):
            print(f"zh subtitle {zh_subtitle} already translated, skip to translate.")
            return 1
        prompt_tpl = MOVIE_TRAN_PROMPT_TPL
        opts = {}
        srtp = UpSubs(fixed_subtitle)
        text = srtp.get_subtitle_text_with_index()
        process_text(srtp, text, prompt_tpl, opts, chunk_size=chunk_size)
        srtp.save(zh_subtitle)
        return zh_subtitle
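The logic is simple: load the English subtitles, convert them to the text to be translated with get_subtitle_text_with_index, then run process_text to do the translation. The prompt template prompt_tpl refers directly to MOVIE_TRAN_PROMPT_TPL, which contains: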
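    # In English: "You are a professional movie subtitle translator. You need to
    # translate an English subtitle file into Chinese.
    # [English subtitles to translate]: {content}  # [Chinese subtitles]:"
    MOVIE_TRAN_PROMPT_TPL = """你是个专业电影字幕翻译,你需要将一份英文字幕翻译成中文。
    [需要翻译的英文字幕]:
    {content}
    # [中文字幕]:"""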
Next, take a look at process_text:
    def process_text(subs, text, prompt_tpl, opts, chunk_size=2500):
        chunks = _split_subtitles(text, chunk_size)
        for i, chunk in enumerate(chunks):
            print("process chunk ({}/{})".format(i + 1, len(chunks)))
            # Fill the template from the fields in the opts dict.
            opts["content"] = chunk
            prompt = prompt_tpl.format(**opts)
            print(prompt)
            out = complete(prompt, max_tokens=32760)
            # Write the translated chunk back into the subtitle cues.
            subs.fill(out)
            print(out)
_split_subtitles splits the subtitle text into multiple chunks, and each chunk is then handed to the complete function described earlier.
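The body of _split_subtitles is not shown in this article. One way such a splitter could work is sketched below, assuming chunk_size counts characters and that chunks should only break on the blank line between two "chunk-<i>:" paragraphs, so that no cue is cut in half. split_into_chunks is a hypothetical name, not the project's function.

```python
def split_into_chunks(text, chunk_size):
    """Group blank-line-separated paragraphs into chunks of at most chunk_size characters.

    A single paragraph longer than chunk_size is kept whole in its own chunk.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = para + "\n\n"
        # Start a new chunk when adding this paragraph would exceed the limit.
        if current and len(current) + len(candidate) > chunk_size:
            chunks.append(current)
            current = ""
        current += candidate
    if current:
        chunks.append(current)
    return chunks


# Five short paragraphs in the indexed format used for translation.
demo_text = "".join(f"chunk-{i}:\nline {i}\n\n" for i in range(5))
demo_chunks = split_into_chunks(demo_text, 40)
for chunk in demo_chunks:
    print(repr(chunk))
```

Breaking only on paragraph boundaries keeps every "chunk-<i>:" header together with its text, which is what lets the reply be mapped back to cues by index.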
English input:

    10
    00:02:22,184 --> 00:02:27,606
    Let it be known from the highest mountain
    to the lowest valley that Tai Lung lives,

    11
    00:02:27,606 --> 00:02:30,776
    and no one will stand in his way.

    12
    00:02:30,776 --> 00:02:34,780
    Not even the great Dragon Warrior.

    13
    00:02:43,830 --> 00:02:45,749
    Oh, where is Po?

Translated output:

    10
    00:02:22,184 --> 00:02:27,606
    让最高的山峰和最低的山谷都知道,泰隆还活着,

    11
    00:02:27,606 --> 00:02:30,776
    没人能阻挡他。

    12
    00:02:30,776 --> 00:02:34,780
    纵然是伟大的神龙大侠也不行。

    13
    00:02:43,830 --> 00:02:45,749
    哦,阿宝在哪儿?
The result was surprisingly good. My prompt gave no extra context, yet Gemini produced an idiomatic translation.