First Try with PDFMathTranslate

Background

Project repository:

https://github.com/Byaidu/PDFMathTranslate

Windows installation (home PC)

This was on my ordinary home desktop, running Windows 11 LTSC 2024.

Install:

pip install pdf2zh

Launch the GUI:

pdf2zh -i

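Everything went smoothly, apart from needing a working global proxy (VPN) for unrestricted internet access.

Windows installation (company laptop)

Note: the company laptop may have some special restrictions, which caused many problems along the way. I am recording them here for reference only.

The operating system is Windows 11 23H2.

Installing Python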

https://skyao.io/post/202408-marker-setup-on-windows/

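I installed Python following my marker setup notes (linked above). To keep the Python version consistent, I initially chose Python 3.10, the same version I use for marker.

Installing PDFMathTranslate

Installation is simple: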

pip install pdf2zh

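pdf2zh installed without trouble, but I then ran into a string of problems when using it. They are recorded below for now.

typing_extensions error

The first was a typing_extensions error: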

ImportError: cannot import name 'TypeIs' from 'typing_extensions'

Later I uninstalled typing_extensions and installed it again, and the error inexplicably went away.

pip uninstall typing_extensions
pip install typing_extensions
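
A quick way to confirm that the reinstall really took effect is to check whether the installed typing_extensions exposes TypeIs. The snippet below is only a sketch; the helper name has_attribute is made up for this note and is not part of pdf2zh or typing_extensions.

```python
import importlib


def has_attribute(module_name: str, attribute: str) -> bool:
    """Return True if the module imports cleanly and exposes the attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module is missing or fails to import at all.
        return False
    return hasattr(module, attribute)


if __name__ == "__main__":
    # False here matches the condition behind the ImportError shown above.
    print("TypeIs available:", has_attribute("typing_extensions", "TypeIs"))
```

If this prints False after reinstalling, the interpreter is probably still picking up an older copy of the package.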

onnx error

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\sky\AppData\Local\Programs\Python\Python311\Scripts\pdf2zh.exe\__main__.py", line 4, in <module>
  File "C:\Users\sky\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdf2zh\__init__.py", line 2, in <module>
    from pdf2zh.high_level import translate, translate_stream
  File "C:\Users\sky\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdf2zh\high_level.py", line 15, in <module>
    from pdf2zh.doclayout import DocLayoutModel
  File "C:\Users\sky\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdf2zh\doclayout.py", line 5, in <module>
    import onnx
  File "C:\Users\sky\AppData\Local\Programs\Python\Python311\Lib\site-packages\onnx\__init__.py", line 77, in <module>
    from onnx.onnx_cpp2py_export import ONNX_ML
ImportError: DLL load failed while importing onnx_cpp2py_export: A dynamic link library (DLL) initialization routine failed.

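A Google search suggested downgrading onnx.

The onnx releases are listed here: https://github.com/onnx/onnx/releases

The default install pulled the latest version, v1.17.0. Downgrading to v1.16.2 still produced the error; downgrading to v1.16.1 made it go away.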

pip uninstall onnx
pip install onnx==1.16.1
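
To see which onnx version an environment actually has, and whether it is newer than the one that worked for me, something like the following can help. It is a minimal sketch: the helper names are made up for this note, and the version comparison assumes plain numeric versions such as 1.16.1.

```python
from importlib import metadata
from typing import Optional

# The version that stopped the DLL error in the text above.
KNOWN_GOOD_ONNX = "1.16.1"


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version: str) -> tuple:
    """Turn '1.16.1' into (1, 16, 1); assumes purely numeric dotted parts."""
    return tuple(int(part) for part in version.split("."))


def is_newer_than_known_good(version: str) -> bool:
    """True if the version is above the one that worked in this post."""
    return version_tuple(version) > version_tuple(KNOWN_GOOD_ONNX)


if __name__ == "__main__":
    found = installed_version("onnx")
    if found is None:
        print("onnx is not installed")
    else:
        print("onnx", found, "newer than known good:", is_newer_than_known_good(found))
```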

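I later filed an issue, which confirmed that this problem does occur in some situations and that downgrading to v1.16.1 is currently the most convenient workaround.

See: https://github.com/Byaidu/PDFMathTranslate/issues/423

huggingface_hub error

Next, huggingface_hub raised an error: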

......
  File "C:\Users\xxxx\AppData\Local\Programs\Python\Python310\lib\site-packages\huggingface_hub\file_download.py", line 301, in _request_wrapper
    response = get_session().request(method=method, url=url, **params)
  File "C:\Users\xxxx\AppData\Local\Programs\Python\Python310\lib\site-packages\requests\sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\xxxx\AppData\Local\Programs\Python\Python310\lib\site-packages\requests\sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\xxxx\AppData\Local\Programs\Python\Python310\lib\site-packages\huggingface_hub\utils\_http.py", line 93, in send
    return super().send(request, *args, **kwargs)
  File "C:\Users\xxxx\AppData\Local\Programs\Python\Python310\lib\site-packages\requests\adapters.py", line 698, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: (MaxRetryError("HTTPSConnectionPool(host='cdn-lfs-us-1.hf.co', port=443): Max retries exceeded with url: /repos/f5/94/f594dea68dc4fa80d9460b7731310af7a671baf0a48e1186d37a2fab95e2db7e/fece9af02f618b603ff7921ccec6861d13e7e1f9830e091dfb7e8ad9311e5b21?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27doclayout_yolo_docstructbench_imgsz1024.onnx%3B+filename%3D%22doclayout_yolo_docstructbench_imgsz1024.onnx%22%3B&Expires=1736472113&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTczNjQ3MjExM319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmhmLmNvL3JlcG9zL2Y1Lzk0L2Y1OTRkZWE2OGRjNGZhODBkOTQ2MGI3NzMxMzEwYWY3YTY3MWJhZjBhNDhlMTE4NmQzN2EyZmFiOTVlMmRiN2UvZmVjZTlhZjAyZjYxOGI2MDNmZjc5MjFjY2VjNjg2MWQxM2U3ZTFmOTgzMGUwOTFkZmI3ZThhZDkzMTFlNWIyMT9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSoifV19&Signature=DTrQPAuPIXVMZ-etTicFBQdSs18wXA-Y2k6QLO5fvZTwcjq7B1skYBY0uF3ejbrHuAzhcDzcQ0VeSKu4uFUgnzt7UjQUgIN6ulWa74UA7ld2WC6N2lFvs0yw73oe0Tc14jL-NocPZBhY~f6LCmSlNPepJrx9zxDYHGlfNUDXL3Tgzzmb9rZBaAjcuGodQmYtzmI73RKEu77HaIWPgQn2kQjZyC2f3emApmgnCAYo4NHjcoDE-geW2zLb3evpdPgOvGf4PJg6F-P8ri4vidoTcsQuKkYXd3~Z5EO-M3RFPyncdIhXuX1LPD4MS3mLkoJWCVcJCZ51drmJIj9HzQZsPA__&Key-Pair-Id=K24J24Z295AEI9 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1007)')))"), '(Request ID: a96506d9-ed3a-4ea1-be78-a4ce48fea0ff)')

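"[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain" means that a self-signed certificate in the chain failed verification.

In principle the simplest fix is to skip verification altogether, so I tried setting these environment variables: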

export REQUESTS_CA_BUNDLE=""
export CURL_CA_BUNDLE=""

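That did not disable verification; the error persisted.

So I fell back on the crudest possible approach: temporarily editing the requests source code to get around the error. Open the file C:\Users\xxxx\AppData\Local\Programs\Python\Python310\lib\site-packages\requests\adapters.py and find the send method (the traceback points at line 698 inside it):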

def send(
        self, request, stream=False, timeout=None, verify=True, cert=None, proxies=None
    ):
        """Sends PreparedRequest object. Returns Response object.

        :param request: The :class:`PreparedRequest <PreparedRequest>` being sent.
        :param stream: (optional) Whether to stream the request content.
        :param timeout: (optional) How long to wait for the server to send
            data before giving up, as a float, or a :ref:`(connect timeout,
            read timeout) <timeouts>` tuple.
        :type timeout: float or tuple or urllib3 Timeout object
        :param verify: (optional) Either a boolean, in which case it controls whether
            we verify the server's TLS certificate, or a string, in which case it
            must be a path to a CA bundle to use
        :param cert: (optional) Any user-provided SSL certificate to be trusted.
        :param proxies: (optional) The proxies dictionary to apply to the request.
        :rtype: requests.Response
        """
        
        # Added line: force verify to False, which turns off certificate checks.
        verify = False
        ......

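This crude hack finally got me past the self-signed certificate error for the time being, and the web-based GUI opened.

Later, by comparing setups, I found that the real cause was the proxy: whenever a proxy is specified through all_proxy / http_proxy or similar variables, the error above appears.

Without a proxy, however, huggingface_hub is unreachable. The fix is therefore to leave the proxy variables unset on the local machine and either use automatic proxying on the router or run local proxy software in global mode. Either option avoids the error.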
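
Before launching pdf2zh it can be worth checking whether any of these proxy variables are set in the current environment. The sketch below is illustrative only: the list of variable names is my assumption about the usual spellings, and the helper names are made up for this note.

```python
import os

# Common proxy variable names; tools usually accept both letter cases.
PROXY_VARS = ("all_proxy", "http_proxy", "https_proxy")


def proxy_settings(env):
    """Return the proxy-related variables that are set to a non-empty value."""
    return {k: v for k, v in env.items() if k.lower() in PROXY_VARS and v}


def without_proxy(env):
    """Return a copy of env with the proxy-related variables removed."""
    return {k: v for k, v in env.items() if k.lower() not in PROXY_VARS}


if __name__ == "__main__":
    found = proxy_settings(os.environ)
    print("proxy variables set:", found or "none")
```

The cleaned mapping from without_proxy(os.environ) can be passed as the env argument of subprocess.run to start pdf2zh without those variables, leaving the shell itself untouched.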

OpenCV error

I then submitted a PDF file for conversion and hit yet another error, this time from OpenCV:

Files before translation: ['agents-long-game-ai-computational.pdf']
{'files': ['pdf2zh_files\\agents-long-game-ai-computational.pdf'], 'pages': [0, 1, 2, 3, 4], 'lang_in': 'en', 'lang_out': 'zh', 'service': 'bing', 'output': WindowsPath('pdf2zh_files'), 'thread': 4, 'callback': <function translate_file.<locals>.progress_bar at 0x000001FA6FE9AF80>}
 20%|█████████████████████████▏                                                                                                    | 1/5 [00:00<00:00, 28.67it/s]
Traceback (most recent call last):
  File "C:\Users\sky\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\queueing.py", line 625, in process_events
    response = await route_utils.call_process_api(
  File "C:\Users\sky\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\route_utils.py", line 322, in call_process_api
    output = await app.get_blocks().process_api(
  File "C:\Users\sky\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\blocks.py", line 2045, in process_api
    result = await self.call_function(
  File "C:\Users\sky\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\blocks.py", line 1592, in call_function
    prediction = await anyio.to_thread.run_sync(  # type: ignore
  File "C:\Users\sky\AppData\Local\Programs\Python\Python310\lib\site-packages\anyio\to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "C:\Users\sky\AppData\Local\Programs\Python\Python310\lib\site-packages\anyio\_backends\_asyncio.py", line 2461, in run_sync_in_worker_thread
    return await future
  File "C:\Users\sky\AppData\Local\Programs\Python\Python310\lib\site-packages\pdf2zh\gui.py", line 165, in translate_file
    translate(**param)
  File "C:\Users\sky\AppData\Local\Programs\Python\Python310\lib\site-packages\pdf2zh\high_level.py", line 278, in translate
    s_mono, s_dual = translate_stream(s_raw, **locals())
  File "C:\Users\sky\AppData\Local\Programs\Python\Python310\lib\site-packages\pdf2zh\high_level.py", line 213, in translate_stream
    obj_patch: dict = translate_patch(fp, **locals())
  File "C:\Users\sky\AppData\Local\Programs\Python\Python310\lib\site-packages\pdf2zh\high_level.py", line 117, in translate_patch
    page_layout = model.predict(image, imgsz=int(pix.height / 32) * 32)[0]
  File "C:\Users\sky\AppData\Local\Programs\Python\Python310\lib\site-packages\pdf2zh\doclayout.py", line 149, in predict
    pix = self.resize_and_pad_image(image, new_shape=imgsz)
  File "C:\Users\sky\AppData\Local\Programs\Python\Python310\lib\site-packages\pdf2zh\doclayout.py", line 103, in resize_and_pad_image
    image = cv2.resize(
cv2.error: OpenCV(4.10.0) :-1: error: (-5:Bad argument) in function 'resize'
> Overload resolution failed:
>  - src is not a numpy array, neither a scalar
>  - Expected Ptr<cv::UMat> for argument 'src'

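I was speechless. Since my home desktop never hit this problem, I suspected yet another incompatibility. I uninstalled Python 3.10, deleted everything under C:\Users\sky\AppData\Local\Programs\Python\Python310\, installed Python 3.11.9, and then installed pdf2zh again.

The problem then inexplicably disappeared.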
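
Since the only evidence here is that one machine works and another does not, a small version report makes the two environments easier to compare. This is a rough sketch: the list of distribution names is my guess at what pip installed (the OpenCV wheel in particular may be published under a different name), and environment_report is a made-up helper.

```python
import sys
from importlib import metadata

# Packages that show up in the tracebacks above. The distribution names are
# assumptions; adjust them to whatever "pip list" shows on your machine.
PACKAGES = ("pdf2zh", "numpy", "onnx", "opencv-python", "opencv-python-headless")


def environment_report(packages=PACKAGES):
    """Map 'python' and each package name to its version (None if missing)."""
    report = {"python": "%d.%d.%d" % sys.version_info[:3]}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, version in environment_report().items():
        print(f"{name}: {version}")
```

Running it on both machines and comparing the output side by side should show which versions differ.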

Usage

GUI

Running the pdf2zh -i command opens the web-based GUI:

 pdf2zh -i
* Running on local URL:  http://0.0.0.0:7860
Error launching GUI using 0.0.0.0.
This may be caused by global mode of proxy software.
Rerunning server... use `close()` to stop if you need to change `launch()` parameters.
---
Error launching GUI using 127.0.0.1.
This may be caused by global mode of proxy software.
Rerunning server... use `close()` to stop if you need to change `launch()` parameters.
----
* Running on public URL: https://0ed022102288ab69fb.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)

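Because I had proxy software running locally in global mode, the GUI could not be opened through 0.0.0.0 or 127.0.0.1. In the end I opened it in the browser through the https://0ed022102288ab69fb.gradio.live address.

The page itself loads without a proxy, but once a PDF file is submitted for translation the preview pane does need one; otherwise it fails with a connection timeout.

Unfortunately, even with the proxy enabled, the translation did not succeed for me.

TODO: try again later from somewhere with a better network connection.

Command line

pdf2zh can also be invoked from the command line to translate a PDF file, for example: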

pdf2zh ./applications-challenges-future-chatgpt.pdf -p 1-100

It completed successfully. Besides the original English PDF, the current directory now contains two newly generated PDF files:

$ ls *.pdf
applications-challenges-future-chatgpt-dual.pdf
applications-challenges-future-chatgpt-mono.pdf
applications-challenges-future-chatgpt.pdf

  • applications-challenges-future-chatgpt-mono.pdf: the Chinese translation
  • applications-challenges-future-chatgpt-dual.pdf: the bilingual version, with one Chinese page followed by one English page for easy comparison
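
For scripting around pdf2zh it is handy to know in advance where the results will land. The sketch below derives the two file names from the input name, based purely on the naming pattern visible in the listing above; I have not checked the pdf2zh source, and expected_outputs is a made-up helper.

```python
from pathlib import Path


def expected_outputs(source: str, output_dir: str = "."):
    """Return the (mono, dual) paths implied by the naming seen above."""
    stem = Path(source).stem  # file name without directory and ".pdf"
    out = Path(output_dir)
    return out / f"{stem}-mono.pdf", out / f"{stem}-dual.pdf"


if __name__ == "__main__":
    mono, dual = expected_outputs("./applications-challenges-future-chatgpt.pdf")
    print(mono)
    print(dual)
```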

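The translation quality is decent. Backed by current AI translation engines, the output is reasonably readable. If your bar is not too high and you only want a quick way to skim a whole document without fussing over details, it works well. I, at least, am fairly satisfied.

There are drawbacks, of course. It cannot compete with a published translation that professionals have translated by hand, proofread several times and typeset carefully. The text still reads somewhat like machine translation (although it is far better than it used to be), and the layout has assorted flaws. Still, the flaws do not outweigh the merits.

The key point is that this is really convenient and really fast. English technical books are published in large numbers, and by the time one is licensed, translated and published domestically, it trails the English original by at least one to two years. With technology changing as fast as it does today, a two-year delay greatly reduces the timeliness of a technical book.

In addition, PDF versions of newly published books usually become available online quickly. With the English PDF in hand, pdf2zh can quickly produce an imperfect but basically readable Chinese version, which is quite nice. Chinese is my native language, after all, so I can skim it at speed.

Rate limiting

Be aware, however, that a long PDF with many pages may fail partway through, for example: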

ERROR:pdf2zh.converter:HTTPConnectionPool(host='translate.google.com', port=80): 
Max retries exceeded with url: /m?tl=zh-CN&sl=en&q=The+emergence+of+powerful+conversational+AI+systems+such+as+ChatGPT+demonstrates+the+fast+growth++of+technology+and+its+capacity+to+change+the+way+we+operate.+While+it+is+true+that+such+technology+may++eliminate+certain+employment%2C+it+has+the+ability+to+generate+new+possibilities+and+increase+efficiency+in++a+variety+of+industries.+However%2C+it+is+crucial+to+acknowledge+the+necessity+for+effective+implementa-+tion+and+regulation+to+guarantee+that+these+technologies+are+utilized+ethically+and+responsibly+%28Rasul%2C%2C++et+al%2C2023%2C+pp.1-6%29. 
(Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x00000240B7CEEDD0>, 
'Connection to translate.google.com timed out. (connect timeout=None)'))

Presumably the default Google Translate service is rate limited, and too many consecutive calls got me throttled.

See: https://github.com/Byaidu/PDFMathTranslate/issues/424
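
One thing that might make a long run less painful is to split the book into page ranges and translate them one at a time, so a failure only costs one chunk. The sketch below just builds the commands using the -p, -s and -o options from the help output; whether chunking actually avoids the throttling is an untested assumption, and the part-001 style directory names and helper names are made up for this note.

```python
def page_ranges(total_pages: int, chunk_size: int):
    """Split pages 1..total_pages into 'start-end' strings of chunk_size."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    ranges = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        ranges.append(f"{start}-{end}")
    return ranges


def build_commands(pdf: str, total_pages: int, chunk_size: int, service: str = "google"):
    """Build one pdf2zh command per chunk, each with its own output directory."""
    commands = []
    for index, pages in enumerate(page_ranges(total_pages, chunk_size), start=1):
        # A separate directory per chunk keeps later runs from overwriting
        # the -mono/-dual files of earlier ones (directory names are made up).
        commands.append(["pdf2zh", pdf, "-p", pages, "-s", service, "-o", f"part-{index:03d}"])
    return commands


if __name__ == "__main__":
    for command in build_commands("./applications-challenges-future-chatgpt.pdf", 333, 100):
        print(" ".join(command))
```

Each command could then be run with subprocess.run, with a pause between chunks. The output directories may need to exist beforehand; I have not verified how pdf2zh handles a missing one.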

I switched to bing. It was much slower than google and reported several errors along the way, but it did manage to finish translating all 333 pages of the PDF:

 pdf2zh ./applications-challenges-future-chatgpt.pdf -s bing
  6%|█████                                                                               | 20/333 [00:27<14:25,  2.76s/it]ERROR:pdf2zh.converter:'translations'
  9%|███████▎                                                                            | 29/333 [00:46<08:25,  1.66s/it]ERROR:pdf2zh.converter:'translations'
 36%|██████████████████████████████▏                                                    | 121/333 [04:33<09:04,  2.57s/it]ERROR:pdf2zh.converter:'translations'
 63%|████████████████████████████████████████████████████▌                              | 211/333 [08:56<05:15,  2.58s/it]ERROR:pdf2zh.converter:'translations'
 93%|█████████████████████████████████████████████████████████████████████████████      | 309/333 [13:00<01:13,  3.07s/it]ERROR:pdf2zh.converter:'translations'
100%|███████████████████████████████████████████████████████████████████████████████████| 333/333 [14:38<00:00,  2.64s/it]

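Next I tried deepl. Before starting, my DeepL free account had 460,000 free characters left. Sadly, 460,000 characters was not enough to translate even one 333-page PDF. Once the quota ran out, the errors began: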

......
ERROR:pdf2zh.converter:Quota for this billing period has been exceeded, message: Quota Exceeded
ERROR:pdf2zh.converter:Quota for this billing period has been exceeded, message: Quota Exceeded
ERROR:pdf2zh.converter:Quota for this billing period has been exceeded, message: Quota Exceeded
ERROR:pdf2zh.converter:Quota for this billing period has been exceeded, message: Quota Exceeded
ERROR:pdf2zh.converter:Quota for this billing period has been exceeded, message: Quota Exceeded

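The DeepL quota problem can be solved by upgrading to a Pro subscription. As I read the description, it is pay-as-you-go, with 100,000 characters costing 25 US dollars, or a bit over 180 RMB. That seems acceptable for occasionally translating small PDFs, but completely unaffordable for translating entire technical books.

So I will keep exploring bing, which is free to use.

Optional parameters

Both the GUI and the command line offer quite a few parameters.

This is the output of the pdf2zh command-line help: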

$ pdf2zh --help
usage: pdf2zh [-h] [--version] [--debug] [--pages PAGES] [--vfont VFONT]
              [--vchar VCHAR] [--lang-in LANG_IN] [--lang-out LANG_OUT]
              [--service SERVICE] [--output OUTPUT] [--thread THREAD]
              [--interactive] [--share] [--flask] [--celery]
Parser:
  Used during PDF parsing

  --pages PAGES, -p PAGES
                        The list of page numbers to parse.
  --vfont VFONT, -f VFONT
                        The regex to math font name of formula.
  --vchar VCHAR, -c VCHAR
                        The regex to math character of formula.
  --lang-in LANG_IN, -li LANG_IN
                        The code of source language.
  --lang-out LANG_OUT, -lo LANG_OUT
                        The code of target language.
  --service SERVICE, -s SERVICE
                        The service to use for translation.
  --output OUTPUT, -o OUTPUT
                        Output directory for files.
  --thread THREAD, -t THREAD
                        The number of threads to execute translation.
  --interactive, -i     Interact with GUI.
  --share               Enable Gradio Share
  --flask               flask
  --celery              celery

The most important one is the translation service. The available options are:

  • Google
  • Bing
  • DeepL
  • DeepLX
  • Ollama
  • AzureOpenAI
  • OpenAI
  • Zhipu
  • Silicon
  • Gemini
  • Azure
  • Tencent

Note that the service name must be written in lowercase on the command line, for example:

$ pdf2zh ./applications-challenges-future-chatgpt.pdf -p 1-10 -s google
# took 4 seconds

$ pdf2zh ./applications-challenges-future-chatgpt.pdf -p 1-10 -s bing
# took 32 seconds

Services that need extra parameters such as auth_key receive them through environment variables, for example deepl:

$ DEEPL_SERVER_URL=https://api-free.deepl.com DEEPL_AUTH_KEY=84416fef-xxxx-xxxx-xxxx-xxxxxxxf3:fx pdf2zh ./applications-challenges-future-chatgpt.pdf -p 1-10 -s deepl
# took 16 seconds
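
The VAR=value prefix used above is POSIX shell syntax (it works in Git Bash, for instance) and is not understood by cmd.exe or PowerShell. A portable alternative is to set the variables from Python and launch pdf2zh as a subprocess. This is a minimal sketch: the two variable names come from the command above, while the helper names are made up for this note.

```python
import os


def deepl_environment(auth_key, server_url="https://api-free.deepl.com", base_env=None):
    """Copy the environment and add the two DeepL variables used above."""
    env = dict(os.environ if base_env is None else base_env)
    env["DEEPL_SERVER_URL"] = server_url
    env["DEEPL_AUTH_KEY"] = auth_key
    return env


def deepl_command(pdf, pages):
    """Build the pdf2zh command line for translating a page range with deepl."""
    return ["pdf2zh", pdf, "-p", pages, "-s", "deepl"]


if __name__ == "__main__":
    # Only prints the command; the key below is a placeholder, not a real key.
    env = deepl_environment("your-auth-key:fx")
    print(" ".join(deepl_command("./applications-challenges-future-chatgpt.pdf", "1-10")))
    print("DEEPL_SERVER_URL =", env["DEEPL_SERVER_URL"])
```

To actually run the translation, pass both pieces to subprocess.run(deepl_command(...), env=deepl_environment(...), check=True).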

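I did not find any documentation on which environment variables need to be set, so the only option is probably to read the code. I only came across the ones above by chance in an issue.

Summary

First, the advantages:

  • Usable: although it reads somewhat like machine translation and the layout is less than ideal, the result is at least readable, and from a pragmatic point of view it meets the basic goal of quick translation and quick reading.
  • Convenient: in theory a single command does the whole translation. Compared with my earlier workflow of converting the PDF to markdown with marker, fixing the layout by hand, then machine translating and proofreading manually, it is dozens of times faster.

Then the drawbacks:

  • Installation and running hit some baffling problems and depend a lot on luck; when something breaks, you need to be able to fix it yourself.
  • The output inevitably reads like machine translation, though that is really the translation engine's problem.
  • The layout has flaws, some of them downright bizarre. I hope this improves.
  • Choosing a translation engine is hard: the free ones have limits and the paid ones are expensive. The one pleasant surprise is bing, which is both free and unlimited.
  • Most important: documentation is extremely scarce, so when you hit a problem you can only try your luck in the issues and on Google.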

敖小剑
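Migrant worker of the new era * middle-aged coder

My current work focuses on Cloud Native areas such as Microservice, Servicemesh and Serverless, and I work full-time on Dapr development. Discussion and advice are welcome.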