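Marker Study Notes (3): Splitting the Markdown File

Background

Marker converts a PDF into a single markdown file that contains all of the PDF's content. When the PDF is long, this markdown file becomes large, which makes it harder to read and to process further (for example, to translate).

As shown below, the original PDF has more than 680 pages, and the converted markdown file is a full 619K: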

ls -lh
-rwxr-xr-x 1 sky sky 619K  8月 24 21:03 adobe-photoshop-book-photographers-2nd.md
-rwxr-xr-x 1 sky sky  46K  8月 23 10:25 adobe-photoshop-book-photographers-2nd_meta.json

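So it makes sense to split this large markdown file into a set of smaller markdown files. Splitting by chapter is the obvious choice.

Bash script

This is simple to implement with a bash script: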

#!/bin/bash

# TODO: update keywords for different markdown files
chapter_begin_keywords[0]="introduction"  # will be ignored for splitting and just used for naming
chapter_begin_keywords[1]="# The Essentials Of Camera Raw" 
chapter_begin_keywords[2]="# Camera Raw–Beyond The Basics" 
chapter_begin_keywords[3]="# Masking Miracles"
chapter_begin_keywords[4]="# Correcting Lens Problems"
chapter_begin_keywords[5]="# Working With Layers"
chapter_begin_keywords[6]="# Making Selections"
chapter_begin_keywords[7]="# Black & White, Duotones & More"
chapter_begin_keywords[8]="# Cropping & Resizing"
chapter_begin_keywords[9]="# Retouching Portraits"
chapter_begin_keywords[10]="# Removing Distracting Stuff"
chapter_begin_keywords[11]="# Photoshop Effects"
chapter_begin_keywords[12]="# Sharpening Techniques"

echo "Preparing, reading keywords for chapters......"
echo ""

chapter_names=()
index=0
while(( $index<${#chapter_begin_keywords[*]} ))
do
    # build chapter name by keywords and index
    # remove "# "
    chapter_name=${chapter_begin_keywords[$index]//# /}
    # remove "& "
    chapter_name=${chapter_name//& /}
    # remove ","
    chapter_name=${chapter_name//,/}
    # replace " " with "-"
    chapter_name=${chapter_name// /-}
    # lowercase all
    chapter_name=${chapter_name,,}
    # add "chapterxx" prefix
    if [[ $index -lt 10 ]];
    then
        chapter_name="chapter0${index}_${chapter_name}"
    else
        chapter_name="chapter${index}_${chapter_name}"
    fi
    # add ".md" suffix
    chapter_name="${chapter_name}.md"
    chapter_names[${index}]=${chapter_name}

    echo "chapter ${index} begin keyword=${chapter_begin_keywords[$index]}"
    echo "chapter name=${chapter_names[$index]}"
    echo ""
    let "index++"
done

# clean files before execution
echo "clean chapter files if exists"
rm -f chapter*.md
rm -f todo.md todo-temp.md


echo ""
echo "Begin to split file $1 by keywords......"
echo ""
cp "$1" todo.md

index=0
while(( $index < (${#chapter_begin_keywords[*]} - 1 )))
do
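    # everything before the first line containing the next chapter's keyword;
    # head -n -1 (GNU) drops the matched heading so it is not duplicated here
    grep -F -m 1 -B 1000000000 -- "${chapter_begin_keywords[$index+1]}" todo.md | head -n -1 > "${chapter_names[$index]}"
    # the matched heading and everything after it become the new todo.md
    grep -F -A 1000000000 -- "${chapter_begin_keywords[$index+1]}" todo.md > todo-temp.md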
    rm -f todo.md
    mv todo-temp.md todo.md
    echo "Succeed to split out ${chapter_names[$index]}"
    let "index++"
done

# last chapter will be saved in todo.md, just rename it!
mv todo.md "${chapter_names[$index]}"
echo "Succeed to split out ${chapter_names[$index]}"

echo ""
echo "Done!"
echo ""

echo "Please check the output files:"
echo ""

ls -lh chapter*.md

echo ""
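
The core of the loop above is the pair of grep calls with very large -B / -A context values. The following is a minimal sketch of that idea on a tiny, made-up file; sample.md, its headings, and the helper name split_at are assumptions for illustration only. The sketch uses -F (literal match) and -m 1 (first match only), and drops the matched heading from the "before" part with GNU head -n -1, which is a choice of this sketch:

```shell
#!/bin/bash
# Sketch only: sample.md and its headings are made up for illustration.
workdir=$(mktemp -d)
cat > "$workdir/sample.md" <<'EOF'
intro line
# Chapter One
one body
# Chapter Two
two body
EOF

# split_at FILE KEYWORD BEFORE AFTER (hypothetical helper)
# BEFORE gets every line before the first line containing KEYWORD;
# AFTER gets that line and everything that follows it.
split_at() {
    # -B with a huge value prints all lines before the match;
    # head -n -1 (GNU) removes the matched line itself
    grep -F -m 1 -B 1000000000 -- "$2" "$1" | head -n -1 > "$3"
    # -A with a huge value prints the first match and all lines after it
    grep -F -A 1000000000 -- "$2" "$1" > "$4"
}

split_at "$workdir/sample.md" "# Chapter One" "$workdir/part0.md" "$workdir/rest.md"
split_at "$workdir/rest.md" "# Chapter Two" "$workdir/part1.md" "$workdir/part2.md"

# show the three resulting parts
head -n 100 "$workdir"/part*.md
```

Because grep matches substrings, a keyword that also occurs earlier in the file (for example in a table of contents) would move the split point, so the keywords need to be chosen with that in mind.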

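Before using the script, you need to supply the keyword that marks the start of each chapter in the markdown file to be split. In the adobe-photoshop-book-photographers-2nd.md file generated earlier, for example, each chapter starts with a line like "# The Essentials Of Camera Raw".

To find the keyword that starts each chapter in the markdown file, first filter the file with the following command: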

cat ./adobe-photoshop-book-photographers-2nd.md | grep "# " | grep -v "## "

Then, by comparing the result with the chapter titles in the PDF, the keyword for the start of each chapter can be found quickly:

# TODO: update keywords for different markdown files
chapter_begin_keywords[0]="introduction"  # will be ignored for splitting and just used for naming
chapter_begin_keywords[1]="# The Essentials Of Camera Raw" 
chapter_begin_keywords[2]="# Camera Raw–Beyond The Basics" 
chapter_begin_keywords[3]="# Masking Miracles"
chapter_begin_keywords[4]="# Correcting Lens Problems"
chapter_begin_keywords[5]="# Working With Layers"
chapter_begin_keywords[6]="# Making Selections"
chapter_begin_keywords[7]="# Black & White, Duotones & More"
chapter_begin_keywords[8]="# Cropping & Resizing"
chapter_begin_keywords[9]="# Retouching Portraits"
chapter_begin_keywords[10]="# Removing Distracting Stuff"
chapter_begin_keywords[11]="# Photoshop Effects"
chapter_begin_keywords[12]="# Sharpening Techniques"
......
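
The keyword lines can also be generated instead of typed by hand. Below is a rough sketch, assuming every chapter title is a level-1 heading at the start of a line; that may not hold for every file Marker produces, so the output still has to be checked against the PDF and trimmed. The function name print_keywords and the sample file are made up for illustration:

```shell
#!/bin/bash
# Sketch only: assumes chapter titles are lines that start with "# ".
# print_keywords FILE (hypothetical helper)
# Prints one chapter_begin_keywords[N]="..." line per level-1 heading.
# Index 0 is the "introduction" placeholder that is only used for naming.
# Note: a heading containing a double quote would need manual escaping.
print_keywords() {
    echo 'chapter_begin_keywords[0]="introduction"'
    awk '/^# / { printf "chapter_begin_keywords[%d]=\"%s\"\n", ++n, $0 }' "$1"
}

# made-up sample input
sample=$(mktemp)
printf '# A & B\n## sub\ntext # not a heading\n# C\n' > "$sample"
print_keywords "$sample"
```

Non-chapter headings such as a preface or an index would show up in this list too and would have to be removed before pasting the result into the script.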

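Attachment: split.sh

Manual adjustments

The split will not be perfect. For example, there may be a few images just before a chapter keyword; these currently end up in the previous chapter and need to be moved by hand. In addition, the index chapter at the end of a book is usually of no use and can typically be deleted.

To be safe, also check the split against the original PDF while making these manual adjustments.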
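
The stray images mentioned above can be located before fixing them by hand. This is a small sketch, assuming images appear in the markdown as lines starting with "![" (an assumption about the converted file, not something confirmed here); the function name trailing_images and the sample content are hypothetical:

```shell
#!/bin/bash
# Sketch only: assumes images are markdown image lines starting with "![".
# trailing_images FILE (hypothetical helper)
# Prints the image lines that form the tail of FILE, i.e. images that are
# followed only by blank lines or other images until the end of the file.
trailing_images() {
    awk '
        /^[[:space:]]*$/ { next }              # blank lines do not break the run
        /^!\[/ { buf[++n] = $0; next }         # remember image lines
        { n = 0 }                              # any other text resets the run
        END { for (i = 1; i <= n; i++) print buf[i] }
    ' "$1"
}

# made-up sample: one image inside the chapter, two at the very end
demo=$(mktemp)
printf 'text\n![](a.png)\nmore text\n\n![](b.png)\n\n![](c.png)\n' > "$demo"
trailing_images "$demo"
```

Running it over the split output, for example with a loop such as for f in chapter*.md; do echo "== $f"; trailing_images "$f"; done, lists the images that probably belong at the start of the following chapter. Whether they really do still has to be decided by looking at the PDF.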

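Conclusion

Even with AI involved, some steps still need manual work, and that means effort. Hopefully AI will be able to handle more of this in the future.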

敖小剑