High-resolution images from PDFs

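I'm working on a project in which I need to extract a TIFF per page from multi-page PDFs. The PDFs contain only images, one image per page (I believe they were produced on some kind of copier/scanner, but I haven't confirmed that). The TIFFs are then used to create other derivative versions of the document, so the higher the resolution, the better.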

I've found two recipes. Each has some useful aspects, but neither is ideal. Hopefully someone can help me tweak one of them, or offer a third option.

Recipe 1, pdfimages and ImageMagick:

First do:

$ pdfimages $MY_PDF.pdf foo

This produces several .pbm files (named foo-000.pbm, foo-001.pbm, etc.).

Then, for each *.pbm, do:

$ convert $each -resize 3200x3200\> -quality 100 $new_name.tif

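Pro: The resulting TIFFs are a healthy 3300 px on the long dimension (-resize is only used to normalize everything).

Con: The orientation of the pages is lost; they come out rotated in different directions (they follow a logical pattern, so it is probably the direction in which they were fed into the scanner?).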
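
The two steps of Recipe 1 can be chained in one loop. The following is a minimal sketch, assuming the extracted files all end in .pbm and that the TIFF should keep the same base name; pbm_to_tif_name and convert_all are made-up helper names, not part of pdfimages or ImageMagick.

```shell
#!/bin/sh
# Sketch only: the function names are illustrative, not part of either tool.

# Map an extracted image name to its TIFF name: foo-000.pbm -> foo-000.tif
pbm_to_tif_name() {
    printf '%s\n' "${1%.pbm}.tif"
}

# Convert every .pbm in a directory (default: current directory) to TIFF.
# The '>' geometry flag shrinks only images that exceed 3200 px on a side.
convert_all() {
    dir=${1:-.}
    for each in "$dir"/*.pbm; do
        [ -e "$each" ] || continue   # the glob matched nothing
        convert "$each" -resize '3200x3200>' -quality 100 "$(pbm_to_tif_name "$each")"
    done
}
```

It could be used as: pdfimages "$MY_PDF.pdf" foo && convert_all .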

Recipe 2, ImageMagick solo:

convert +adjoin $MY_PDF.pdf pages.tif

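This gives me a TIFF per page (pages-0.tif, pages-1.tif, etc.).

Pro: The orientation stays!

Con: The long dimension of the resulting files is < 800 px, which is too small to be useful, and it looks as if some compression has been applied.

How can I ditch the scaling of the image stream in the PDF but keep the orientation? Is there some further magic in ImageMagick that I'm missing? Something else entirely?

2 Answers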

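I want to share my solution... it may not work for everyone, but since there isn't anything else out there, it might help somebody. I went with the first option from my question, which is to use pdfimages to get large images that come out rotated every which way. I then found a way to use OCR and word counts to guess the orientation, which took me from an (estimated) 25% correctly rotated to over 90%.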

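The process is as follows:

1. Use pdfimages (apt-get install poppler-utils) to get a set of pbm files (not shown below).
2. For each file:
   a. Make four versions, rotated 0, 90, 180, and 270 degrees (I refer to them as "north", "east", "south", and "west" in my code).
   b. OCR each. The two with the lowest word counts are likely the right-side-up and upside-down versions. This has been over 99% accurate in the image sets I have processed so far.
   c. Take the two with the lowest word counts and run their OCR output through a spell checker. The file with the fewest spelling errors (i.e. the most recognizable words) is likely the correct one. For my set, this was about 93% accurate (up from 25%), based on a sample of 500.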
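
The two-stage selection in steps 2b and 2c comes down to "sort by count, keep the lowest". Here is a minimal sketch of just that step; lowest_n is a made-up helper, and the counts in the usage lines are hypothetical numbers chosen for illustration, not measurements.

```shell
#!/bin/sh
# Sketch only: lowest_n is a hypothetical helper, not taken from the script.

# Read "count name" lines on stdin and print the names of the N lowest counts,
# lowest first.
lowest_n() {
    sort -n | head -n "$1" | awk '{ print $2 }'
}
```

Stage 1 keeps the two rotations with the fewest OCR "words", and stage 2 keeps the one with the fewest misspellings:

printf '%s\n' '412 north' '1873 east' '398 south' '1901 west' | lowest_n 2
printf '%s\n' '31 north' '140 south' | lowest_n 1

With these made-up counts, the first call prints south and north, and the second prints north.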

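YMMV. My files are bitonal and highly textual. The source images average 3300 px on the long side. I can't speak for grayscale or color, or for files with lots of images. Most of my source PDFs are bad scans of old photocopies, so accuracy may be even better with cleaner files. Using -despeckle during the rotation made no difference and slowed things down considerably (~5×). I chose ocrad for speed rather than accuracy, since I only need rough numbers and throw the OCR output away. Re: performance, my nothing-special Linux desktop can run the whole script at about 2-3 files/second.

Here is the implementation as a simple bash script: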

#!/bin/bash
# Rotates a pbm file in place.

# Pass a .pbm as the only arg.
file=$1

TMP="/tmp/rotation-calc"
mkdir -p "$TMP"

# Dependencies:
# convert: apt-get install imagemagick
# ocrad: apt-get install ocrad
# aspell: apt-get install aspell aspell-en
ASPELL="/usr/bin/aspell"
AWK="/usr/bin/awk"
BASENAME="/usr/bin/basename"
CONVERT="/usr/bin/convert"
DIRNAME="/usr/bin/dirname"
HEAD="/usr/bin/head"
OCRAD="/usr/bin/ocrad"
SORT="/usr/bin/sort"
WC="/usr/bin/wc"

# Make copies in all four orientations (the src file is north; copy it to make 
# things less confusing)
file_name=$($BASENAME "$file")
north_file="$TMP/$file_name-north"
east_file="$TMP/$file_name-east"
south_file="$TMP/$file_name-south"
west_file="$TMP/$file_name-west"

cp  $file $north_file
$CONVERT -rotate 90 $file $east_file
$CONVERT -rotate 180 $file $south_file
$CONVERT -rotate 270 $file $west_file

# OCR each (just append ".txt" to the path/name of the image)
north_text="$north_file.txt"
east_text="$east_file.txt"
south_text="$south_file.txt"
west_text="$west_file.txt"

$OCRAD -f -F utf8 $north_file -o $north_text
$OCRAD -f -F utf8 $east_file -o $east_text
$OCRAD -f -F utf8 $south_file -o $south_text
$OCRAD -f -F utf8 $west_file -o $west_text

# Get the word count for each txt file (least 'words' == least whitespace junk
# resulting from vertical lines of text that should be horizontal.)
wc_table="$TMP/wc_table"
echo "$($WC -w $north_text) $north_file" > $wc_table
echo "$($WC -w $east_text) $east_file" >> $wc_table
echo "$($WC -w $south_text) $south_file" >> $wc_table
echo "$($WC -w $west_text) $west_file" >> $wc_table

# Take the bottom two; these are likely right side up and upside down, but 
# generally too close to call beyond that.
bottom_two_wc_table="$TMP/bottom_two_wc_table"
$SORT -n $wc_table | $HEAD -2 > $bottom_two_wc_table

# Spellcheck. The lowest number of misspelled words is most likely the 
# correct orientation.
misspelled_words_table="$TMP/misspelled_words_table"
while read record; do
    txt=$(echo $record | $AWK '{ print $2 }')
    misspelled_word_count=$($ASPELL -l en list < "$txt" | $WC -w)
    echo "$misspelled_word_count $record" >> $misspelled_words_table
done < $bottom_two_wc_table

# Do the sort and overwrite the input file with the winning rotation.
winner=$($SORT -n $misspelled_words_table | $HEAD -1)
rotated_file=$(echo $winner | $AWK '{ print $4 }')

mv $rotated_file $file

# Clean up.
if [ -d $TMP ]; then
    rm -r $TMP
fi
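
To run the script over every page that pdfimages extracted, a small driver loop is enough. This is a sketch under assumptions: rotate_all is a made-up helper, and the script above is assumed to be saved as an executable file (the name rotate-pbm.sh in the usage line is also made up).

```shell
#!/bin/sh
# Sketch only: rotate_all and the script name in the usage line are hypothetical.

# Run a rotation command once per .pbm file in a directory, one file at a time.
# Usage: rotate_all COMMAND DIR
rotate_all() {
    rotate_cmd=$1
    dir=$2
    for pbm in "$dir"/*.pbm; do
        [ -e "$pbm" ] || continue          # no .pbm files present
        "$rotate_cmd" "$pbm" || echo "failed: $pbm" >&2
    done
}
```

It could be called as: rotate_all ./rotate-pbm.sh . The files are processed one after another on purpose: the script uses a single fixed scratch directory (/tmp/rotation-calc) and deletes it at the end, so parallel runs would interfere with each other.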

Answered 2024-05-04T13:21:11+08:00

2

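Sorry for the noise on this old topic, but Google brought me here as one of the top results and someone else may need this, so I thought I'd drop the tip I found here: http://robfelty.com/2008/03/11/convert-pdf-to-png-with-imagemagick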

In short: you have to tell ImageMagick the density at which it should scan the PDF.

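So convert -density 600x600 foo.pdf foo.png tells ImageMagick to treat the PDF as having 600 dpi resolution, and thus to output a much larger PNG. In my case the resulting foo.png was 5000x6600 px. You can optionally add -resize 3000x3000, or whatever size you need, and it will be scaled down.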
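
The effect of -density is plain arithmetic: pixels = inches × dpi. The sketch below assumes a US Letter page (8.5 x 11 inches), which is an assumption about the scans and not something stated in the thread; pixels_at_density is a made-up helper. ImageMagick's default density is 72 dpi, which would account for the < 800 px long side seen in Recipe 2.

```shell
#!/bin/sh
# Sketch only: pixels_at_density is a hypothetical helper, and the
# 8.5 x 11 inch page size is an assumption about the scans.

# pixels = inches * dpi, rounded to the nearest whole pixel
pixels_at_density() {
    awk -v inches="$1" -v dpi="$2" 'BEGIN { printf "%d\n", inches * dpi + 0.5 }'
}

pixels_at_density 11 72     # long side at the default 72 dpi: 792
pixels_at_density 11 600    # long side at -density 600: 6600
pixels_at_density 8.5 600   # short side at -density 600: 5100
```

Under that page-size assumption, 600 dpi gives 5100x6600 px, close to the size reported above.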

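Note that as long as your PDF contains only vector images or text, the density can be set as high as you need. If the PDF contains rasterized images, the result won't look good if you set the density higher than the dpi of those images! :)

Chris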
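
For scanned pages, a reasonable upper bound for -density is the resolution of the embedded image itself, which is its pixel count divided by the page size in inches. A minimal sketch, assuming an 11-inch long side (an assumption, not stated in the thread) and the 3300 px long side mentioned in the question; native_dpi is a made-up helper.

```shell
#!/bin/sh
# Sketch only: native_dpi is a hypothetical helper, and the 11 inch
# page length is an assumption about the scans.

# dpi = pixels / inches, rounded to the nearest integer
native_dpi() {
    awk -v px="$1" -v inches="$2" 'BEGIN { printf "%d\n", px / inches + 0.5 }'
}

native_dpi 3300 11    # 3300 px over 11 inches: 300
```

Under those assumptions, convert -density 300 foo.pdf foo.png would render each page at about the embedded image's own resolution. In recent poppler versions, pdfimages -list should also report the x-ppi and y-ppi of each embedded image, which avoids guessing the page size.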

Answered 2024-05-04T13:21:11+08:00
