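Get the number of pages in a PDF document

This question is for reference and comparison. The solution is the accepted answer below.

I have spent many hours searching for a fast and easy, yet highly accurate, way to get the number of pages in a PDF document. Since I work for a graphic printing and reproduction company that handles PDFs, the exact number of pages in a document must be known before it is processed. The PDF documents come from many different clients, so they do not all use the same compression method.

Here are some of the answers I found insufficient or simply NOT working:

Using Imagick (a PHP extension)

Imagick requires a lot of installation work, Apache needs to be restarted, and when I finally had it working, processing took a very long time (2-3 minutes per document) and it always returned 1 page for every document (I have not seen a working copy of Imagick so far), so I threw it away. That was with both the getNumberImages() and identifyImage() methods.

Using FPDI (a PHP library)

FPDI is easy to use and install (just extract the files and call a PHP script), but FPDI does not support many compression techniques. In that case it returns an error:

FPDF error: This document (test_1.pdf) probably uses a compression technique which is not supported by the free parser shipped with FPDI.

Opening a stream and searching with a regular expression:

This opens the PDF file as a stream and searches for some kind of string that contains the page count or something similar.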

$f = "test1.pdf";
$stream = fopen($f, "rb");  // binary-safe read
if (!$stream)
    return 0;

$content = fread($stream, filesize($f));
fclose($stream);
if (!$content)
    return 0;

$count = 0;
// Regular Expressions found by Googling (all linked to SO answers):
$regex  = "/\/Count\s+(\d+)/";
$regex2 = "/\/Page\W*(\d+)/";
$regex3 = "/\/N\s+(\d+)/";

// $matches[1] holds the captured numbers; take the largest one
if (preg_match_all($regex, $content, $matches))
    $count = max(array_map('intval', $matches[1]));

return $count;

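/\/Count\s+(\d+)/ (looks for /Count <number>) does not work because only a few documents contain the /Count parameter, so most of the time it returns nothing.
/\/Page\W*(\d+)/ (looks for /Page<number>) does not return the page count; it mostly matches some other data.
/\/N\s+(\d+)/ (looks for /N <number>) does not work either, because a document can contain multiple /N values, most of which, if not all, do not contain the page count.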
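
One plausible reason the /Count search comes up empty so often is compression: a PDF may keep its page tree inside a compressed stream, where the literal text never appears in the raw bytes. The sketch below, written in Python for illustration, runs the same /Count pattern over a made-up page-tree fragment and over a deflated copy of it. The fragment, the function name count_from_raw_bytes, and the use of zlib as a stand-in for a PDF's own compression are all assumptions, not taken from a real document.

```python
import re
import zlib

# Same idea as the PHP regex above, applied to raw bytes
COUNT_RE = re.compile(rb"/Count\s+(\d+)")


def count_from_raw_bytes(data: bytes) -> int:
    """Return the largest /Count value visible in the raw bytes, or 0."""
    values = [int(number) for number in COUNT_RE.findall(data)]
    return max(values, default=0)


# Hypothetical fragment of an uncompressed page tree
plain = b"1 0 obj\n<< /Type /Pages /Kids [2 0 R 3 0 R] /Count 13 >>\nendobj\n"

# The same bytes after deflate compression (stand-in for a compressed stream)
packed = zlib.compress(plain)

print(count_from_raw_bytes(plain))   # 13
print(count_from_raw_bytes(packed))  # 0
```

The pattern itself is fine; it simply cannot see text that is only present after decompression, which is why a tool that actually parses the PDF is the safer choice.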

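So, what does work reliably and accurately? See the answer below.

8 Answers

A simple command-line executable called pdfinfo.

It is downloadable for Linux and Windows. You download a compressed file containing several small PDF-related programs. Extract it somewhere.

One of those files is pdfinfo (or pdfinfo.exe for Windows). An example of the data returned by running it on a PDF document: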

Title:          test1.pdf
Author:         John Smith
Creator:        PScript5.dll Version 5.2.2
Producer:       Acrobat Distiller 9.2.0 (Windows)
CreationDate:   01/09/13 19:46:57
ModDate:        01/09/13 19:46:57
Tagged:         yes
Form:           none
Pages:          13    <-- This is what we need
Encrypted:      no
Page size:      2384 x 3370 pts (A0)
File size:      17569259 bytes
Optimized:      yes
PDF version:    1.6

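I have not seen a PDF document for which it returned a wrong page count (yet). It is also really fast: even with big documents of 200 MB, the response time is a few seconds or less.

There is an easy way to extract the page count from the output, shown here in PHP: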

// Make a function for convenience
function getPDFPages($document)
{
    // Path to the executable; PHP_OS_FAMILY needs PHP 7.2 or later
    $cmd = (PHP_OS_FAMILY === "Windows")
        ? "C:\\path\\to\\pdfinfo.exe"  // Windows
        : "/path/to/pdfinfo";          // Linux

    // Parse entire output
    // escapeshellarg() quotes the file name, so names with spaces work
    exec($cmd . " " . escapeshellarg($document), $output);

    // Iterate through lines
    $pagecount = 0;
    foreach ($output as $op)
    {
        // Extract the number from the line that starts with "Pages:"
        if (preg_match("/^Pages:\s*(\d+)/i", $op, $matches) === 1)
        {
            $pagecount = intval($matches[1]);
            break;
        }
    }

    return $pagecount;
}

// Use the function
echo getPDFPages("test 1.pdf");  // Output: 13

Of course, this command-line tool can be used from any other language that can parse the output of an external program, but I use it in PHP.
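
As an illustration of using pdfinfo from another language, here is a minimal Python sketch of the same approach. The function names parse_page_count and get_pdf_pages are made up for this example, and it assumes pdfinfo is reachable on the PATH (pass a full path otherwise). The parsing step is kept separate so it can be tried on the sample output shown earlier without having pdfinfo installed.

```python
import re
import subprocess

# Match the "Pages:" line only at the start of a line, as in the PHP version
PAGES_RE = re.compile(r"^Pages:\s*(\d+)", re.MULTILINE)


def parse_page_count(pdfinfo_output: str) -> int:
    """Extract the page count from pdfinfo's text output; 0 if not found."""
    match = PAGES_RE.search(pdfinfo_output)
    return int(match.group(1)) if match else 0


def get_pdf_pages(document: str, pdfinfo: str = "pdfinfo") -> int:
    """Run pdfinfo on a document and return its page count (0 on failure)."""
    # Passing a list avoids shell quoting problems with spaces in file names
    result = subprocess.run(
        [pdfinfo, document],
        capture_output=True,
        text=True,
        errors="replace",  # metadata may contain bytes that are not valid text
        check=False,
    )
    return parse_page_count(result.stdout)


# Excerpt of the sample output from the answer above
sample = (
    "Title:          test1.pdf\n"
    "Pages:          13\n"
    "Encrypted:      no\n"
)
print(parse_page_count(sample))  # 13
```

Calling get_pdf_pages("test 1.pdf") would then mirror the PHP usage; note that subprocess.run raises FileNotFoundError if the executable cannot be found.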

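I know it's not pure PHP, but external programs are way better at PDF handling (as seen in the question).

I hope this helps people, because I have spent a lot of time trying to find a solution, and I have seen many questions about PDF page counts in which I could not find the answer I was looking for. That is why I asked this question and answered it myself.

Answered on 2024-05-02T09:13:36+08:00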

15
The simplest approach is to use ImageMagick.

Here is some sample code:
```
$image = new Imagick();
$image->pingImage('myPdfFile.pdf');
echo $image->getNumberImages();
```
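Otherwise, you can also use PDF libraries for PHP such as MPDF or TCPDF.
Answered on 2024-05-02T09:13:36+08:00
0
If you have access to a shell, the simplest approach (though it does not work on 100% of PDFs) is to use grep.

This should return just the number of pages: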
```
grep -m 1 -aoP '(?<=\/N )\d+(?=\/)' file.pdf
```
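Example: https://regex101.com/r/BrUTKn/1

Explanation of the switches:
- -m 1 is necessary because some files can contain more than one match of the regex pattern (a volunteer is needed to replace this with a regex that matches only the first occurrence)
- -a is necessary to treat the binary file as text
- -o shows only the match
- -P uses Perl regular expressions
Explanation of the regex:
- starting "delimiter": (?<=\/N ) is a lookbehind for /N followed by a space (nb. the space character is not visible here)
- actual result: \d+ is any number of digits
- ending "delimiter": (?=\/) is a lookahead for /
Nota bene: if no match is found in some cases, it should be safe to assume that only 1 page exists.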
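
To make the lookbehind and lookahead concrete, here is a small Python sketch of the same pattern applied to bytes. The sample header is hypothetical; it is only shaped like the dictionary at the start of a linearized PDF, where /N is expected to hold the page count. The function name first_n_value is likewise made up for this example.

```python
import re

# Same pattern as the grep call: digits preceded by "/N " and followed by "/"
N_RE = re.compile(rb"(?<=/N )\d+(?=/)")


def first_n_value(data: bytes):
    """Return the first /N value as an int, or None (like grep -m 1 with no hit)."""
    match = N_RE.search(data)
    return int(match.group()) if match else None


# Hypothetical header, shaped like a linearized PDF's parameter dictionary
sample = b"<</Linearized 1/L 7945/O 9/E 3524/N 13/T 7656/H [ 451 137]>>"

print(first_n_value(sample))        # 13
print(first_n_value(b"%PDF-1.4"))   # None
```

Because the match depends on this exact byte layout, a file that is not linearized, or that writes the dictionary with different spacing, yields no match; this is consistent with the caveat that the approach does not work on every PDF.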
Answered on 2024-05-02T09:13:36+08:00
0
If you cannot install any additional packages, you can use this simple one-liner:
```
foundPages=$(strings < "$PDF_FILE" | sed -n 's|.*Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' | sort -rn | head -n 1)
```
Answered on 2024-05-02T09:13:36+08:00

Here is an R function that reports the number of pages in a PDF file by using the pdfinfo command.

pdf.file.page.number <- function(fname) {
    # shQuote() protects file names that contain spaces
    a <- pipe(paste("pdfinfo", shQuote(fname), "| grep '^Pages:' | cut -d: -f2"))
    page.number <- as.numeric(readLines(a))
    close(a)
    page.number
}
if (FALSE) {
    pdf.file.page.number("a.pdf")
}

Answered on 2024-05-02T09:13:36+08:00

Here is a Windows command script using Ghostscript that reports the number of pages in a PDF file.

@echo off
echo.
rem
rem this file: getlastpagenumber.cmd
rem version 0.1 from commander 2015-11-03
rem need Ghostscript e.g. download and install from http://www.ghostscript.com/download/
rem Install path "C:\prg\ghostscript" for using the script without changes \\ and have less problems with UAC
rem

:vars
  set __gs__="C:\prg\ghostscript\bin\gswin64c.exe"
  set __lastpagenumber__=1
  set __pdffile__="%~1"
  set __pdffilename__="%~n1"
  set __datetime__=%date%%time%
  set __datetime__=%__datetime__:.=%
  set __datetime__=%__datetime__::=%
  set __datetime__=%__datetime__:,=%
  set __datetime__=%__datetime__:/=%
  set __datetime__=%__datetime__: =%
  set __tmpfile__="%tmp%\%~n0_%__datetime__%.tmp"

:check
  if %__pdffile__%=="" goto error1
  if not exist %__pdffile__% goto error2
  if not exist %__gs__% goto error3

:main
  %__gs__% -dBATCH -dFirstPage=9999999 -dQUIET -dNODISPLAY -dNOPAUSE  -sstdout=%__tmpfile__%  %__pdffile__%
  FOR /F " tokens=2,3* usebackq delims=:" %%A IN (`findstr /i "number" %__tmpfile__%`) DO set __lastpagenumber__=%%A 
  set __lastpagenumber__=%__lastpagenumber__: =%
  if exist %__tmpfile__% del %__tmpfile__%

:output
  echo The PDF-File: %__pdffilename__% contains %__lastpagenumber__% pages
  goto end

:error1
  echo no pdf file selected
  echo usage: %~n0 PDFFILE
  goto end

:error2
  echo no pdf file found
  echo usage: %~n0 PDFFILE
  goto end

:error3
  echo.can not find the ghostscript bin file
  echo.   %__gs__%
  echo.please download it from:
  echo.   http://www.ghostscript.com/download/
  echo.and install to "C:\prg\ghostscript"
  goto end

:end
  exit /b

Answered on 2024-05-02T09:13:36+08:00

The R package pdftools and its function pdf_info() provide information on the number of pages in a PDF.

library(pdftools)
pdf_file <- file.path(R.home("doc"), "NEWS.pdf")
info <- pdf_info(pdf_file)
nbpages <- info[2]
nbpages

$pages
[1] 65

Answered on 2024-05-02T09:13:36+08:00

This seems to work quite well, with no need for special packages or for parsing command output.

<?php                                                                               

$target_pdf = "multi-page-test.pdf";
// identify (part of ImageMagick) prints one line per page
$cmd = sprintf("identify %s", escapeshellarg($target_pdf));
exec($cmd, $output);
$pages = count($output);

Answered on 2024-05-02T09:13:36+08:00
