
PdfTextExtractor.GetTextFromPage is not returning the correct text


Using iTextSharp, I have the following code, which successfully extracts the text from most of the PDFs I'm trying to read...

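// Requires: using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
PdfReader reader = new PdfReader(fileName);
// Page numbers in iTextSharp start at 1, not 0.
for (int i = 1; i <= reader.NumberOfPages; i++)
{
    // Append the text of each page to the 'text' string.
    text += PdfTextExtractor.GetTextFromPage(reader, i);
}
reader.Close();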

However, some of my PDFs contain XFA forms (already filled out), and for those the 'text' field ends up filled with the following garbage...

"Please wait... \n  \nIf this message is not eventually replaced by the proper contents of the document, your PDF \nviewer may not be able to display this type of document. \n  \nYou can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by \nvisiting  http://www.adobe.com/products/acrobat/readstep2.html. \n  \nFor more assistance with Adobe Reader visit  http://www.adobe.com/support/products/\nacrreader.html. \n  \nWindows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark \nof Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other \ncountries."

How can I fix this? I tried flattening the PDF using PdfStamper [1] from iTextSharp, but that didn't work: the resulting stream contains the same garbage text.

[1] How to flatten already filled out PDF form using iTextSharp

1 Answer


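    The PDFs you are encountering act as a container for an XML stream. This XML stream is based on the XML Forms Architecture (XFA). The message you see is not garbage! It is the message contained in the PDF page, and it is what gets shown when the document is opened in a viewer that reads the file as if it were an ordinary PDF.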

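    For example: if you open the document in Apple Preview, you will see exactly the same message, because Apple Preview cannot render XFA forms. It should not surprise you that you get this message when you parse the PDF contained in your file with iText. That is exactly the PDF content that is present in the file. What you see when you open the document in Adobe Reader is not stored in PDF syntax; it is stored as an XML stream.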
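Because the filled-in values live in XML rather than in PDF page content, they can in principle be read with any XML parser once the XFA data has been pulled out of the file. The following is only a minimal sketch using the Java standard library: the class name `XfaDatasetValues`, the method `leafValues`, and the sample XML are hypothetical, and getting the XML out of the PDF (for instance through iText's `XfaForm` class) is assumed to have happened already and is not shown.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects the text of every leaf element in an XML string.
public class XfaDatasetValues {

    /** Returns leaf element names mapped to their trimmed text, in document order. */
    public static Map<String, String> leafValues(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        Map<String, String> values = new LinkedHashMap<>();
        collect(doc.getDocumentElement(), values);
        return values;
    }

    // Recurses into child elements; an element without child elements is a leaf.
    private static void collect(Element element, Map<String, String> values) {
        boolean hasChildElement = false;
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElement = true;
                collect((Element) child, values);
            }
        }
        if (!hasChildElement) {
            // Note: a repeated element name overwrites the earlier value.
            values.put(element.getLocalName(), element.getTextContent().trim());
        }
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data, shaped loosely like an XFA datasets packet.
        String sample =
            "<xfa:datasets xmlns:xfa=\"http://www.xfa.org/schema/xfa-data/1.0/\">"
            + "<xfa:data><form1><Name>Jane Doe</Name><City>Ghent</City></form1></xfa:data>"
            + "</xfa:datasets>";
        System.out.println(leafValues(sample));
    }
}
```

This only recovers the raw field values, not the rendered layout, so it is not a substitute for flattening when the visible page text is what you need.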

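    You say that you tried to flatten the PDF as described in the answer to the question How to flatten already filled out PDF form using iTextSharp. However, that question is about flattening a form based on AcroForm technology. It is not supposed to work with XFA forms. If you want to flatten an XFA form, you need to use XFA Worker on top of iText: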

    [JAVA]

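    // Assumption: baos is a ByteArrayOutputStream holding the filled-in XFA PDF,
    // and dest is the path of the flattened output file.
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(dest));
    // XFAFlattener is part of XFA Worker; it renders the XFA form to ordinary PDF content.
    XFAFlattener xfaf = new XFAFlattener(document, writer);
    xfaf.flatten(new PdfReader(baos.toByteArray()));
    document.close();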
    

    [C#]

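    // Assumption: ms is a MemoryStream holding the filled-in XFA PDF,
    // and dest is the path of the flattened output file.
    Document document = new Document();
    PdfWriter writer = PdfWriter.GetInstance(document, new FileStream(dest, FileMode.Create));
    // XFAFlattener is part of XFA Worker; it renders the XFA form to ordinary PDF content.
    XFAFlattener xfaf = new XFAFlattener(document, writer);
    // Rewind the stream so PdfReader reads from the beginning.
    ms.Position = 0;
    xfaf.Flatten(new PdfReader(ms));
    document.Close();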
    

    The result of this flattening process is an ordinary PDF that can be parsed by your original code.
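Since only some of the PDFs are XFA forms, it may help to detect which extractions returned the placeholder page before deciding to flatten. The sketch below is a heuristic, not an iText feature: the class `XfaPlaceholderCheck` and its method are hypothetical, and the marker phrases are simply taken from the message quoted in the question, so a differently worded placeholder would not be caught.

```java
import java.util.Locale;

// Hypothetical heuristic: recognizes the "Please wait..." page quoted in the question.
public class XfaPlaceholderCheck {

    // Phrases copied from the placeholder message, lower-cased for comparison.
    private static final String[] MARKERS = {
        "please wait...",
        "if this message is not eventually replaced by the proper contents of the document"
    };

    /** Returns true only if every marker phrase occurs in the extracted text. */
    public static boolean looksLikeXfaPlaceholder(String extractedText) {
        if (extractedText == null) {
            return false;
        }
        // Extracted text contains line breaks in arbitrary places, so collapse whitespace first.
        String normalized = extractedText.replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
        for (String marker : MARKERS) {
            if (!normalized.contains(marker)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String extracted = "Please wait... \n  \nIf this message is not eventually replaced by "
            + "the proper contents of the document, your PDF \nviewer may not be able to "
            + "display this type of document.";
        System.out.println(looksLikeXfaPlaceholder(extracted));
    }
}
```

A file flagged this way would be a candidate for the XFA Worker flattening step shown in the answer, after which the original extraction loop can be run on the flattened output.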
