Extracting text with iTextSharp throws InvalidCastException

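I am currently using iTextSharp to extract text from PDF files.

Dozens of PDFs work fine, but two of them throw an InvalidCastException; the stack trace is at [1].

The code that throws the exception is the following (the exception is thrown by GetTextFromPage):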

PdfReader reader = new PdfReader(byteArray);
PdfTextExtractor.GetTextFromPage(reader, 1, new SimpleTextExtractionStrategy());

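Some additional notes:

The Preflight syntax check in Adobe Acrobat did not find any errors.
A sample PDF that produces this error is available at: http://resources.mpi-inf.mpg.de/DisparityModel (Paper (Adobe Acrobat PDF, 6.69 MB)).
I have already tried LocationTextExtractionStrategy, with the same error.

Besides Preflight, how can I check whether a PDF file is corrupt? Or where does this error come from?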

[1]

System.InvalidCastException was unhandled
  HResult=-2147467262
  Message=Unable to cast object of type 'iTextSharp.text.pdf.PdfLiteral' to type 'iTextSharp.text.pdf.PdfString'.
  Source=itextsharp
  StackTrace:
       at iTextSharp.text.pdf.DocumentFont.FillMetrics(Byte[] touni, IntHashtable widths, Int32 dw)
       at iTextSharp.text.pdf.DocumentFont.ProcessType0(PdfDictionary font)
       at iTextSharp.text.pdf.DocumentFont.Init()
       at iTextSharp.text.pdf.DocumentFont..ctor(PRIndirectReference refFont)
       at iTextSharp.text.pdf.CMapAwareDocumentFont..ctor(PRIndirectReference refFont)
       at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.GetFont(PRIndirectReference ind)
       at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.SetTextFont.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
       at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
       at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
       at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.FormXObjectDoHandler.HandleXObject(PdfContentStreamProcessor processor, PdfStream stream, PdfIndirectReference refi)
       at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.DisplayXObject(PdfName xobjectName)
       at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.Do.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
       at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
       at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
       at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener)
       at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy)
       at ConsoleApplication1.Program.Main(String[] args) in e:\foobar\projects\AnalyzePDF\ConsoleApplication1\ConsoleApplication1\Program.cs:line 24
       at System.AppDomain._nExecuteAssembly(RuntimeAssembly assembly, String[] args)
       at System.AppDomain.ExecuteAssembly(String assemblyFile, Evidence assemblySecurity, String[] args)
       at Microsoft.VisualStudio.HostingProcess.HostProc.RunUsersAssembly()
       at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
       at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading.ThreadHelper.ThreadStart()
  InnerException:

1 Answer

The document in question contains a font with the following ToUnicode map:

/CIDInit /ProcSet findresource
begin
12 dict
begin
/CIDSystemInfo <</Ordering (UCS) /Registry (Adobe) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <ffffffffffffffff> endcodespacerange
20 beginbfchar
<0003> <0020> <0012> <0043> <0018> <0044> <0045> <004e> <0059> <0051> <005e> <0053> <0102> <0061> <0110> <0063> <011a> <0064> <011e> <0065> <015d> <0069> <0175> <006d> <0176> <006e> <017d> <006f> <01ffffff89> <0070> <01ffffff8c> <0072> <01ffffff90> <0073> <01ffffff9a> <0074> <01ffffffb5> <0075> <01ffffffc7> <0079> endbfchar
100 beginbfchar
<01ffffffcc> <007a> endcmap
CMapName
currentdict
/CMap defineresource
pop
end
end

The part iText(Sharp) stumbles over is:

100 beginbfchar
<01ffffffcc> <007a> endcmap

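i.e. a section that is started by beginbfchar and ended by a non-matching endcmap.

I assume a section started by beginbfchar always has to be ended by endbfchar. That would fit the exception: the stack trace fails in DocumentFont.FillMetrics while the ToUnicode map is being read, presumably because the parser expects another string (PdfString) inside the bfchar section and gets the keyword endcmap (a PdfLiteral) instead.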
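The check described above can be sketched in a few lines of C#. This is only an illustration under some assumptions: it works on the ToUnicode map as plain text that has already been extracted and decompressed, it does not use iTextSharp at all, and the class and method names (CMapSectionCheck, FindUnbalancedSections) are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of iTextSharp.
public static class CMapSectionCheck
{
    // Matches the section keywords of a CMap (begin/end of bfchar, bfrange,
    // codespacerange) and the closing endcmap keyword.
    private static readonly Regex Keyword = new Regex(
        @"\b(begin|end)(bfchar|bfrange|codespacerange)\b|\bendcmap\b");

    // Returns one message for every section that is not closed by its
    // matching end keyword. An empty list means the sections are balanced.
    public static List<string> FindUnbalancedSections(string cmapText)
    {
        var problems = new List<string>();
        string open = null; // kind of the currently open section, e.g. "bfchar"

        foreach (Match m in Keyword.Matches(cmapText))
        {
            if (m.Value == "endcmap")
            {
                if (open != null)
                    problems.Add("begin" + open + " closed by endcmap");
                open = null;
                continue;
            }

            string kind = m.Groups[2].Value;
            if (m.Groups[1].Value == "begin")
            {
                if (open != null)
                    problems.Add("begin" + open + " not closed before begin" + kind);
                open = kind;
            }
            else
            {
                if (open != kind)
                    problems.Add("end" + kind + " without matching begin" + kind);
                open = null;
            }
        }

        if (open != null)
            problems.Add("begin" + open + " never closed");
        return problems;
    }
}
```

Run on the map shown above, this reports a single problem for the second beginbfchar section. It only checks that begin and end keywords pair up; it does not validate the declared entry counts or the hex strings themselves.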

The font in question is a Calibri subset composite font. It is used on the first page in the form xobject named Fm0. That xobject has a dictionary entry

/PTEX.FileName (C:/MyFiles/Publications/DisparityMetric/Figures/Teaser.pdf)

so it may have been copied from that Teaser.pdf file.
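Since Preflight does not flag this file, one pragmatic way to find out which pages are affected is to try the extraction page by page and record where it throws. The following is a minimal sketch, not a fix: PageProbe and FindFailingPages are hypothetical names, and the extraction call is passed in as a delegate so the helper itself does not depend on iTextSharp.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of iTextSharp.
public static class PageProbe
{
    // Calls extractPage for pages 1..pageCount and returns the exception
    // for every page whose extraction failed, keyed by page number.
    public static Dictionary<int, Exception> FindFailingPages(
        int pageCount, Func<int, string> extractPage)
    {
        var failures = new Dictionary<int, Exception>();
        for (int page = 1; page <= pageCount; page++)
        {
            try
            {
                extractPage(page);
            }
            catch (Exception ex)
            {
                // Broad catch on purpose: this is a diagnostic pass only.
                failures[page] = ex;
            }
        }
        return failures;
    }
}

// Possible use with the code from the question (assuming iTextSharp 5.x):
//
//   PdfReader reader = new PdfReader(byteArray);
//   var failures = PageProbe.FindFailingPages(
//       reader.NumberOfPages,
//       p => PdfTextExtractor.GetTextFromPage(
//           reader, p, new SimpleTextExtractionStrategy()));
```

For the sample document, page 1 would be expected to show up in the result with the InvalidCastException from the stack trace, because that page uses the form xobject with the broken ToUnicode map.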

Answered on 2024-04-19T00:08:32+08:00
