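I am currently using iTextSharp to extract text from PDF files.
This works fine for dozens of PDFs, but two of them throw an invalid cast exception (stack trace at [1]).
The code that triggers the exception is the following (it is thrown by GetTextFromPage):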
PdfReader reader = new PdfReader(byteArray);
PdfTextExtractor.GetTextFromPage(reader, 1, new SimpleTextExtractionStrategy());
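Some additional notes:
- The preflight syntax check in Adobe Acrobat did not report any errors.
- A sample PDF that produces this error is available at http://resources.mpi-inf.mpg.de/DisparityModel (the link labeled "Paper (Adobe Acrobat PDF, 6.69 MB)").
- I have already tried LocationTextExtractionStrategy, with the same error.
Apart from preflight, how can I check whether a PDF file is corrupt? Or where does this error come from?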
[1]
System.InvalidCastException was unhandled
HResult=-2147467262
Message=Unable to cast object of type 'iTextSharp.text.pdf.PdfLiteral' to type 'iTextSharp.text.pdf.PdfString'.
Source=itextsharp
StackTrace:
at iTextSharp.text.pdf.DocumentFont.FillMetrics(Byte[] touni, IntHashtable widths, Int32 dw)
at iTextSharp.text.pdf.DocumentFont.ProcessType0(PdfDictionary font)
at iTextSharp.text.pdf.DocumentFont.Init()
at iTextSharp.text.pdf.DocumentFont..ctor(PRIndirectReference refFont)
at iTextSharp.text.pdf.CMapAwareDocumentFont..ctor(PRIndirectReference refFont)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.GetFont(PRIndirectReference ind)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.SetTextFont.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.FormXObjectDoHandler.HandleXObject(PdfContentStreamProcessor processor, PdfStream stream, PdfIndirectReference refi)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.DisplayXObject(PdfName xobjectName)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.Do.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener)
at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy)
at ConsoleApplication1.Program.Main(String[] args) in e:\foobar\projects\AnalyzePDF\ConsoleApplication1\ConsoleApplication1\Program.cs:line 24
at System.AppDomain._nExecuteAssembly(RuntimeAssembly assembly, String[] args)
at System.AppDomain.ExecuteAssembly(String assemblyFile, Evidence assemblySecurity, String[] args)
at Microsoft.VisualStudio.HostingProcess.HostProc.RunUsersAssembly()
at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.ThreadHelper.ThreadStart()
InnerException:
1 Answer
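The document in question contains a font with the following ToUnicode map:
The part that iText(Sharp) stumbles over is:
i.e. a section started by beginbfchar and ended by a non-matching endcmap. I think a section started by beginbfchar always has to be ended by endbfchar.
The font in question is a Calibri subset composite font. It is used on the first page in the form xobject Fm0. That xobject has a dictionary entry
so it presumably was copied from that Teaser.pdf file.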
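To check other files for the same kind of defect, the begin/end pairing described above can be tested mechanically. The following is only a minimal sketch, not what iText(Sharp) does internally: it assumes the decoded ToUnicode stream is already available as text, and the function name find_unbalanced_sections and the sample map are made up for illustration (the sample is not the actual map from the PDF).

```python
import re

# Paired CMap section operators that may appear in a ToUnicode map.
SECTION_PAIRS = {
    "begincmap": "endcmap",
    "begincodespacerange": "endcodespacerange",
    "beginbfchar": "endbfchar",
    "beginbfrange": "endbfrange",
}
OPENER_FOR = {closer: opener for opener, closer in SECTION_PAIRS.items()}


def find_unbalanced_sections(cmap_text):
    """Return a list of messages describing mismatched begin/end operators."""
    problems = []
    open_sections = []  # stack of openers still waiting for their closer
    for token in re.findall(r"[A-Za-z]+", cmap_text):
        if token in SECTION_PAIRS:
            open_sections.append(token)
        elif token in OPENER_FOR:
            expected_opener = OPENER_FOR[token]
            if not open_sections:
                problems.append(f"{token} without an opening operator")
                continue
            if open_sections[-1] == expected_opener:
                open_sections.pop()
                continue
            innermost = open_sections[-1]
            problems.append(
                f"{innermost} closed by {token}, "
                f"expected {SECTION_PAIRS[innermost]}"
            )
            # If the closer belongs to an outer section, unwind to it.
            if expected_opener in open_sections:
                while open_sections.pop() != expected_opener:
                    pass
    for opener in open_sections:
        problems.append(f"{opener} never closed")
    return problems


# Hypothetical sample with the defect described in the answer:
# the bfchar section is terminated by endcmap instead of endbfchar.
SAMPLE_BROKEN_MAP = """
begincmap
1 begincodespacerange <0000> <FFFF> endcodespacerange
1 beginbfchar
<0003> <0020>
endcmap
"""

if __name__ == "__main__":
    for problem in find_unbalanced_sections(SAMPLE_BROKEN_MAP):
        print(problem)
```

An empty result only means the section operators are balanced; it says nothing about the rest of the PDF, so this complements preflight rather than replacing it.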