Extracting text from a PDF file using Apache Tika in Java

try {
    File file = new File("Example.pdf");
    String content = new Tika().parseToString(file);
    System.out.println("The Content: " + content);
} catch (Exception e) {
    e.printStackTrace();
}

I have imported java.io.File and org.apache.tika.Tika, but when I run this code I get the following error:

Exception in thread "main" java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;Ljava/lang/Throwable;)V
    at org.apache.commons.logging.impl.SLF4JLocationAwareLog.warn(SLF4JLocationAwareLog.java:162)
    at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.loadDiskCache(FileSystemFontProvider.java:461)
    at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.<init>(FileSystemFontProvider.java:217)
    at org.apache.pdfbox.pdmodel.font.FontMapperImpl$DefaultFontProvider.<clinit>(FontMapperImpl.java:130)
    at org.apache.pdfbox.pdmodel.font.FontMapperImpl.getProvider(FontMapperImpl.java:149)
    at org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFont(FontMapperImpl.java:413)
    at org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFontBoxFont(FontMapperImpl.java:376)
    at org.apache.pdfbox.pdmodel.font.FontMapperImpl.getFontBoxFont(FontMapperImpl.java:350)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:146)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.<clinit>(PDType1Font.java:79)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:62)
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:143)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:167)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.Tika.parseToString(Tika.java:527)
    at org.apache.tika.Tika.parseToString(Tika.java:642)
    at java_programs.PdfParse.main(PdfParse.java:22)

1 Answer

  • 2

    The following seems to work for me. I get the string I want, but some warnings are also printed to the console.

    [On Windows] I compiled and ran it with:

    javac -cp .;tika-app-1.16.jar Test.java
    
    java -cp .;tika-app-1.16.jar Test
    

    Which Tika jar are you using? I added another method (tikaPdfTest()) to show a different way of getting the text out of a PDF, which may help you.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    
    import org.apache.tika.Tika;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.pdf.PDFParser;
    import org.apache.tika.sax.BodyContentHandler;
    
    import org.xml.sax.SAXException;
    
    public class Test {
        public static void main(final String[] args) {
            //Your way
            try {
                File file = new File("Example.pdf");
                String content = new Tika().parseToString(file);
                System.out.println("The Content: " + content);
            } catch (final Exception e) {
                e.printStackTrace();
            }
    
            //Another way
            try {
                System.out.println("The contents:\t[" + Test.tikaPdfTest("Example.pdf") + "]");
            } catch (final Exception e) {
                e.printStackTrace();
            }
        }
    
        public static String tikaPdfTest(final String fileName) throws IOException, SAXException, TikaException {
            try(final FileInputStream inputstream = new FileInputStream(new File(fileName))) {
                final BodyContentHandler handler = new BodyContentHandler();
                new PDFParser().parse(inputstream, handler, new Metadata(), new ParseContext());
                return handler.toString().trim();
            }
        }
    }
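
    A NoSuchMethodError like the one in the question generally means a class was compiled against a different version of a library than the one found at runtime. Here that would point to mismatched SLF4J-related jars on the classpath, though this is an assumption about the asker's setup rather than something the trace proves. As a rough diagnostic sketch, the following prints where a few of the classes named in the stack trace are loaded from. The class name WhichJar and the helper locate() are hypothetical names chosen for illustration; only standard JDK calls are used.

    ```java
    import java.security.CodeSource;

    public class WhichJar {
        /** Returns the location a class was loaded from, or a short note when that is unavailable. */
        public static String locate(final String className) {
            try {
                // Load without running static initializers; we only need the class metadata.
                final Class<?> cls = Class.forName(className, false, WhichJar.class.getClassLoader());
                final CodeSource source = cls.getProtectionDomain().getCodeSource();
                // Classes that ship with the JDK usually report no code source.
                return source == null
                        ? "(no code source, likely a JDK class)"
                        : String.valueOf(source.getLocation());
            } catch (final ClassNotFoundException | LinkageError e) {
                return "(not loadable: " + e + ")";
            }
        }

        public static void main(final String[] args) {
            // Class names taken from the stack trace in the question.
            final String[] names = {
                "org.slf4j.spi.LocationAwareLogger",
                "org.apache.commons.logging.impl.SLF4JLocationAwareLog",
                "org.apache.tika.Tika"
            };
            for (final String name : names) {
                System.out.println(name + " -> " + locate(name));
            }
        }
    }
    ```

    Run it with the same classpath as the failing program. If the two logging classes resolve to different jars, or to a jar other than the Tika one you expected, that is a likely place to look for the conflict.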
    
