Misplaced text boxes when reading doc and pdf files with Apache POI and Apache PDFBox

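I am trying to read and process .doc, .docx and .pdf files in Java by converting each file to a single string, using the Apache POI library (for doc and docx) and the Apache PDFBox library (for pdf).
This works fine until the document contains text boxes. If the layout is:

paragraph 1
textbox 1
paragraph 2
textbox 2
paragraph 3

then the output should be:

paragraph 1
textbox 1
paragraph 2
textbox 2
paragraph 3

but the output I get is:

paragraph 1
paragraph 2
paragraph 3
textbox 1
textbox 2

The text boxes seem to be appended at the end instead of where they belong, between the paragraphs. The problem occurs with both doc and pdf files, so both libraries, POI and PDFBox, show the same behaviour.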

The code for reading the pdf file is:

void pdf(String file) throws IOException {
    // Initialise file
    File myFile = new File(file);
    PDDocument pdDoc = null;
    try {
        // Load PDF
        pdDoc = PDDocument.load(myFile);
        // Create extractor
        PDFTextStripper pdf = new PDFTextStripper();
        // Extract text
        output = pdf.getText(pdDoc);
    } finally {
        if (pdDoc != null) {
            // Close document
            pdDoc.close();
        }
    }
}

The code for the doc file is:

void doc(String file) throws FileNotFoundException, IOException {
    // Initialise file
    File myFile = new File(file);
    // Create file input stream; try-with-resources closes it when done
    try (FileInputStream fis = new FileInputStream(myFile)) {
        // Open document
        HWPFDocument document = new HWPFDocument(fis);
        // Create extractor
        WordExtractor extractor = new WordExtractor(document);
        // Get text from document
        output = extractor.getText();
    }
}

2 Answers

For PDFBox, call pdf.setSortByPosition(true); on the PDFTextStripper before calling getText.
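
To illustrate why sorting by position helps, here is a minimal, library-free sketch. It is not PDFBox's implementation. It assumes that the text boxes are stored after the body paragraphs in the file (which would explain the output in the question) and that each piece of text carries a y coordinate that grows down the page. The class PositionSortSketch, the Fragment type and the coordinates are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PositionSortSketch {

    /** Hypothetical piece of text with the y coordinate where it is drawn (larger = lower on the page). */
    static final class Fragment {
        final String text;
        final double y;

        Fragment(String text, double y) {
            this.text = text;
            this.y = y;
        }
    }

    /** Made-up page: body paragraphs are stored first, text boxes last. */
    static List<Fragment> samplePage() {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment("paragraph 1", 100));
        page.add(new Fragment("paragraph 2", 300));
        page.add(new Fragment("paragraph 3", 500));
        page.add(new Fragment("textbox 1", 200));
        page.add(new Fragment("textbox 2", 400));
        return page;
    }

    /** Joins the fragments in the order they are stored, one per line. */
    static String inStoredOrder(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append('\n');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    /** Joins the fragments from top to bottom by sorting a copy on the y coordinate. */
    static String sortedByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));
        return inStoredOrder(sorted);
    }

    public static void main(String[] args) {
        List<Fragment> page = samplePage();
        System.out.println("Stored order:");
        System.out.println(inStoredOrder(page));
        System.out.println();
        System.out.println("Sorted by position:");
        System.out.println(sortedByPosition(page));
    }
}
```

In stored order the text boxes come out last, as in the question. After sorting on the coordinate they land between the paragraphs.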

Answered 2024-04-28T20:09:29+08:00

Try the pdf code below. You can try a similar approach for doc files as well.

void extractPdfTexts(String file) {
    File myFile = new File(file);
    String output;
    try (PDDocument pdDocument = PDDocument.load(myFile)) {
        PDFTextStripper pdfTextStripper = new PDFTextStripper();
        pdfTextStripper.setSortByPosition(true);
        output = pdfTextStripper.getText(pdDocument);
        System.out.println(output);
    } catch (InvalidPasswordException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Answered 2024-04-28T20:09:29+08:00
