从pdf中提取数据-Java 学习之路

请不要标记为重复 . 我已经通过许多Stackoverflow链接，但他们没有解决我的问题 .

What I'm trying to do : 我必须从大约1,50,000个pdf文件中提取数据 .

A sample pdf : 所有这些pdf结构相同，包含表格格式的数据（无图像） . pdf的快照看起来像这样 .

enter image description here

What I've done : 我使用 pdf2htmlEX terminal命令和 Nodejs 将pdf文件转换为html .

var child_process = require('child_process');
var request = require('request');
var spawn = child_process.spawn;

var url = 'http://url_to_extract_data_from_pdf?Id=' + id;    //id ranges from 1 to 1,50,000
var pdfFileStream = fs.createWriteStream(id + '.pdf');

request(url).pipe(pdfFileStream);

pdfFileStream.on('finish', function () {
    console.log('Pdf file downloaded');

    var pdfToHtml = spawn('pdf2htmlEX', [id + '.pdf']);

    pdfToHtml.on('close', function () {
        console.log('Pdf file converted to html');

        jsdom.env({
            url: "http://localhost:1000/" + id + ".html",    //hard coded url for server -> current server running on localhost:1000
            scripts: ["http://code.jquery.com/jquery.js"],
            done: function (err, window) {

                if(err)
                    console.log(err);

                else {
                    var $ = window.$;

                    //jquery selectors to extract data
                    console.log($(".x14.y30").text().trim());
                    console.log($(".x15.y31").text().trim());
                    console.log($(".x16.y32").text().trim());
                }
            }
        });
    });
});

Converted html file looks like this : 类名x后跟一个字符和y后跟一个字符的组合对于特定的div是唯一的 . 例如 . 只有一个div有 xf 和 y10 类 .

enter image description here

Where I'm stuck : 尽管所有pdf的格式和结构都相同，但生成的html文件却没有 . 所以说 $(".x14.y30").text() 可能会给我一些pdf - 1，它会在pdf中给出其他东西 - 2.我也找了一些方法可以修改pdf文件转换为pdf文件的方式HTML . 但一切都是徒劳的 . 然后，需要以标签分隔格式存储提取的数据 .

使用这种方法不是强制性的 . 任何更好的建议都是受欢迎的 .

从pdf中提取数据

相关问题