
Python module for converting PDF to text [closed]


Which are the best Python modules for converting PDF files to text?

13 Answers

  • 133

    Try PDFMiner. It can extract text from PDF files in HTML, SGML, or "Tagged PDF" format.

    http://www.unixuser.org/~euske/python/pdfminer/index.html

    The Tagged PDF format seems to be the cleanest, and stripping out the XML tags leaves just the bare text.
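
    As a rough illustration of what "stripping out the XML tags" could look like, here is a minimal sketch that removes anything tag-shaped with a regular expression and unescapes entities. The sample markup and the helper name strip_tags are made up for illustration; real PDFMiner tagged output may look different, and a proper XML parser is likely the safer choice for messy input.

```python
import html
import re

# Matches anything that looks like an XML/HTML tag, e.g. <P> or </Span>.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(markup):
    """Return markup with tags removed and entities unescaped (hypothetical helper)."""
    text = TAG_RE.sub("", markup)
    # Convert entities such as &amp; back to plain characters.
    text = html.unescape(text)
    # Collapse runs of whitespace left behind by the removed tags.
    return re.sub(r"\s+", " ", text).strip()


# Hand-written sample, not actual PDFMiner output.
sample = "<Document><P>Hello &amp; <Span>welcome</Span></P>\n<P>to PDF</P></Document>"
print(strip_tags(sample))
```

    A regex is enough for a quick look at the text; it will not cope with CDATA sections or a literal ">" inside attribute values.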

    A Python 3 version is available at:

  • 0

    The PDFMiner package has changed since codeape posted.

    EDIT (again):

    PDFMiner has been updated again in version 20100213.

    You can check the version you have installed with the following:

    >>> import pdfminer
    >>> pdfminer.__version__
    '20100213'
    

    Here's the updated version (with comments on what I changed/added):

    def pdf_to_csv(filename):
        from cStringIO import StringIO  #<-- added so you can copy/paste this to try it
        from pdfminer.converter import LTTextItem, TextConverter
        from pdfminer.pdfparser import PDFDocument, PDFParser
        from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    
        class CsvConverter(TextConverter):
            def __init__(self, *args, **kwargs):
                TextConverter.__init__(self, *args, **kwargs)
    
            def end_page(self, i):
                from collections import defaultdict
                lines = defaultdict(lambda : {})
                for child in self.cur_item.objs:
                    if isinstance(child, LTTextItem):
                        (_,_,x,y) = child.bbox                   #<-- changed
                        line = lines[int(-y)]
                        line[x] = child.text.encode(self.codec)  #<-- changed
    
                for y in sorted(lines.keys()):
                    line = lines[y]
                    self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                    self.outfp.write("\n")
    
        # ... the following part of the code is a remix of the 
        # convert() function in the pdfminer/tools/pdf2text module
        rsrc = PDFResourceManager()
        outfp = StringIO()
        device = CsvConverter(rsrc, outfp, codec="utf-8")  #<-- changed 
            # because my test documents are utf-8 (note: utf-8 is the default codec)
    
        doc = PDFDocument()
        fp = open(filename, 'rb')
        parser = PDFParser(fp)       #<-- changed
        parser.set_document(doc)     #<-- added
        doc.set_parser(parser)       #<-- added
        doc.initialize('')
    
        interpreter = PDFPageInterpreter(rsrc, device)
    
        for i, page in enumerate(doc.get_pages()):
            outfp.write("START PAGE %d\n" % i)
            interpreter.process_page(page)
            outfp.write("END PAGE %d\n" % i)
    
        device.close()
        fp.close()
    
        return outfp.getvalue()
    

    Edit (yet again):

    Here is an update for the latest version on PyPI, 20100619p1. In short, I replaced LTTextItem with LTChar and passed an instance of LAParams to the CsvConverter constructor.

    def pdf_to_csv(filename):
        from cStringIO import StringIO  
        from pdfminer.converter import LTChar, TextConverter    #<-- changed
        from pdfminer.layout import LAParams
        from pdfminer.pdfparser import PDFDocument, PDFParser
        from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    
        class CsvConverter(TextConverter):
            def __init__(self, *args, **kwargs):
                TextConverter.__init__(self, *args, **kwargs)
    
            def end_page(self, i):
                from collections import defaultdict
                lines = defaultdict(lambda : {})
                for child in self.cur_item.objs:
                    if isinstance(child, LTChar):               #<-- changed
                        (_,_,x,y) = child.bbox                   
                        line = lines[int(-y)]
                        line[x] = child.text.encode(self.codec)
    
                for y in sorted(lines.keys()):
                    line = lines[y]
                    self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                    self.outfp.write("\n")
    
        # ... the following part of the code is a remix of the 
        # convert() function in the pdfminer/tools/pdf2text module
        rsrc = PDFResourceManager()
        outfp = StringIO()
        device = CsvConverter(rsrc, outfp, codec="utf-8", laparams=LAParams())  #<-- changed
            # because my test documents are utf-8 (note: utf-8 is the default codec)
    
        doc = PDFDocument()
        fp = open(filename, 'rb')
        parser = PDFParser(fp)       
        parser.set_document(doc)     
        doc.set_parser(parser)       
        doc.initialize('')
    
        interpreter = PDFPageInterpreter(rsrc, device)
    
        for i, page in enumerate(doc.get_pages()):
            outfp.write("START PAGE %d\n" % i)
            if page is not None:
                interpreter.process_page(page)
            outfp.write("END PAGE %d\n" % i)
    
        device.close()
        fp.close()
    
        return outfp.getvalue()
    

    EDIT (one more time):

    Update for version 20110515 (thanks to Oeufcoque Penteano!):

    def pdf_to_csv(filename):
        from cStringIO import StringIO  
        from pdfminer.converter import LTChar, TextConverter
        from pdfminer.layout import LAParams
        from pdfminer.pdfparser import PDFDocument, PDFParser
        from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    
        class CsvConverter(TextConverter):
            def __init__(self, *args, **kwargs):
                TextConverter.__init__(self, *args, **kwargs)
    
            def end_page(self, i):
                from collections import defaultdict
                lines = defaultdict(lambda : {})
                for child in self.cur_item._objs:                #<-- changed
                    if isinstance(child, LTChar):
                        (_,_,x,y) = child.bbox                   
                        line = lines[int(-y)]
                        line[x] = child._text.encode(self.codec) #<-- changed
    
                for y in sorted(lines.keys()):
                    line = lines[y]
                    self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                    self.outfp.write("\n")
    
        # ... the following part of the code is a remix of the 
        # convert() function in the pdfminer/tools/pdf2text module
        rsrc = PDFResourceManager()
        outfp = StringIO()
        device = CsvConverter(rsrc, outfp, codec="utf-8", laparams=LAParams())
            # because my test documents are utf-8 (note: utf-8 is the default codec)
    
        doc = PDFDocument()
        fp = open(filename, 'rb')
        parser = PDFParser(fp)       
        parser.set_document(doc)     
        doc.set_parser(parser)       
        doc.initialize('')
    
        interpreter = PDFPageInterpreter(rsrc, device)
    
        for i, page in enumerate(doc.get_pages()):
            outfp.write("START PAGE %d\n" % i)
            if page is not None:
                interpreter.process_page(page)
            outfp.write("END PAGE %d\n" % i)
    
        device.close()
        fp.close()
    
        return outfp.getvalue()
    
  • 1

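    Since none of these solutions support the latest version of PDFMiner, I wrote a simple solution that returns the text of a PDF using PDFMiner. This will work for those who are getting import errors with process_pdf.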

    import sys
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage
    from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
    from pdfminer.layout import LAParams
    from cStringIO import StringIO
    
    def pdfparser(data):
    
        fp = file(data, 'rb')
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        # Create a PDF interpreter object.
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Process each page contained in the document.
    
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)

        # Read the accumulated text once, after every page has been processed.
        text = retstr.getvalue()
        device.close()
        fp.close()

        print text
    
    if __name__ == '__main__':
        pdfparser(sys.argv[1])
    

    See below for code that works with Python 3:

    import sys
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage
    from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
    from pdfminer.layout import LAParams
    import io
    
    def pdfparser(data):
    
        fp = open(data, 'rb')
        rsrcmgr = PDFResourceManager()
        retstr = io.StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        # Create a PDF interpreter object.
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Process each page contained in the document.
    
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)

        # Read the accumulated text once, after every page has been processed.
        text = retstr.getvalue()
        device.close()
        fp.close()

        print(text)
    
    if __name__ == '__main__':
        pdfparser(sys.argv[1])
    
  • 5

    pyPDF works fine (assuming that you're working with well-formed PDFs). If all you want is the text (with spaces), you can just do:

    import pyPdf
    pdf = pyPdf.PdfFileReader(open(filename, "rb"))
    for page in pdf.pages:
        print page.extractText()
    

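    You can also easily get access to the metadata, image data, and so forth.

    A comment in the extractText code notes:

    Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.

    Whether or not this is a problem depends on what you're doing with the text (e.g. if the order doesn't matter, it's fine, and if the generator adds text to the stream in the order it will be displayed, it's fine too). I have pyPdf extraction code in daily use, without any problems.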

  • 53

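    Pdftotext is an open-source program (part of Xpdf) which you can call from Python (not what you asked for, but it might be useful). I've used it with no problems. I think Google uses it in Google Desktop.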
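
    For what calling it from Python might look like, here is a minimal sketch using subprocess. It assumes a pdftotext binary is on the PATH; the helper names are hypothetical, and "example.pdf" is only a placeholder. Passing "-" as the output file sends the text to stdout, "-enc UTF-8" fixes the output encoding, and "-layout" keeps the physical layout.

```python
import subprocess


def build_pdftotext_command(pdf_path, layout=False):
    """Build the argument list for pdftotext; '-' sends the text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        # -layout asks pdftotext to preserve the physical page layout.
        cmd.append("-layout")
    cmd += [pdf_path, "-"]
    return cmd


def pdf_to_text(pdf_path, layout=False):
    """Run pdftotext and return its output as text (needs the binary on PATH)."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")


# Only the command is printed here, so the example runs without the binary.
print(build_pdftotext_command("example.pdf", layout=True))
```

    Building the command as a list (rather than a shell string) avoids quoting problems with file names that contain spaces.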

  • 8

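    You can also quite easily use pdfminer as a library. You have access to the PDF's content model, and can create your own text extraction. I did this to convert PDF contents to semicolon-separated text, using the code below.

    The function simply sorts the TextItem content objects according to their y and x coordinates, and outputs items with the same y coordinate as one text line, separating the objects on the same line with ';' characters.

    Using this approach, I was able to extract text from a PDF from which no other tool could extract content suitable for further parsing. Other tools I tried include pdftotext, ps2ascii and the online tool pdftextonline.com.

    pdfminer is an invaluable tool for PDF scraping.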

    def pdf_to_csv(filename):
        from cStringIO import StringIO  # needed for the StringIO() buffer below
        from pdflib.page import TextItem, TextConverter
        from pdflib.pdfparser import PDFDocument, PDFParser
        from pdflib.pdfinterp import PDFResourceManager, PDFPageInterpreter
    
        class CsvConverter(TextConverter):
            def __init__(self, *args, **kwargs):
                TextConverter.__init__(self, *args, **kwargs)
    
            def end_page(self, i):
                from collections import defaultdict
                lines = defaultdict(lambda : {})
                for child in self.cur_item.objs:
                    if isinstance(child, TextItem):
                        (_,_,x,y) = child.bbox
                        line = lines[int(-y)]
                        line[x] = child.text
    
                for y in sorted(lines.keys()):
                    line = lines[y]
                    self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                    self.outfp.write("\n")
    
        # ... the following part of the code is a remix of the 
        # convert() function in the pdfminer/tools/pdf2text module
        rsrc = PDFResourceManager()
        outfp = StringIO()
        device = CsvConverter(rsrc, outfp, "ascii")
    
        doc = PDFDocument()
        fp = open(filename, 'rb')
        parser = PDFParser(doc, fp)
        doc.initialize('')
    
        interpreter = PDFPageInterpreter(rsrc, device)
    
        for i, page in enumerate(doc.get_pages()):
            outfp.write("START PAGE %d\n" % i)
            interpreter.process_page(page)
            outfp.write("END PAGE %d\n" % i)
    
        device.close()
        fp.close()
    
        return outfp.getvalue()
    

    UPDATE

    The code above is written against an old version of the API; see the comments below.
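
    The row-grouping idea itself does not depend on any particular PDFMiner version. Here is a minimal sketch of it on plain tuples; the (x, y, text) items and the function name items_to_csv are stand-ins invented for illustration, not part of any PDF library.

```python
from collections import defaultdict


def items_to_csv(items, sep=";"):
    """Group (x, y, text) items into lines by integer y, then order by x.

    PDF y coordinates grow upwards, so rows are keyed and sorted by -y to
    come out top to bottom.
    """
    lines = defaultdict(dict)
    for x, y, text in items:
        lines[int(-y)][x] = text

    rows = []
    for y in sorted(lines):
        row = lines[y]
        rows.append(sep.join(row[x] for x in sorted(row)))
    return "\n".join(rows)


# Made-up coordinates: two visual rows with two cells each.
items = [
    (200.0, 700.2, "Price"),
    (50.0, 700.9, "Name"),
    (50.0, 680.0, "Widget"),
    (200.0, 680.4, "9.99"),
]
print(items_to_csv(items))
```

    Truncating y to an integer is what makes slightly different baselines land on the same line; items that straddle an integer boundary would still be split, so this is only as robust as the PDF is regular.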

  • 40

    slate is a project that makes it very simple to use PDFMiner as a library:

    >>> import slate
    >>> with open('example.pdf', 'rb') as f:
    ...    doc = slate.PDF(f)
    ...
    >>> doc
    [..., ..., ...]
    >>> doc[1]
    'Text from page 2...'
    
  • 21

    I needed to convert a specific PDF to plain text within a Python module. I used PDFMiner 20110515, and after reading through their pdf2txt.py tool I wrote this simple snippet:

    from cStringIO import StringIO
    from pdfminer.pdfinterp import PDFResourceManager, process_pdf
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    
    def to_txt(pdf_path):
        input_ = file(pdf_path, 'rb')
        output = StringIO()
    
        manager = PDFResourceManager()
        converter = TextConverter(manager, output, laparams=LAParams())
        process_pdf(manager, converter, input_)
    
        return output.getvalue()
    
  • 0

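    Repurposing the pdf2txt.py code that comes with pdfminer, you can make a function that takes a path to the PDF and, optionally, an outtype (txt|html|xml|tag) and opts like the pdf2txt command line, e.g. {'-o': '/path/to/outfile.txt' ...}. By default, you can call: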

    convert_pdf(path)
    

    This creates a text file as a sibling of the original PDF on the filesystem.

    def convert_pdf(path, outtype='txt', opts={}):
        import sys
        from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter, process_pdf
        from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter, TagExtractor
        from pdfminer.layout import LAParams
        from pdfminer.pdfparser import PDFDocument, PDFParser
        from pdfminer.pdfdevice import PDFDevice
        from pdfminer.cmapdb import CMapDB
    
        outfile = path[:-3] + outtype
        outdir = '/'.join(path.split('/')[:-1])
    
        debug = 0
        # input option
        password = ''
        pagenos = set()
        maxpages = 0
        # output option
        codec = 'utf-8'
        pageno = 1
        scale = 1
        showpageno = True
        laparams = LAParams()
        for (k, v) in opts.items():
            if k == '-d': debug += 1
            elif k == '-p': pagenos.update( int(x)-1 for x in v.split(',') )
            elif k == '-m': maxpages = int(v)
            elif k == '-P': password = v
            elif k == '-o': outfile = v
            elif k == '-n': laparams = None
            elif k == '-A': laparams.all_texts = True
            elif k == '-D': laparams.writing_mode = v
            elif k == '-M': laparams.char_margin = float(v)
            elif k == '-L': laparams.line_margin = float(v)
            elif k == '-W': laparams.word_margin = float(v)
            elif k == '-O': outdir = v
            elif k == '-t': outtype = v
            elif k == '-c': codec = v
            elif k == '-s': scale = float(v)
        #
        CMapDB.debug = debug
        PDFResourceManager.debug = debug
        PDFDocument.debug = debug
        PDFParser.debug = debug
        PDFPageInterpreter.debug = debug
        PDFDevice.debug = debug
        #
        rsrcmgr = PDFResourceManager()
        if not outtype:
            outtype = 'txt'
            if outfile:
                if outfile.endswith('.htm') or outfile.endswith('.html'):
                    outtype = 'html'
                elif outfile.endswith('.xml'):
                    outtype = 'xml'
                elif outfile.endswith('.tag'):
                    outtype = 'tag'
        if outfile:
            outfp = file(outfile, 'w')
        else:
            outfp = sys.stdout
        if outtype == 'txt':
            device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams)
        elif outtype == 'xml':
            device = XMLConverter(rsrcmgr, outfp, codec=codec, laparams=laparams, outdir=outdir)
        elif outtype == 'html':
            device = HTMLConverter(rsrcmgr, outfp, codec=codec, scale=scale, laparams=laparams, outdir=outdir)
        elif outtype == 'tag':
            device = TagExtractor(rsrcmgr, outfp, codec=codec)
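        else:
            # usage() is not defined in this function, so fail explicitly instead.
            raise ValueError('unsupported outtype: %r' % outtype)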
    
        fp = file(path, 'rb')
        process_pdf(rsrcmgr, device, fp, pagenos, maxpages=maxpages, password=password)
        fp.close()
        device.close()
    
        outfp.close()
        return
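
    One detail worth noting: the default output path is built with path[:-3] + outtype, which assumes a three-letter extension. Here is a small sketch of a more forgiving alternative using os.path.splitext; the helper name sibling_output_path is hypothetical and not part of pdf2txt.py.

```python
import os


def sibling_output_path(path, outtype="txt"):
    """Return the path next to `path` with its extension replaced by `outtype`."""
    root, _ext = os.path.splitext(path)
    return root + "." + outtype


print(sibling_output_path("/tmp/report.pdf"))
print(sibling_output_path("/tmp/report.v2.PDF", "html"))
```

    Because only the last extension is replaced, this also works for names with several dots or an upper-case ".PDF".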
    
  • 17

    There is also PDFTextStream, a commercial Java library that can also be used from Python.

  • 42

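    I have used pdftohtml with the '-xml' argument, reading the result with subprocess.Popen(). That gives you the x coordinate, y coordinate, width, height, and font of every "snippet" of text in the PDF. I think this is what 'evince' probably uses too, because the same error messages spew out.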
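
    To give an idea of how that XML can be consumed, here is a minimal sketch that parses a hand-written sample with xml.etree.ElementTree. The sample only imitates the general shape of pdftohtml -xml output (text elements with top, left, width, height and font attributes), and parse_snippets is a made-up helper; real files contain more elements and attributes.

```python
import xml.etree.ElementTree as ET

# Hand-written sample imitating the shape of `pdftohtml -xml` output.
SAMPLE = """<pdf2xml>
  <page number="1" height="1188" width="918">
    <text top="100" left="60" width="40" height="12" font="0">Name</text>
    <text top="100" left="200" width="35" height="12" font="0"><b>Price</b></text>
  </page>
</pdf2xml>"""


def parse_snippets(xml_text):
    """Return one dict per <text> element with its position, size, font and text."""
    root = ET.fromstring(xml_text)
    snippets = []
    for page in root.iter("page"):
        for node in page.iter("text"):
            snippets.append({
                "page": int(page.get("number")),
                "left": int(node.get("left")),
                "top": int(node.get("top")),
                "width": int(node.get("width")),
                "height": int(node.get("height")),
                "font": node.get("font"),
                # itertext() also picks up text nested in <b>/<i> children.
                "text": "".join(node.itertext()),
            })
    return snippets


for snippet in parse_snippets(SAMPLE):
    print(snippet["top"], snippet["left"], snippet["text"])
```

    In practice the XML string would come from the pdftohtml process rather than a constant.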

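    If you need to process columnar data, it gets slightly more complicated, as you have to invent an algorithm that suits your PDF file. The problem is that the programs that make PDF files don't necessarily lay out the text in any logical order. You can try simple sorting algorithms, and they work sometimes, but there can be a few "stragglers" and "strays": pieces of text that don't end up in the order you thought they would. So you have to get creative.

    It took me about 5 hours to figure one out for the PDFs I was working on. But it works pretty well now. Good luck.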
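
    As one example of the kind of "creative" sorting meant here (not the algorithm I actually ended up with), this sketch groups snippets into rows when their top coordinates are within a small tolerance, so a straggler that sits a pixel or two lower still lands in the right row. The (top, left, text) tuples, the tolerance value and the name group_rows are all assumptions for illustration.

```python
def group_rows(snippets, tolerance=3):
    """Group (top, left, text) snippets into rows.

    A snippet joins the current row if its top is within `tolerance` of the
    top of the first snippet in that row; otherwise it starts a new row.
    """
    rows = []
    for top, left, text in sorted(snippets):
        if rows and abs(top - rows[-1]["top"]) <= tolerance:
            rows[-1]["cells"].append((left, text))
        else:
            rows.append({"top": top, "cells": [(left, text)]})
    # Order the cells of each row from left to right.
    return [[text for _left, text in sorted(row["cells"])] for row in rows]


snippets = [
    (100, 200, "Price"),
    (102, 60, "Name"),      # a "straggler": same visual row, slightly lower
    (120, 60, "Widget"),
    (119, 200, "9.99"),
]
print(group_rows(snippets))
```

    The right tolerance depends on the font size and line spacing of the particular PDF, so expect to tune it per document.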

  • 1

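    PDFminer gave me only a single line, [page 1 of 7], for every page of a PDF file I tried it on.

    The best answer I have so far is pdftoipe, or the Xpdf code it is based on.

    For what the output of pdftoipe looks like, see my question.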

  • 127

    Found this solution today. It works great for me, and it even renders PDF pages to PNG images. http://www.swftools.org/gfx_tutorial.html
