
Extracting text from a PDF


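I have a bunch of PDF files that I need to convert to TXT. Unfortunately, when I use one of the many available utilities to do this, all formatting is lost and all the tabular data in the PDF gets jumbled. Is it possible to use Python to extract text from a PDF by specifying a position, etc.?

Thanks.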

4 Answers

  • 3

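    A PDF does not contain tabular data unless it contains structured content. Some tools include heuristics that try to guess the data structure and put it back together. I wrote a blog article explaining the problems with PDF text extraction at http://www.jpedal.org/PDFblog/2009/04/pdf-text/.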
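
To illustrate the kind of heuristic mentioned above, here is a minimal sketch that rebuilds table rows from positioned words. It assumes the words have already been extracted as `(x, y, text)` tuples in PDF coordinates; the function name `group_into_rows` and the `y_tolerance` parameter are hypothetical and not taken from any particular tool.

```python
def group_into_rows(words, y_tolerance=2.0):
    """Group (x, y, text) tuples into rows, ordered top to bottom.

    A word joins the current row when its y coordinate is within
    y_tolerance of the first word placed in that row.
    """
    rows = []  # each entry: [reference_y, [(x, text), ...]]
    # PDF y coordinates grow upward, so descending y means top to bottom.
    for x, y, text in sorted(words, key=lambda w: -w[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])
    # Within each row, order the cells from left to right.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


words = [
    (200, 700.5, "Price"),
    (50, 700.0, "Item"),
    (50, 680.0, "Apple"),
    (200, 679.2, "1.20"),
]
print(group_into_rows(words))
```

A guess like this is likely to break on rotated text, multi-line cells, or rows that sit closer together than the tolerance, which is why extraction results vary so much between documents.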

  • 1

    I ran into a similar problem and ended up using XPDF from _1408719. One of its tools is PDFtoText, but I guess it all comes down to how the PDF was produced.

  • 0
    $ pdftotext -layout thingwithtablesinit.pdf
    

    will produce a text file, thingwithtablesinit.txt, with the tables laid out correctly.
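
If you want to drive this from Python, a rough sketch could look like the following. It assumes the `pdftotext` binary is on your PATH; the helper names (`build_pdftotext_command`, `convert`, `split_columns`) are made up for illustration. Splitting on runs of two or more spaces is only a guess at column boundaries and may fail when cells contain wide gaps or columns sit one space apart.

```python
import re
import subprocess


def build_pdftotext_command(pdf_path, txt_path=None, layout=True):
    """Build the argument list for a pdftotext call."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # keep the physical layout of the page
    cmd.append(pdf_path)
    if txt_path is not None:
        cmd.append(txt_path)  # without it, pdftotext writes next to the PDF
    return cmd


def convert(pdf_path, txt_path=None):
    """Run pdftotext; raises CalledProcessError if the tool fails."""
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)


def split_columns(line):
    """Split one layout-preserved line on runs of two or more spaces."""
    return re.split(r"\s{2,}", line.strip())


print(build_pdftotext_command("thingwithtablesinit.pdf"))
print(split_columns("  Apple      1.20   3 "))
```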

  • 1

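    As noted in the other answers, extracting text from a PDF is not a trivial task. However, there are Python libraries such as pdfminer (pdfminer3k for Python 3) that are reasonably efficient.

    The code snippet below shows a Python class that can be instantiated to extract text from a PDF. It works in most cases.

    (Source: https://gist.github.com/vinovator/a46341c77273760aa2bb)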

    # Python 2.7.6
    # PdfAdapter.py
    
    """ Reusable library to extract text from pdf file
    Uses pdfminer library; For Python 3.x use pdfminer3k module
    Below links have useful information on components of the program
    https://euske.github.io/pdfminer/programming.html
    http://denis.papathanasiou.org/posts/2010.08.04.post.html
    """
    
    
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    # From PDFInterpreter import both PDFResourceManager and PDFPageInterpreter
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    # from pdfminer.pdfdevice import PDFDevice
    # To raise exception whenever text extraction from PDF is not allowed
    from pdfminer.pdfpage import PDFTextExtractionNotAllowed
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine
    from pdfminer.converter import PDFPageAggregator
    import logging
    
    __doc__ = "Reusable library to extract text from pdf file"
    __name__ = "pdfAdapter"
    
    """ Basic logging config
    """
    log = logging.getLogger(__name__)
    log.addHandler(logging.NullHandler())
    
    
    class pdf_text_extractor:
        """ Modules overview:
         - PDFParser: fetches data from pdf file
         - PDFDocument: stores data parsed by PDFParser
         - PDFPageInterpreter: processes page contents from PDFDocument
         - PDFDevice: translates processed information from PDFPageInterpreter
            to whatever you need
         - PDFResourceManager: Stores shared resources such as fonts or images
            used by both PDFPageInterpreter and PDFDevice
         - LAParams: A layout analyzer returns a LTPage object for each page in
             the PDF document
         - PDFPageAggregator: Extend the device to a page aggregator to get LT
             object elements
        """
    
        def __init__(self, pdf_file_path, password=""):
            """ Class initialization block.
            pdf_file_path - Full path of pdf including name
            password - If not passed, assumed to be empty
            """
            self.pdf_file_path = pdf_file_path
            self.password = password

        def getText(self):
            """ Algorithm:
            1) Transfer information from PDF file to PDF document object
               using parser
            2) Open the PDF file
            3) Parse the file using PDFParser object
            4) Assign the parsed content to PDFDocument object
            5) Now the information in this PDFDocument object has to be
               processed. For this we need PDFPageInterpreter, PDFDevice and
               PDFResourceManager
            6) Finally process the file page by page
            """

            # Open and read the pdf file in binary mode
            with open(self.pdf_file_path, "rb") as fp:

                # Create parser object to parse the pdf content
                parser = PDFParser(fp)

                # Store the parsed content in PDFDocument object
                document = PDFDocument(parser, self.password)

                # Check if document is extractable, if not abort
                if not document.is_extractable:
                    raise PDFTextExtractionNotAllowed

                # Create PDFResourceManager object that stores shared
                # resources such as fonts or images
                rsrcmgr = PDFResourceManager()

                # Set parameters for layout analysis
                laparams = LAParams()

                # Create a device that translates interpreted information
                # into the desired format. A plain PDFDevice(rsrcmgr) would
                # only store shared resources; PDFPageAggregator also
                # collects the LT layout objects of each page
                device = PDFPageAggregator(rsrcmgr, laparams=laparams)

                # Create interpreter object to process content from
                # PDFDocument. It needs the resource manager for shared
                # resources and the device to send results to
                interpreter = PDFPageInterpreter(rsrcmgr, device)

                # Initialize the text
                extracted_text = ""

                # Now that everything is set up, process the document
                # page by page
                for page in PDFPage.create_pages(document):
                    # The interpreter processes the page stored in the
                    # PDFDocument object
                    interpreter.process_page(page)
                    # The device returns the layout built by the interpreter
                    layout = device.get_result()
                    # Out of the many LT objects within layout, we are
                    # interested in LTTextBox and LTTextLine
                    for lt_obj in layout:
                        if isinstance(lt_obj, (LTTextBox, LTTextLine)):
                            extracted_text += lt_obj.get_text()

            # Return UTF-8 encoded bytes (this snippet targets Python 2.7)
            return extracted_text.encode("utf-8")
    

    Note: there are other libraries, such as PyPDF2, that are good at manipulating PDFs, e.g. merging PDF pages or splitting or cropping specific pages.
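
Since the question asks about extracting text by position, here is a hedged sketch of one way to do it. The layout objects that pdfminer returns carry a `bbox` attribute `(x0, y0, x1, y1)` in PDF points with the origin at the bottom-left of the page, so you can keep only the objects inside a rectangle. Note the assumptions: `extract_region` relies on `extract_pages` and `LTTextContainer` from the newer pdfminer.six package rather than the pdfminer / pdfminer3k versions used above, and both function names and the `region` tuple are invented for this example.

```python
def texts_in_region(elements, region):
    """Return the text of elements whose bbox lies fully inside region.

    region and element.bbox are (x0, y0, x1, y1) tuples in PDF points.
    Elements only need a .bbox attribute and a .get_text() method.
    """
    rx0, ry0, rx1, ry1 = region
    found = []
    for element in elements:
        x0, y0, x1, y1 = element.bbox
        if x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1:
            found.append(element.get_text())
    return found


def extract_region(pdf_path, region):
    """Collect text inside region on every page (assumes pdfminer.six)."""
    # Imported lazily so texts_in_region works without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    texts = []
    for page_layout in extract_pages(pdf_path):
        boxes = [el for el in page_layout if isinstance(el, LTTextContainer)]
        texts.extend(texts_in_region(boxes, region))
    return texts


class FakeBox:
    """Stand-in for a layout object, used only to demonstrate the filter."""

    def __init__(self, bbox, text):
        self.bbox = bbox
        self.text = text

    def get_text(self):
        return self.text


boxes = [
    FakeBox((10, 10, 50, 20), "inside\n"),
    FakeBox((10, 500, 50, 520), "outside\n"),
]
print(texts_in_region(boxes, (0, 0, 100, 100)))
```

Requiring the whole box to fit inside the region is a deliberate simplification; if text boxes straddle the boundary, you may prefer an overlap test instead.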
