Extracting text from sections of a LaTeX-generated PDF using Python

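I have PDFs of research papers written in LaTeX. Each paper has sections such as "Introduction" and "Related Work", and I want to extract the text under each section separately.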

[Image: sample PDF page with sections such as "Abstract" and "Introduction"]

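This PDF has "Abstract" and "Introduction" sections on page 1. For the "Abstract" section, I want to retrieve the text set in italics. For the "Introduction", I want all the paragraphs in that section.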

If I had the LaTeX source files, I could do some data mining and extract the text based on the \section{} keyword.
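For readers who do have the source, the \section{} idea could be sketched roughly as follows. This is only an illustration: the function name `split_latex_sections` and the sample text are made up, and the regular expression assumes section titles contain no nested braces.

```python
import re

# Matches \section{Title} and \section*{Title}.
# Assumption: titles do not contain nested braces such as \section{A {B} C}.
SECTION_RE = re.compile(r"\\section\*?\{([^}]*)\}")


def split_latex_sections(source):
    """Map each section title to the raw LaTeX that follows it."""
    matches = list(SECTION_RE.finditer(source))
    sections = {}
    for i, m in enumerate(matches):
        # A section body runs until the next \section or the end of the file.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(source)
        sections[m.group(1).strip()] = source[m.end():end].strip()
    return sections


# Hypothetical sample input, not taken from any real paper.
sample = r"""
\section{Introduction}
Some intro text.
\section{Related Work}
Prior approaches.
"""
sections = split_latex_sections(sample)
for title, body in sections.items():
    print(title, "->", body)
```

Text before the first \section (preamble, abstract) is ignored by this sketch, and the bodies still contain LaTeX markup that would need separate cleaning.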

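So I tried several approaches in Python 3, such as converting the PDF to LaTeX [link], but the suggested software was either incompatible with my system (Ubuntu 16.04) or paid. I also tried textract, but it has no option for extracting individual sections from a PDF.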
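One possible workaround, once a tool such as textract has produced plain text, is to split that text on the known section headings. The sketch below assumes the text is already available as a string and that each heading sits alone on its own line, optionally preceded by a section number; `split_by_headings`, the heading list, and the sample text are all hypothetical. Plain text carries no font information, so this cannot identify the italic abstract by style, only by its heading.

```python
import re


def split_by_headings(text, headings):
    """Return {heading: body} for headings that appear alone on a line,
    optionally preceded by a section number such as "1" or "2.3"."""
    canonical = {h.lower(): h for h in headings}
    names = "|".join(re.escape(h) for h in headings)
    pattern = re.compile(
        r"^[ \t]*(?:\d+(?:\.\d+)*\.?[ \t]+)?(" + names + r")[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )
    matches = list(pattern.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        # The body runs until the next recognised heading or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[canonical[m.group(1).lower()]] = text[m.end():end].strip()
    return sections


# Hypothetical extracted text standing in for the output of a PDF-to-text tool.
page_text = """Abstract
We study section extraction.

1 Introduction
Papers have sections.
They span paragraphs.

2 Related Work
Earlier tools exist.
"""
result = split_by_headings(page_text, ["Abstract", "Introduction", "Related Work"])
for heading, body in result.items():
    print(heading, "->", repr(body))
```

Real extractor output may merge a heading with the following line or break it across lines, in which case the pattern would need loosening.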

Does anyone know how to extract sections from a PDF produced with LaTeX?

1 Answer

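I don't know how to do this using R, but if you put all your PDF files in one folder, loop through them, and convert each one to a Word document, you can easily use VBA to do the job.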

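Sub SelectBetweenHeadings()
    ' Select the text from the heading above the cursor up to the next heading
    With Selection
        ' Jump back to the nearest preceding heading and collapse to its start
        .GoTo What:=wdGoToHeading, Which:=wdGoToPrevious
        .Collapse
        Dim curRange As Range
        Set curRange = .Range
        ' Turn on extend mode so the next GoTo grows the selection
        .Extend
        .GoTo What:=wdGoToHeading, Which:=wdGoToNext
        ' If the selection did not grow (no later heading), extend to the end of the document
        If .Range = curRange Then
            .EndKey Unit:=wdStory
        End If
        .ExtendMode = False
    End With
End Sub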

Or, to be more precise, try this.

Sub Macro1()
    ' Start searching from the top of the document
    Selection.WholeStory
    Selection.Collapse wdCollapseStart

    ' Find the heading text "2.3.1" formatted with the "Caption 1" style
    Selection.Find.ClearFormatting
    Selection.Find.Style = ActiveDocument.Styles("Caption 1")
    With Selection.Find
        .Text = "2.3.1"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
        .MatchCase = False
        .MatchWholeWord = True
    End With
    Selection.Find.Execute
    Selection.Collapse wdCollapseStart

    ' Remember where the first heading starts
    Dim r1 As Range
    Set r1 = Selection.Range

    ' keep format settings, only change text
    Selection.Find.Text = "2.3.2"
    If Selection.Find.Execute Then
        Selection.Collapse wdCollapseStart
    Else
        ' No "2.3.2" heading found: fall back to the end of the document
        Selection.WholeStory
        Selection.Collapse wdCollapseEnd
    End If
    ' Select everything from the start of "2.3.1" up to that point
    Dim r2 As Range
    Set r2 = ActiveDocument.Range(r1.Start, Selection.Start)
    r2.Select
End Sub

Answered on 2024-05-03T22:02:52+08:00
