How to extract tables from a PDF into a usable tibble with R

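I'm trying to extract tables from a .pdf file using R. I used the tabulizer package to extract the tables into one large list. I'd like to take this two steps further by cleaning up the tables (they are all different) and putting them into a tibble (or data.frame).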

# In case you don't have the tabulizer package, the lines below are needed
install.packages("rJava")
library(rJava) # load and attach 'rJava' now
install.packages("devtools")
devtools::install_github("ropensci/tabulizer", args="--no-multiarch")
library(tabulizer)

#set path to file
file <- "https://www.sdccu.com/CURates/HomeLoanRates.pdf"

#extract tables
mortgagerates <- extract_tables(file, encoding = "UTF-8")

#first table from the third page
mortgagerates[[7]]

Here is the output of the last line of code:

> mortgagerates[[7]]
  [,1]                                                                                                                  
 [1,] "ADJUSTABLE RATE MORTGAGES: JUMBO LOANS $453,101 TO $1,500,000 
(Purchase or Refinance)"                               
 [2,] "Available for all counties:"                                                                                         
 [3,] " Purchases or refinances up to 95% LTV with a maximum loan amount of 
$679,650.  Cash-out refinances up to 70% LTV."
 [4,] ""                                                                                                                    
 [5,] " Purchases or refinances up to 80% LTV with a maximum loan amount of 
$1,500,000."                                   
 [6,] "Annual Percentage Loans Amortized Over 30 Years. Rate Rate (APR) 
Points Per $1,000 Borrowed Estimated Payment"       
 [7,] "5/1 CMT 3.500% 4.394% 0.000 $4.49"                                                                                   
 [8,] "7/1 CMT 3.750% 4.358% 0.000 $4.63"                                                                                   
 [9,] "3.500% 4.322% 1.000 $4.49"

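What is the best way to get this to resemble what is in the actual PDF document? Here is an image of the table I would like to end up with:

[Image: the target table as it appears in the PDF]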

Update: here is the output of dput(mortgagerates[7]):

> file
  [,1]                                                                                                                  
 [1,] "ADJUSTABLE RATE MORTGAGES: JUMBO LOANS $453,101 TO $1,500,000 
(Purchase or Refinance)"                               
 [2,] "Available for all counties:"                                                                                         
 [3,] " Purchases or refinances up to 95% LTV with a maximum loan amount of 
 $679,650.  Cash-out refinances up to 70% LTV."
 [4,] ""                                                                                                                    
 [5,] " Purchases or refinances up to 80% LTV with a maximum loan amount of 
 $1,500,000."                                   
 [6,] "Annual Percentage Loans Amortized Over 30 Years. Rate Rate (APR) 
Points 
Per $1,000 Borrowed Estimated Payment"       
 [7,] "5/1 CMT 3.500% 4.394% 0.000 $4.49"                                                                                   
 [8,] "7/1 CMT 3.750% 4.358% 0.000 $4.63"                                                                                   
 [9,] "3.500% 4.322% 1.000 $4.49"

1 Answer

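    The table layout in this file is too complex to be extracted automatically without further input. The way to handle it with tabulizer is to supply the area that contains the table. For this particular table, you can do the following: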

    file <- "https://www.sdccu.com/CURates/HomeLoanRates.pdf"
    area <- locate_areas(file, pages = 3)
    area
    # [[1]]
    #       top      left    bottom     right 
    # 442.20975  30.50972 549.83752 592.01857
    mortgagerates <- extract_tables(file, pages = 3, area = area, guess = FALSE)
    

    This gives:

    > as.data.frame(mortgagerates[[1]])
                                                         V1         V2 V3     V4                                    V5
    1 Annual Percentage Loans Amortized Over 30 Years. Rate Rate (APR)    Points Estimated Payment Per $1,000 Borrowed
    2                                        5/1 CMT 3.625%     4.439%     0.000                                 $4.56
    3                                        7/1 CMT 3.875%     4.417%     0.000                                 $4.70
    4                                                3.625%     4.381%     1.000                                 $4.56
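
To get from that character matrix to the typed table the question asks for, something along the following lines may work. This is only a sketch in base R: the matrix `raw` is retyped by hand from the printed output above, and the way the cells fall into five columns (with an empty third column) is an assumption read off that printout. The names `raw`, `body`, `to_number`, `rates`, and the column names are made up for illustration.

```r
# Character matrix retyped from the printed output above.
# ASSUMPTION: the exact split of cells across the five columns.
raw <- matrix(
  c("Annual Percentage Loans Amortized Over 30 Years. Rate", "Rate (APR)", "",
    "Points", "Estimated Payment Per $1,000 Borrowed",
    "5/1 CMT 3.625%", "4.439%", "", "0.000", "$4.56",
    "7/1 CMT 3.875%", "4.417%", "", "0.000", "$4.70",
    "3.625%",         "4.381%", "", "1.000", "$4.56"),
  ncol = 5, byrow = TRUE
)

# Drop the header row; its text is too mangled to reuse as column names.
body <- raw[-1, , drop = FALSE]

# Drop columns that are empty in every data row.
body <- body[, colSums(body != "") > 0, drop = FALSE]

# The first column mixes a product label with the rate, so split it.
first    <- body[, 1]
rate_txt <- sub("^.*?([0-9.]+%)$", "\\1", first, perl = TRUE)
product  <- trimws(sub("[0-9.]+%$", "", first))
product[product == ""] <- NA   # rows without a label stay missing

# Strip %, $ and thousands separators before converting to numeric.
to_number <- function(x) as.numeric(gsub("[%$,]", "", x))

rates <- data.frame(
  product          = product,
  rate_pct         = to_number(rate_txt),
  apr_pct          = to_number(body[, 2]),
  points           = to_number(body[, 3]),
  payment_per_1000 = to_number(body[, 4]),
  stringsAsFactors = FALSE
)

print(rates)
```

If the tibble package is installed, `tibble::as_tibble(rates)` converts the result to a tibble. The last row has no product label in the extracted text, so it is left as `NA` here rather than guessed. Since every table in the PDF is laid out differently, the column splitting would likely need adjusting per table.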
    
