Convert a table in HTML to a data frame

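I'm trying to scrape a table from Wikipedia and I've hit a dead end. I'm using the squads of the 2014 FIFA World Cup as an example. In this case, I want to extract the list of participating countries from the table of contents of the page "2014 FIFA World Cup squads" and store it as a vector. Here is how far I've got: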

library(tidyverse)
library(rvest)
library(XML)
library(RCurl)

(Countries <- read_html("https://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads") %>% 
  html_node(xpath = '//*[@id="toc"]/ul') %>% 
  htmlTreeParse() %>%
  xmlRoot())

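This spits out a pile of HTML that I won't copy/paste here. Specifically, I want to extract every entry tagged <span class="toctext">, such as "Group A", "Brazil", "Cameroon", etc., and save them as a vector. What function would make this happen?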

1 Answer

You can use html_text() to read the text from a node:

url <- "https://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads"
toc <- url %>%
    read_html() %>%
    html_node(xpath = '//*[@id="toc"]') %>%
    html_text()

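This gives you a single character string (a character vector of length one). You can then split it on the \n character to get the entries as a vector, and drop the empty strings: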

contents <- strsplit(toc, "\n")[[1]]

contents[contents != ""]

# [1] "Contents"                                   "1 Group A"                                  "1.1 Brazil"                                
# [4] "1.2 Cameroon"                               "1.3 Croatia"                                "1.4 Mexico"                                
# [7] "2 Group B"                                  "2.1 Australia"                              "2.2 Chile"                                 
# [10] "2.3 Netherlands"                            "2.4 Spain"                                  "3 Group C"                                 
# [13] "3.1 Colombia"                               "3.2 Greece"                                 "3.3 Ivory Coast"                           
# [16] "3.4 Japan"                                  "4 Group D"                                  "4.1 Costa Rica"                            
# [19] "4.2 England"                                "4.3 Italy"                                  "4.4 Uruguay"                               
# ---
# etc
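
As an alternative to splitting the text, the <span class="toctext"> nodes mentioned in the question can be selected directly, which also leaves out the section numbers. The following is a minimal sketch, assuming the table of contents is marked up roughly like the hypothetical, cut-down snippet in toc_html (it is not the live Wikipedia page, and the names toc_html, entries and countries are made up for illustration):

```r
library(rvest)

# Hypothetical, cut-down stand-in for the table-of-contents markup (assumption)
toc_html <- '
<div id="toc">
  <ul>
    <li><a href="#Group_A"><span class="tocnumber">1</span> <span class="toctext">Group A</span></a>
      <ul>
        <li><a href="#Brazil"><span class="tocnumber">1.1</span> <span class="toctext">Brazil</span></a></li>
        <li><a href="#Cameroon"><span class="tocnumber">1.2</span> <span class="toctext">Cameroon</span></a></li>
      </ul>
    </li>
  </ul>
</div>'

# Select every span with class "toctext" and keep only its text
entries <- read_html(toc_html) %>%
  html_nodes(css = "span.toctext") %>%
  html_text()

# Drop the "Group X" headings so that only country names remain
countries <- entries[!grepl("^Group ", entries)]
```

With the real page, the same selector could be applied to the result of read_html(url), provided the page still uses the toctext class; any other sections in the table of contents would need filtering in the same way as the group headings.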

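In general, to read the tables in an HTML document you can use the html_table() function. In this case, however, it will not read the table of contents, because the table of contents is marked up as a list (the <ul> in the question's XPath) rather than as a <table> element.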

url %>% 
    read_html() %>%
    html_table()
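
To show what html_table() does with genuine <table> markup, here is a small sketch that parses an inline table and pulls out the first data frame. The HTML in squad_html and the player names are hypothetical placeholders, not data from the Wikipedia page:

```r
library(rvest)

# Hypothetical squad table used only for illustration (assumption)
squad_html <- '
<table>
  <tr><th>No</th><th>Pos</th><th>Player</th></tr>
  <tr><td>1</td><td>GK</td><td>Player A</td></tr>
  <tr><td>2</td><td>DF</td><td>Player B</td></tr>
</table>'

# html_table() on a set of <table> nodes returns a list of data frames,
# one per table, using the header row for the column names
tables <- read_html(squad_html) %>%
  html_nodes("table") %>%
  html_table()

# Pick the first (and here only) table
squad <- tables[[1]]
```

On a real page the list usually contains several tables, so the one you want has to be picked by position or by inspecting the column names.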

Answered 2024-05-14T16:23:57+08:00
