Scraping an HTML table that contains images with the R XML package

I want to scrape HTML tables with R's XML package, in much the same way as discussed in this thread:

Scraping html tables into R data frames using the XML package

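The main difference with the data I want to extract is that I also want the text associated with the images in the HTML table. For example, the table at http://www.theplantlist.org/tpl/record/kew-422570 contains a "Confidence" column in which images show one to three stars. If I use: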

readHTMLTable("http://www.theplantlist.org/tpl/record/kew-422570")

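then the "Confidence" column of the output is blank apart from the header. Is there a way to get some form of data into this column, for example the HTML code that links to the corresponding image?

Any suggestions on how to go about this would be much appreciated!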

3 Answers

I was able to find an XPath query for the image names using SelectorGadget:

library(XML)
library(RCurl)
d = htmlParse(getURL("http://www.theplantlist.org/tpl/record/kew-422570"))
path = '//*[contains(concat( " ", @class, " " ), concat( " ", "synonyms", " " ))]//img'

xpathSApply(d, path, xmlAttrs)["src",]

[1] "/img/H.png" "/img/L.png" "/img/H.png" "/img/H.png" "/img/H.png"
[6] "/img/H.png" "/img/H.png"
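
The `["src", ]` indexing above relies on `xpathSApply` simplifying the attribute vectors into a matrix, which only happens when every `<img>` carries the same number of attributes. A minimal sketch of a variant that requests `src` directly with `xmlGetAttr` and derives a level code from the file name is given here. The inline HTML is a hypothetical miniature of the page's table (so the example runs offline), and the shortened XPath assumes the `synonyms` class sits on the `<table>` element.

```r
library(XML)

# Hypothetical stand-in for the real page, so the sketch runs without network access
html_txt <- '<table class="synonyms">
  <tr><th>Name</th><th>Confidence</th></tr>
  <tr><td>Elymus arenarius L.</td><td><img src="/img/H.png" alt="H"/></td></tr>
  <tr><td>Elymus geniculatus Curtis</td><td><img src="/img/L.png" alt="L"/></td></tr>
</table>'

doc <- htmlParse(html_txt, asText = TRUE)

# Ask for the src attribute only; extra arguments are passed on to xmlGetAttr
src <- xpathSApply(doc, '//table[contains(@class, "synonyms")]//img',
                   xmlGetAttr, "src")

# Strip the directory and the extension: "/img/H.png" becomes "H"
level <- sub("\\.png$", "", basename(src))
print(level)
```

For the live page, replacing `htmlParse(html_txt, asText = TRUE)` with the `htmlParse(getURL(...))` call from the answer should give the same kind of vector, provided the markup is still as described.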

Answered 2024-05-10T04:09:01+08:00

Here is an rvest solution with a simpler CSS selector:

library(rvest)

# read_html() replaces html(), which is deprecated in newer rvest releases
pg <- read_html("http://www.theplantlist.org/tpl/record/kew-422570")
pg %>% html_nodes("td > img") %>% html_attr("src")

## [1] "/img/H.png" "/img/L.png" "/img/H.png" "/img/H.png" "/img/H.png"
## [6] "/img/H.png" "/img/H.png"
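
To get the values back into the table rather than as a loose vector, one possible approach is to parse the table with `html_table()` and then overwrite the empty column. This is only a sketch: the inline HTML is a hypothetical miniature of the page, the `table.synonyms` selector and the `Confidence` column name are assumptions, and the assignment only lines up if every data row holds exactly one image.

```r
library(rvest)

# Hypothetical stand-in for the real page, so the sketch runs without network access
html_txt <- '<table class="synonyms">
  <tr><th>Name</th><th>Confidence</th></tr>
  <tr><td>Elymus arenarius L.</td><td><img src="/img/H.png" alt="H"/></td></tr>
  <tr><td>Elymus geniculatus Curtis</td><td><img src="/img/L.png" alt="L"/></td></tr>
</table>'

pg <- read_html(html_txt)
tbl_node <- html_node(pg, "table.synonyms")

# The image-only cells come back empty from html_table()
tab <- html_table(tbl_node)

# Fill the column from the alt text of each image, in document order
tab$Confidence <- html_attr(html_nodes(tbl_node, "td > img"), "alt")
print(tab)
```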

Answered 2024-05-10T04:09:01+08:00

You can also extract the attribute with the elFun argument, following Section 5.2.2.1 of the XML book (I had to add ... to avoid an unused-argument error):

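getCL <- function(node, ...) {
  # For a <td> that holds an <img>, return the image's alt text
  if (xmlName(node) == "td" && !is.null(node[["img"]]))
    xmlGetAttr(node[["img"]], "alt")
  else
    xmlValue(node)  # otherwise fall back to the cell's text content
}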

url <- "http://www.theplantlist.org/tpl/record/kew-422570"
readHTMLTable(url, which=1, elFun = getCL)

                                                Name  Status Confi-dence level Source
1                                Elymus arenarius L. Synonym                 H   WCSP
2 Elymus arenarius subsp. geniculatus (Curtis) Husn. Synonym                 L    TRO
3                Elymus geniculatus Curtis [Invalid] Synonym                 H   WCSP
4              Frumentum arenarium (L.) E.H.L.Krause Synonym                 H   WCSP
5                       Hordeum arenarium (L.) Asch. Synonym                 H   WCSP
6                            Hordeum villosum Moench Synonym                 H   WCSP
7                    Triticum arenarium (L.) F.Herm. Synonym                 H   WCSP
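
Once the column holds codes like these, it may help to treat them as ordered levels rather than plain strings. The sketch below uses base R only and copies the seven values from the output above; it assumes the codes stand for low, medium and high confidence and that an `M` level exists on the site even though it does not occur in this table.

```r
# Values copied from the "Confi-dence level" column shown above
conf <- c("H", "L", "H", "H", "H", "H", "H")

# Assumed ordering: L < M < H (M is not present in this particular table)
conf_f <- factor(conf, levels = c("L", "M", "H"), ordered = TRUE)

# Count rows per level, then flag rows at medium confidence or better
print(table(conf_f))
at_least_medium <- conf_f >= "M"
print(sum(at_least_medium))
```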

Answered 2024-05-10T04:09:01+08:00
