
Web scraping across multiple pages with R


How can I scrape 70 pages of HTML data? I have been looking at this question, but I am stuck on the function in the general approach section.

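# Attempt 1

library(purrr)
library(rvest)  # provides read_html(), html_nodes() and html_text()

# Note: url_base contains no %d placeholder, so sprintf(url_base, i)
# returns the same URL for every i.
url_base <- "https://secure.capitalbikeshare.com/profile/trips/QNURCMF2Q6"

map_df(1:70, function(i) {
  cat(".")  # progress indicator, one dot per page
  pg <- read_html(sprintf(url_base, i))
  data.frame(
    startd   = html_text(html_nodes(pg, ".ed-table__col_trip-start-date")),
    endd     = html_text(html_nodes(pg, ".ed-table__col_trip-end-date")),
    duration = html_text(html_nodes(pg, ".ed-table__col_trip-duration"))
  )
}) -> table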



# Attempt 2 (with just one data column)

url_base <- "https://secure.capitalbikeshare.com/profile/trips/QNURCMF2Q6"

map_df(1:70, function(i) {
  # Note: `page` is never assigned in this attempt, and `i` is unused.
  page %>% html_nodes(".ed-table__item_odd") %>% html_text()
}) -> table

1 Answer

  • 0

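    @jso1226, I am not sure what is going on in the answer you referenced, so I have provided an example that is very similar to what you are trying to do.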

    Namely: go to a web page, collect the information, add it to a data frame, and then move on to the next page.
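
    That visit, collect, append, repeat pattern can be sketched in a few lines. This is only an illustration under stated assumptions: `fake_pages` and `parse_page()` are hypothetical names, and the inline HTML stands in for real pages so that the sketch runs without a network connection or a login.

    ```r
    library(rvest)

    # Hypothetical stand-in for a real site: two tiny "pages" of inline HTML.
    fake_pages <- c(
      '<table>
         <tr><td class="start">2017-01-01</td><td class="dur">10 min</td></tr>
         <tr><td class="start">2017-01-02</td><td class="dur">25 min</td></tr>
       </table>',
      '<table>
         <tr><td class="start">2017-01-03</td><td class="dur">7 min</td></tr>
       </table>'
    )

    # Collect the fields of interest from one parsed page into a data frame.
    parse_page <- function(pg) {
      data.frame(
        start    = html_text(html_nodes(pg, ".start"), trim = TRUE),
        duration = html_text(html_nodes(pg, ".dur"), trim = TRUE),
        stringsAsFactors = FALSE
      )
    }

    results <- data.frame()
    for (i in seq_along(fake_pages)) {
      pg      <- read_html(fake_pages[i])        # "go to" page i
      results <- rbind(results, parse_page(pg))  # add its rows, then move on
    }
    ```

    With a real site, `read_html()` would be given a URL (or a session would be used, when a login is required) instead of a literal HTML string.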

    I wrote this code to keep track of the answers I have posted here on Stack Overflow:

    login<-"https://stackoverflow.com/users/login?ssrc=head&returnurl=http%3a%2f%2fstackoverflow.com%2f"
    
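    library(rvest)
    # Note: rvest 1.0 renamed these helpers to session(), html_form_set(),
    # session_submit() and session_jump_to(); the older names are used here.
    pgsession<-html_session(login)
    pgform<-html_form(pgsession)[[2]]
    filled_form<-set_values(pgform, email="*****", password="*****")
    submit_form(pgsession, filled_form)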
    
    #start with an empty data frame that will hold the final results
    results<-data.frame()  
    
    for (i in 1:5)
    {
      url<-"http://stackoverflow.com/users/**********?tab=answers&sort=activity&page="
      url<-paste0(url, i)
      page<-jump_to(pgsession, url)
    
      #collect question votes and question title
      summary<-html_nodes(page, "div .answer-summary")
      question<-matrix(html_text(html_nodes(summary, "div"), trim=TRUE), ncol=2, byrow = TRUE)
    
      #find date answered, hyperlink and whether it was accepted
      dateans<-html_node(summary, "span") %>% html_attr("title")
      hyperlink<-html_node(summary, "div a") %>% html_attr("href")
      accepted<-html_node(summary, "div") %>% html_attr("class")
    
      #create temp results then bind to final results 
      rtemp<-cbind(question, dateans, accepted, hyperlink)
      results<-rbind(results, rtemp)
    }
    
    #Dataframe Clean-up
    names(results)<-c("Votes", "Answer", "Date", "Accepted", "HyperLink")
    results$Votes<-as.integer(as.character(results$Votes))
    results$Accepted<-ifelse(results$Accepted=="answer-votes default", 0, 1)
    

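    In this case the loop is limited to 5 pages; you will need to change that to suit your application. I have replaced the user-specific values with ******. Hopefully this gives you some guidance.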
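
    To cover a different number of pages, one option is to generate the page URLs up front and bind the per-page results once at the end. The following is a rough sketch under assumptions: the example.com address and the `?page=%d` query parameter are placeholders (the pagination scheme of the real site is not shown here), and `page_urls()` and `scrape_pages()` are hypothetical helpers. Note that `sprintf()` only inserts the page number when the template contains a `%d` placeholder.

    ```r
    # Build one URL per page from a template containing a %d placeholder.
    page_urls <- function(template, n_pages) {
      sprintf(template, seq_len(n_pages))
    }

    # Apply a per-page scraper to every URL and bind the results once.
    # `scrape_one` must take a URL and return a data frame.
    scrape_pages <- function(urls, scrape_one) {
      pages <- lapply(urls, scrape_one)  # one data frame per page
      do.call(rbind, pages)              # single bind at the end
    }

    # Assumed URL pattern, for illustration only.
    urls <- page_urls("https://example.com/trips?page=%d", 70)

    # Stub scraper that only records the URL, to show the shape of the result.
    demo <- scrape_pages(urls[1:3], function(u) {
      data.frame(url = u, stringsAsFactors = FALSE)
    })
    ```

    In practice the stub would be replaced by a function that reads the page and extracts the wanted nodes. Binding once at the end avoids growing the data frame on every iteration, which matters more as the page count rises.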
