Question

I have seen other posts that show how to extract data from multiple webpages.

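The problem is that on my site, when I scroll down to see how many pages the data is split into, the page automatically loads the next batch of data, so I cannot tell how many pages there are. I do not have good knowledge of HTML and JavaScript, so I cannot easily identify the attribute that triggers this loading. Instead, I have worked out a way to get the page count: when the site loads in the browser, it shows the number of records found. Take that number and divide it by 30 (the number of records per page); for example, if there are 90 records, then 90/30 = 3 pages.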
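The page-count arithmetic above can be sketched as follows. This is only a minimal sketch, assuming 30 records per page as stated; `pages_needed` is a hypothetical helper name, and it rounds up so that a partially filled last page is still counted.

```r
# Hypothetical helper: number of pages needed for a given record count.
# ASSUMPTION: 30 records per page, as observed on the site.
pages_needed <- function(n_records, per_page = 30) {
  # ceiling() rounds up, so 91 records need 4 pages rather than 3
  as.integer(ceiling(n_records / per_page))
}

pages_needed(90)
pages_needed(91)
```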

Here is the code that gets the number of records found on the page:

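# Requires rvest and stringr (for word()); `webpage` is the result of read_html(url)
active_name_data1 <- html_nodes(webpage, '.active')
active1 <- html_text(active_name_data1)
# Take the first word of the first match and strip everything that is not a digit
as.numeric(gsub("[^\\d]+", "", word(active1[1], start = 1, end = 1), perl = TRUE))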
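To see what the digit-stripping step does, here is a small check on a made-up string. The sample text is an assumption, not the site's real output, and the sketch uses base R `strsplit()` in place of `stringr::word()` so that it runs without extra packages.

```r
# Made-up sample text; the real string comes from html_text() on the '.active' node
active_text <- "1,234 Properties found"

# Take the first space-separated word, as word(..., start = 1, end = 1) does
first_word <- strsplit(active_text, " ", fixed = TRUE)[[1]][1]

# Remove every non-digit character (here the thousands separator), then convert
record_count <- as.numeric(gsub("[^\\d]+", "", first_word, perl = TRUE))
record_count
```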

Another approach is to read the page numbers from the pagination attribute, i.e.

url='http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs'
webpage <- read_html(url)
active_data_html <- html_nodes(webpage,'a.act')
active <- html_text(active_data_html)

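This does give me some page numbers, "1" " 2" " 3" " 4", but I cannot work out how to get the data for the active page and then iterate over the remaining pages to collect the whole data set.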
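The labels shown above can be turned into a usable page count along these lines. This is a sketch that assumes the pagination links list every page; if the site only shows the first few links, the maximum will underestimate the real number of pages.

```r
# Sample of the labels returned by html_text() in the question
active <- c("1", " 2", " 3", " 4")

# trimws() strips the stray leading spaces before the integer conversion
page_numbers <- as.integer(trimws(active))

# The largest label is the last page that the pagination links expose
last_page <- max(page_numbers)
last_page
```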

This is what I have tried (uuu_df2 is a data frame of the links I want to scrape data from):

    library(rvest)
    library(xml2)   # xml_find_first(), xml_find_all(), xml_text()
    library(plyr)   # llply(), ldply(); loaded before dplyr to limit masking
    library(dplyr)  # bind_rows()

    # One search-results URL per element; each URL must stay on a single line
    uuu_df2 <- data.frame(x = c(
      'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=5-Lacs',
      'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs',
      'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs'),
      stringsAsFactors = FALSE)

    urlList <- llply(uuu_df2[,1], function(url){     

      this_pg <- read_html(url)

      results_count <- this_pg %>% 
        xml_find_first(".//span[@id='resultCount']") %>% 
        xml_text() %>%
        as.integer()

      if (!is.na(results_count) && results_count > 0) {

        cards <- this_pg %>% 
          xml_find_all('//div[@class="SRCard"]')

        df <- ldply(cards, .fun=function(x){
          y <- data.frame(wine = x %>% xml_find_first('.//span[@class="agentNameh"]') %>% xml_text(),
                          excerpt = x %>% xml_find_first('.//div[@class="postedOn"]') %>% xml_text(),
                          locality = x %>% xml_find_first('.//span[@class="localityFirst"]') %>% xml_text(),
                          society = x %>% xml_find_first('.//div[@class="labValu"]') %>% xml_text() %>% gsub('\\n', '', .))
          return(y)
        })

      } else {
        df <- NULL
      }

      return(df)   
    }, .progress = 'text')
    names(urlList) <- uuu_df2[,1]

    a <- bind_rows(urlList)

But this code only returns data from the active page; it does not iterate over the other pages of a given link.
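One possible way to iterate, sketched under a loud assumption: that the site accepts a page-number query parameter. The parameter name `page`, the helper `build_page_urls`, and the example URL are all hypothetical; the real name, if one exists, would have to be read from the requests the browser makes while scrolling.

```r
# Hypothetical helper: build one URL per results page.
# ASSUMPTION: the site paginates through a query parameter (called "page" here);
# this is not confirmed for the site in the question.
build_page_urls <- function(base_url, n_pages, param = "page") {
  if (n_pages < 1) return(character(0))
  # Append with "&" if the URL already has a query string, otherwise start one with "?"
  sep <- if (grepl("?", base_url, fixed = TRUE)) "&" else "?"
  paste0(base_url, sep, param, "=", seq_len(n_pages))
}

# Placeholder URL for illustration only
build_page_urls("http://example.com/list?city=Thane", 3)
```

Each generated URL could then go through the same `read_html()` and card-extraction step used for the active page, with the per-page data frames combined by `bind_rows()`.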

P.S.: If a link has no records, the code skips it and moves on to the next link in the list.

Any suggestions on what changes should be made to the code would be helpful. Thanks in advance.

Extract data from multiple webpages of a website that reloads automatically, in R

0 Answers: