Question

I am trying to scrape the location of product reviewers from the Amazon website. For example, on this page:

[https://www.amazon.com/gp/profile/amzn1.account.AH55KF4JK5IKKJ77MPOLHOR4YAQQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8][1]

I need to get "HAINESVILLE, ILLINOIS, United States".

I am using the rvest package for web scraping.

Here is what I did:

library(rvest)       
url='https://www.amazon.com/gp/profile/amzn1.account.AH55KF4JK5IKKJ77MPOLHOR4YAQQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8'
page = read_html(url)

I get the following error:

Error in open.connection(x, "rb") : HTTP error 403.

However, opening the connection explicitly works:

con <- url(url, "rb")
page = read_html(con)

Even so, I cannot extract any text from the page that was read. For example, I want to extract the reviewer's location:

page %>%
    html_nodes("#customer-profile-name-header .a-size-base a-color-base")%>%
    html_text()

I get nothing back:

character(0)

Can anyone help me figure out what I am doing wrong? Thanks in advance.

Answer 1

This should work:

library(dplyr)
library(rvest)
library(stringr)

# get url
url='https://www.amazon.com/gp/profile/amzn1.account.AH55KF4JK5IKKJ77MPOLHOR4YAQQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8'

# open page
con <- url(url, "rb")
page = read_html(con)

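# the location is not in the visible HTML; it sits inside a <script> block
# (found via View Page Source), so pull it out of the script text with a regex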
page %>%
  html_nodes(xpath=".//script[contains(., 'occupation')]")%>%
  html_text() %>% as.character() %>% str_match(.,"location\":\"(.*?)\",\"personalDescription") -> res

res[,2]
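
The same extraction can be tried offline, without hitting Amazon. The following is a minimal sketch, assuming the profile data is embedded as JSON text inside a <script> element; the HTML string and the `window.profile` name are made up for illustration and are not Amazon's actual markup.

```r
library(rvest)
library(stringr)

# Hypothetical stand-in for the profile page: the visible <div> is empty and
# the data lives as JSON text inside a <script> element.
sample_html <- '<html><body>
<div id="customer-profile-name-header"></div>
<script>window.profile = {"occupation":"Engineer","location":"HAINESVILLE, ILLINOIS, United States","personalDescription":"n/a"};</script>
</body></html>'

# read_html() treats a string containing "<" as literal HTML
page <- read_html(sample_html)

# Select the script whose text mentions 'occupation', then read its text
script_text <- page %>%
  html_nodes(xpath = ".//script[contains(., 'occupation')]") %>%
  html_text()

# Column 1 of str_match() is the full match, column 2 is the captured group
location <- str_match(script_text, '"location":"(.*?)","personalDescription"')[, 2]
location
```

The regex depends on "personalDescription" following "location" in the JSON; if the key order on the real page differs, the pattern would need adjusting.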

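As for the character(0) in the question, two things may be contributing. First, in the selector ".a-size-base a-color-base" the second token has no leading dot, so it is read as an element named a-color-base nested inside .a-size-base, not as a class. Second, the approach above suggests the location is not present in the static HTML at all, so even a corrected selector may return nothing on the real page. The selector part can be checked on made-up markup (the snippet below is illustrative only and does not reproduce Amazon's page structure):

```r
library(rvest)

# Made-up markup for illustration only
html <- read_html('<div id="hdr"><span class="a-size-base a-color-base">Somewhere, USA</span></div>')

# Missing dot: searches for an <a-color-base> element inside .a-size-base
no_dot <- html %>%
  html_nodes("#hdr .a-size-base a-color-base") %>%
  html_text()

# Both classes on the same element: chain them with dots and no space
with_dot <- html %>%
  html_nodes("#hdr .a-size-base.a-color-base") %>%
  html_text()

length(no_dot)  # 0, nothing matched
with_dot        # "Somewhere, USA"
```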
Unable to read a web page with read_html using the rvest package in R