Question

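I'm new to R and have been doing some web scraping. I wrote the following code to put the ID, name, colour and price of a specific item from https://uk.burberry.com/ into a data frame.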

while read line; do
  for token in $line; do
    case $token in
    hello) echo hello,1;;
    world) echo world,1;;
    esac
  done
done

Is there a way to create a loop so that I can use this code for every item on the website and put the results in a data frame? Thanks

Answer 1

You can create a function that takes a URL as input and returns a data frame containing the information collected from that page:

library(rvest)  # provides read_html(), html_nodes() and html_text()

get_page_data <- function(url) {
    # Read HTML code from the website
    webpage <- read_html(url)

    # using css selectors to scrape the ID section
    id_data_html <- html_nodes(webpage, '.section') 
    #converting the ID to text
    id_data <- html_text(id_data_html)
    # Remove irrelevant text
    id_data <- gsub("Item", "", id_data)

    # using css selectors to scrape the names section
    names_data_html <- html_nodes(webpage, '.type-h6') 
    #converting the names to text
    names_data <- html_text(names_data_html)
    # Stripping irrelevant text
    names_data <- gsub("\n\t\t\t\t\t\t\t", "", names_data)

    # using css selectors to scrape the price section
    price_data_html <- html_nodes(webpage, '.l2') 
    #converting the price to text
    price_data <- html_text(price_data_html)
    # Remove irrelevant text
    price_data <- gsub("\t", "", price_data)
    price_data <- gsub("\n", "", price_data)

    # using css selectors to scrape the colour section
    colour_data_html <- html_nodes(webpage, '#colour-picker-value') 
    #converting the colour to text
    colour_data <- html_text(colour_data_html)

    # creating the data frame (keep text columns as character, not factor)
    burberry_df <- data.frame(ID = id_data, Name = names_data, Price = price_data,
                              Colour = colour_data, stringsAsFactors = FALSE)

    return(burberry_df)
}

Then, to use the function, simply call it with the URL you are interested in:

url <- 'https://uk.burberry.com/fringed-wool-cashmere-patchwork-cardigan-coat-p40612561'
result <- get_page_data(url)
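
To cover the "every item" part of the question, one option is to apply the function to a vector of URLs and stack the results. The following is only a minimal sketch: `scrape_many` is a hypothetical helper name, not part of rvest, and it assumes you have already collected the product URLs yourself and that every page yields a data frame with the same columns.

```r
# Hypothetical helper (not part of rvest): run a page scraper over many URLs
# and combine the per-page data frames into one.
scrape_many <- function(urls, scraper) {
  pages <- lapply(urls, function(u) {
    # tryCatch keeps a single failing page from aborting the whole run
    tryCatch(scraper(u), error = function(e) {
      warning("Skipping ", u, ": ", conditionMessage(e), call. = FALSE)
      NULL
    })
  })
  # Drop the pages that failed, then row-bind the remaining data frames
  pages <- Filter(Negate(is.null), pages)
  do.call(rbind, pages)
}

# Usage, assuming get_page_data() from above is defined and `urls` is a
# character vector of product page URLs that you supply:
# all_items <- scrape_many(urls, get_page_data)
```

Passing the scraper in as an argument keeps the looping logic separate from the page-specific selectors, so the same helper still works if the selectors in `get_page_data` change.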

Using rvest to scrape multiple web pages in R
