Question

I am working on a web scraping project using rvest.

html_text(html_nodes(url, CSS))

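extracts the data matching CSS from url. My problem is that the website I am scraping uses a unique CSS ID for each listed product (e.g. ListItem_001_Price). One CSS selector therefore identifies the price of exactly one item, so the automated scraping does not work.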

I can create a vector

V <- c("ListItem_001_Price", "ListItem_002_Price", "ListItem_003_Price")

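that lists the CSS IDs of all products by hand. Is it possible to pass its individual elements to html_nodes() in one go and collect the resulting data back as a single vector or data frame?

How can I make this work?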

Answer 1

You could try using lapply here:

V <- c("ListItem_001_Price", "ListItem_002_Price", "ListItem_003_Price")
results <- lapply(V, function(x) html_text(html_nodes(url, x)))

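I am assuming here that url holds the parsed page and that your nested call to html_text returns, for each item in V, a character vector with the text of the matching nodes. This gives you a list of vectors, which you can then access individually. Note that html_nodes() treats a bare name as a tag name, so each entry will probably need a leading "#" (for an ID) or "." (for a class) to match anything.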
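To get from that list to the single vector or data frame the question asks for, something along these lines may work. This is only a sketch: the markup, the `page` object, and the `price_table` name are made up for illustration, and it assumes the names really are IDs, so each one is prefixed with "#".

```r
library(rvest)

# Hypothetical markup standing in for the real page; the ids mirror the question.
page <- read_html('<div>
  <span id="ListItem_001_Price">1.50</span>
  <span id="ListItem_002_Price">2.75</span>
  <span id="ListItem_003_Price">0.99</span>
</div>')

V <- c("ListItem_001_Price", "ListItem_002_Price", "ListItem_003_Price")

# Prefix each name with "#" so it is read as an id selector,
# then look up each selector in turn.
results <- lapply(paste0("#", V), function(sel) html_text(html_nodes(page, sel)))

# Flatten the list into one character vector and pair it with the ids.
prices <- unlist(results)
price_table <- data.frame(id = V, price = prices, stringsAsFactors = FALSE)
```

The data.frame() call assumes every ID matches exactly one node. If an ID is missing from the page, unlist() silently drops it and the two columns no longer line up.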

Answer 2

html_nodes() needs a leading "." to find your tags by CSS class. You can build the vector by hand

V <- c(".ListItem_001_Price", ".ListItem_002_Price", ".ListItem_003_Price")

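as you said, but I suggest using a regular expression that matches classes like 'ListItem_([0-9]{3})_Price', so that you can avoid the manual work. Make sure you run the regex on the actual markup string, not on the html node object. (See below.)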

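In R, apply(), lapply(), sapply() and friends work like a short loop. With them you can apply a function to every member of a data type that holds many values, such as a list, data frame, matrix, or vector.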

In your case it is a vector, and one way to start understanding how this works is to think of it like this:

sapply(vector, function(x) THING-TO-DO-WITH-ITEM-IN-VECTOR)
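As a concrete instance of that pattern, here is a small base R sketch with nothing rvest-specific in it. The `letters_in` and `shouted` names are invented for the illustration, and toupper() stands in for the thing to do with each item.

```r
# A character vector to loop over (illustrative values only).
letters_in <- c("a", "b", "c")

# sapply() calls the function once per element and simplifies the result.
# For a character input, the elements themselves become the names.
shouted <- sapply(letters_in, function(x) toupper(x))

# lapply() does the same work but always returns a list.
shouted_list <- lapply(letters_in, function(x) toupper(x))
```

Here `shouted` is a named character vector, while `shouted_list` is a plain list of three one-element vectors.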

In your case, the thing you want to do with each item in the vector is get the html_text that corresponds to it. See the following code for an example:

library(rvest)
# An example piece of html
example_markup <- "<ul>
<li class=\"ListItem_041_Price\">Brush</li>
<li class=\"ListItem_031_Price\">Phone</li>
<li class=\"ListItem_002_Price\">Paper clip</li>
<li class=\"ListItem_012_Price\">Bucket</li>
</ul>"
html <- read_html(example_markup)


# Avoid manual creation of css with regex
regex <- 'ListItem_([0-9]{3})_Price'
# Note that ([0-9]{3}) will match three consecutive numeric characters
price_classes <- regmatches(example_markup, gregexpr(regex, example_markup))[[1]]
# Paste leading "." so that html_nodes() can find the class:
price_classes <- paste(".", price_classes, sep="")

# A single entry is found like so:
html %>% html_nodes(".ListItem_031_Price") %>% html_text()

# Use sapply to get a named character vector of your products
# Note how ".ListItem_031_Price" from the line above is replaced by x
# which will be each item of price_classes in turn.
products <- sapply(price_classes, function(x) html %>% html_nodes(x) %>% html_text())

The result in products is a named character vector. Use unname(products) to remove the names.
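If a loop is not needed at all, a different approach is to join the selectors with commas into one CSS selector group and call html_nodes() once. This is a sketch of that alternative, not part of the answer above; the `selector` and `all_products` names are made up, and the markup repeats the example so the block runs on its own.

```r
library(rvest)

# Same example markup as above, repeated so this block is self-contained.
example_markup <- '<ul>
<li class="ListItem_041_Price">Brush</li>
<li class="ListItem_031_Price">Phone</li>
<li class="ListItem_002_Price">Paper clip</li>
<li class="ListItem_012_Price">Bucket</li>
</ul>'
html <- read_html(example_markup)

price_classes <- c(".ListItem_041_Price", ".ListItem_031_Price",
                   ".ListItem_002_Price", ".ListItem_012_Price")

# "a, b, c" is a CSS selector group: it matches nodes that satisfy any member.
selector <- paste(price_classes, collapse = ", ")

# One call returns the text of every matching node as an unnamed vector.
all_products <- html_text(html_nodes(html, selector))
```

The trade-off is that the result no longer records which selector matched which node, so the sapply() version is the better fit when the names matter.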
