I am using rvest to retrieve the titles from a Google search results page. My code looks like this:
> library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%
> url <- URLencode(paste0("https://www.google.com.au/search?q=", "600d"))
> page <- read_html(url)
> page %>%
+   html_nodes("a") %>%
+   html_text()
However, the result contains not only the titles but also other content, for example:
[24] "Past month"
[25] "Past year"
[26] "Verbatim"
[27] "EOS 600D - Canon"
[28] "Similar"
[29] "Canon 600D | BIG W"
[30] "Cached"
[31] "Similar"
......
[45] ""
[46] ""
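What I need are [27] "EOS 600D - Canon" and [29] "Canon 600D | BIG W", which are the result titles displayed on the Google results page.
Everything else is just noise to me. Can anyone tell me how to get rid of it?
Also, if I want the description part as well, what should I do?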
Answer 0 (score: 2)
To get the titles, select <h3> instead of <a> (which matches every link on the page):
page %>%
html_nodes("h3") %>%
html_text()
[1] "EOS 600D - Canon"
[2] "Canon EOS 600D - Wikipedia"
[3] "Canon 600D | BIG W"
[4] "Canon EOS 600D Digital SLR Camera with 18-55mm IS Lens Kit ..."
[5] "Canon Rebel T3i / EOS 600D Review: Digital Photography Review"
[6] "Canon EOS 600D review - CNET"
[7] "canon eos 600d | Cameras | Gumtree Australia Free Local Classifieds"
[8] "Images for 600d"
[9] "Canon 600D - Snapsort"
[10] "Canon EOS 600D - Georges Cameras"
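The difference between the two selectors can be reproduced offline. The following is only a minimal sketch on a made-up HTML fragment, not Google's real markup: the `result` and `desc` class names are assumptions invented for this example, and the selector for the description part on the live page would have to be found by inspecting that page, since its markup changes over time.

```r
library(rvest)

# Hypothetical, simplified stand-in for a results page.
# The class names "result" and "desc" are invented for illustration only.
html <- '<html><body>
  <a href="#">Past year</a>
  <a href="#">Verbatim</a>
  <div class="result">
    <h3><a href="https://example.com/canon">EOS 600D - Canon</a></h3>
    <a href="#">Similar</a>
    <p class="desc">First description.</p>
  </div>
  <div class="result">
    <h3><a href="https://example.com/bigw">Canon 600D | BIG W</a></h3>
    <a href="#">Cached</a>
    <p class="desc">Second description.</p>
  </div>
</body></html>'

page <- read_html(html)

# Selecting "a" returns every link, including the noise
all_links <- page %>% html_nodes("a") %>% html_text()

# Selecting "h3" returns only the result titles
titles <- page %>% html_nodes("h3") %>% html_text()

# Descriptions, using the assumed "desc" class from this fragment
descriptions <- page %>% html_nodes("div.result p.desc") %>% html_text()

print(all_links)
print(titles)
print(descriptions)
```

On this fragment, `all_links` mixes the two titles with entries such as "Similar" and "Cached", while `titles` holds only the two headings. The same idea applies to the description part: pick the element that wraps only the text you want rather than a tag that is used everywhere on the page.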