I want to scrape this web page using R and rvest. I want to extract the 50 words in this format:
So far, this is all I have managed:
With all 50 words, this is the only output I can get:
Can anyone help me solve this?
Best regards,
Suman
Answer 0 (score: 1)
I converted the just_words list to a data frame and then used separate() from the tidyr package to split the column.
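library(rvest)
library(dplyr)
library(stringr)
library(tidyr)

# Download the page and pull the text of the ordered list
words <- read_html("https://www.education.com/magazine/article/Ed_50_Words_SAT_Loves_2/")
just_words <- words %>% html_nodes("ol") %>% html_text()

# Split the text into one entry per list item and store it in column V1
x <- as.data.frame(strsplit(just_words, "\r\n\t"), col.names = "V1")
head(x)

# Split each entry once: the first token is the word, the rest is the meaning
t <- x %>% separate(V1, c("Word", "Meaning"), extra = "merge", fill = "left")
head(t)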
Output:
> head(t)
Word Meaning
1 abstract not concrete
2 aesthetic having to do with the appreciation of beauty
3 alleviate to ease a pain or a burden
4 ambivalent simultaneously feeling opposing feelings; uncertain
5 apathetic feeling or showing little emotion
6 auspicious favorable; promising
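To see what the extra and fill arguments of separate() do without hitting the website, here is a small sketch on made-up rows. The toy data frame and the explicit sep = " " are my own assumptions for illustration; the answer above relies on the default separator instead.

```r
library(tidyr)

# Toy data standing in for the scraped list items (not taken from the page)
toy <- data.frame(
  V1 = c("abstract not concrete",
         "alleviate to ease a pain or a burden",
         "orphan"),
  stringsAsFactors = FALSE
)

# extra = "merge" stops after the first split, so the rest of the line
# stays together in Meaning.
# fill = "left" puts NA in Word when a row has only one piece.
toy_split <- separate(toy, V1, into = c("Word", "Meaning"),
                      sep = " ", extra = "merge", fill = "left")
print(toy_split)
```

The last row shows why fill = "left" matters: an entry with no separator ends up in Meaning, with NA in Word, instead of triggering a warning.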
If you want more nicely formatted output, use the pander package. The output looks like this:
> library(pander)
> pander(head(t))
---------------------------------------
Word       Meaning
---------- ----------------------------
abstract   not concrete
aesthetic  having to do with the
           appreciation of beauty
alleviate  to ease a pain or a burden
ambivalent simultaneously feeling
           opposing feelings; uncertain
apathetic  feeling or showing little
           emotion
auspicious favorable; promising
---------------------------------------
To remove the line breaks and surrounding whitespace:
t <- t %>% mutate(Meaning = trimws(gsub("[\r\n]", "", Meaning)))
tail(t)
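If the scraped text also has line breaks or repeated spaces in the middle of a definition, a variation in base R is to collapse every run of whitespace into one space before trimming. This is only a sketch; the raw strings below are hypothetical and not taken from the page.

```r
# Hypothetical strings with the kind of stray whitespace scraping leaves behind
raw <- c("not concrete\r\n", "  to ease a pain\n or a burden ")

# Replace each run of whitespace (spaces, tabs, \r, \n) with a single space,
# then trim the leading and trailing space
clean <- trimws(gsub("\\s+", " ", raw))
print(clean)
```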
Answer 1 (score: 1)
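If you inspect the page (right-click and choose Inspect in Chrome), you will see that each word is in bold: it sits in a strong child tag inside the same li node as its meaning. So it should be possible to extract the two items separately.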
library(rvest)

# Select every list item inside the article body
words <- read_html("https://www.education.com/magazine/article/Ed_50_Words_SAT_Loves_2/") %>%
  html_nodes("#article-detail3 li")

# The word is in <strong>; the meaning is the text directly inside <li>
data.frame(words   = words %>% html_node("strong") %>% html_text(),
           meaning = words %>% html_node(xpath = "./text()") %>% html_text())
xpath = "./text()" specifies that you want only the text of the parent node itself, not the text of its child nodes.
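The difference is easy to check offline on a small inline snippet. This is a sketch under the assumption that the page's list items look like the hypothetical markup below (a strong tag followed by plain text); I have not copied the real page source.

```r
library(rvest)

# Hypothetical markup mimicking the structure described above
page <- read_html("<ol><li><strong>abstract</strong> not concrete</li><li><strong>alleviate</strong> to ease a pain or a burden</li></ol>")

items <- html_nodes(page, "li")

# html_text() on the <li> returns the word and the meaning run together
full <- html_text(items)

# The <strong> child holds only the word
word <- html_text(html_node(items, "strong"))

# "./text()" selects only text nodes that are direct children of each <li>
meaning <- trimws(html_text(html_node(items, xpath = "./text()")))

print(data.frame(word = word, meaning = meaning))
```

The trimws() call is there because the text node keeps the space that follows the closing strong tag.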