Question

I want to scrape this web page using R and rvest. I want to extract the 50 words in this format:

So far, this is all I have been able to do:

curl_setopt($curl, CURLOPT_COOKIEFILE, 'cookies.txt');
curl_setopt($curl, CURLOPT_COOKIEJAR, 'cookies.txt');

The only output I can get has all 50 words together:

Can anyone help me solve this?

Best regards,

Suman

Answer 1

I converted the just_words list to a data frame, then used separate from the tidyr package to split the column.

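library(rvest)
library(dplyr)
library(stringr)
library(tidyr)
# Download the page and pull the text of the ordered list
words<-read_html("https://www.education.com/magazine/article/Ed_50_Words_SAT_Loves_2/")
just_words<-words %>% html_nodes("ol") %>% html_text()
# Split the list text into one entry per item on the "\r\n\t" sequence
x <- as.data.frame(strsplit(just_words,"\r\n\t"), col.names = "V1")
head(x)
# Split each entry at the first run of non-alphanumeric characters (separate's default sep);
# extra = "merge" keeps the rest of the text together in Meaning
t <- x %>% separate(V1, c("Word", "Meaning"), extra = "merge", fill = "left")
head(t)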

Output:

> head(t)
        Word                                             Meaning
1   abstract                                        not concrete
2  aesthetic        having to do with the appreciation of beauty
3  alleviate                          to ease a pain or a burden
4 ambivalent simultaneously feeling opposing feelings; uncertain
5  apathetic                   feeling or showing little emotion
6 auspicious                                favorable; promising
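
To see what extra = "merge" and fill = "left" do in the separate call, here is a minimal sketch on made-up strings. The sample rows and the " - " separator are assumptions for illustration, not the actual text of the page.

```r
library(tidyr)

# Made-up sample rows; the real strings come from the scraped page.
x <- data.frame(
  V1 = c("abstract - not concrete",
         "auspicious - favorable; promising",
         "orphan"),
  stringsAsFactors = FALSE
)

# The default sep is any run of non-alphanumeric characters.
# extra = "merge" splits only once, so later separators stay in Meaning.
# fill = "left" puts NA in Word when a row has no separator at all.
out <- separate(x, V1, c("Word", "Meaning"), extra = "merge", fill = "left")
print(out)
```

Without extra = "merge", the second row would be cut again at "; " and the remainder dropped with a warning.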

If you are looking for more nicely formatted output, use the pander package. The output looks like this:

> library(pander)
> pander(head(t))

---------------------------------------
   Word              Meaning           
---------- ----------------------------
 abstract          not concrete        

aesthetic     having to do with the    
              appreciation of beauty   

alleviate   to ease a pain or a burden 

ambivalent    simultaneously feeling   
           opposing feelings; uncertain

apathetic   feeling or showing little  
                     emotion           

auspicious     favorable; promising    
---------------------------------------

Remove line breaks and surrounding whitespace:

t <- t %>% mutate(Meaning = trimws(gsub("[\r\n]", "", Meaning)))
tail(t)
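
The cleanup step can be checked on its own with base R. This is a small sketch on made-up strings (the values are assumptions, not taken from the page): gsub strips carriage returns and newlines, and trimws removes any leftover leading or trailing spaces and tabs.

```r
# Made-up strings with the kind of debris scraped text often carries.
meaning <- c("not concrete\r\n", "  favorable; promising \r\n\t")

# Drop carriage returns and newlines anywhere in the string.
no_breaks <- gsub("[\r\n]", "", meaning)

# Trim remaining whitespace (spaces, tabs) from both ends.
cleaned <- trimws(no_breaks)
print(cleaned)
```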

Answer 2

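If you inspect the page (right-click and choose Inspect in Chrome), you will see that each word is shown in bold: it is wrapped in a strong child tag inside the same li node as its meaning. So it should be possible to get the two items separately.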

library(rvest)
# Select every list item inside the article container
words <-read_html("https://www.education.com/magazine/article/Ed_50_Words_SAT_Loves_2/") %>% 
  html_nodes("#article-detail3 li")

# The word is the text of the strong child; the meaning is the li's own text node
data.frame( words = words %>% html_node("strong") %>% html_text(),
            meaning = words %>% html_node(xpath="./text()") %>% html_text())

xpath="./text()" specifies that you want only the text that belongs directly to the parent node, not the text of its child nodes.
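
The difference is easy to see on a tiny inline document. The HTML below is a made-up stand-in for the page's list, so the markup and the values are assumptions; only the rvest calls match the answer.

```r
library(rvest)

# Made-up HTML that mimics the structure described above.
doc <- read_html(paste0(
  "<ol>",
  "<li><strong>abstract</strong> not concrete</li>",
  "<li><strong>aesthetic</strong> having to do with the appreciation of beauty</li>",
  "</ol>"
))

items <- doc %>% html_nodes("li")

# Text of the strong child only.
word <- items %>% html_node("strong") %>% html_text()

# First text node that is a direct child of each li (excludes the strong text).
meaning <- items %>% html_node(xpath = "./text()") %>% html_text() %>% trimws()

# For comparison: html_text on the li itself includes the child text too.
full <- items %>% html_text()

print(data.frame(word, meaning, full))
```

Note that html_node returns only the first matching text node per li, so an item whose meaning is split around another inline tag would need html_nodes and some pasting instead.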

Web scraping with R
