Problem extracting text from a link

Asked: 2019-04-16 03:20:54

Tags: r web-scraping rvest

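I am using the rvest package (R) to extract text from a website, but I am having trouble getting exactly the text I need.

The page is https://securityaffairs.co/wordpress/83620/breaking-news/emotet-targets-chile.html

I need to extract all the text from hr.wp-block-separator onward, including the line breaks.

Here is my code: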

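library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%

con <- url("https://securityaffairs.co/wordpress/83620/breaking-news/emotet-targets-chile.html", "rb")  # binary connection to the URL

content <- read_html(con)

p <- html_nodes(content, "p") %>% html_text()  # text of every <p> element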

At elements 25 and 60 of p, I get the following:

[25] "Threat name: __Denuncia_Activa_CL.PDF.batMD5: 1e541b14b531bcac70e77a012b0f0f7fSHA1: 0ca0cd36fb4c9dfeb3e325a01cfb7b75413d1f81First submission: 2019-03-22 00:39:43"

[60] "Threat name: Integrity.exeMD5: 98172becba685afdd109ac909e3a1085SHA1: cbb0377ec81d8b120382950953d9069424fb100eFirst submission: 2019-03-18 15:10:08" 

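The MD5 hash runs straight into the word "SHA1", and the SHA1 hash runs straight into the word "First"; the line breaks between the fields are lost.

Ideally, I would like to get 4 lines of text, one attribute per line (Threat name, MD5, SHA1, First submission).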
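One possible workaround, sketched in base R on the string from element 25 above: put a newline in front of each known field label and then split on it. The names `labels`, `pattern`, and `fields` are made up for illustration, and the sketch assumes the label texts never occur inside the threat name or the hashes.

```r
# Element 25 of p, copied from the output above
x <- "Threat name: __Denuncia_Activa_CL.PDF.batMD5: 1e541b14b531bcac70e77a012b0f0f7fSHA1: 0ca0cd36fb4c9dfeb3e325a01cfb7b75413d1f81First submission: 2019-03-22 00:39:43"

# Field labels that should each start a new line (assumed fixed for this page)
labels <- c("MD5: ", "SHA1: ", "First submission: ")
pattern <- paste0("(", paste(labels, collapse = "|"), ")")

# Insert a newline before every label, then split on the newlines
with_breaks <- gsub(pattern, "\n\\1", x)
fields <- strsplit(with_breaks, "\n", fixed = TRUE)[[1]]

print(fields)
```

This only patches the text after the fact. It would need a different list of labels for paragraphs with other fields.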

Actual output

"Threat name: __Denuncia_Activa_CL.PDF.batMD5: 1e541b14b531bcac70e77a012b0f0f7fSHA1: 0ca0cd36fb4c9dfeb3e325a01cfb7b75413d1f81First submission: 2019-03-22 00:39:43"

Expected output

Threat name: __Denuncia_Activa_CL.PDF.bat

MD5: 1e541b14b531bcac70e77a012b0f0f7f

SHA1: 0ca0cd36fb4c9dfeb3e325a01cfb7b75413d1f81

First submission: 2019-03-22 00:39:43
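If the fields are separated by `<br>` tags inside each `<p>` (an assumption; the page markup is not shown here), the missing breaks would be explained by `html_text()` dropping `<br>`. A minimal sketch of an alternative, assuming rvest 1.0.0 or later, where `html_text2()` turns `<br>` into "\n". The HTML snippet and the names `snippet`, `doc`, `txt`, and `lines` are stand-ins for illustration, not the real page.

```r
library(rvest)

# Stand-in for one paragraph of the page; the <br> separators are assumed
snippet <- paste0(
  "<p>Threat name: __Denuncia_Activa_CL.PDF.bat<br>",
  "MD5: 1e541b14b531bcac70e77a012b0f0f7f<br>",
  "SHA1: 0ca0cd36fb4c9dfeb3e325a01cfb7b75413d1f81<br>",
  "First submission: 2019-03-22 00:39:43</p>"
)

doc <- read_html(snippet)

# html_text2() keeps <br> as a newline, unlike html_text()
txt <- html_elements(doc, "p") %>% html_text2()

# One attribute per element
lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]
print(lines)
```

On the real page the same call would replace `html_text()` in the code above, if the markup matches this assumption.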

0 answers:

No answers yet.