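I'm using the rvest package (R) to extract text from a website, but I'm having trouble getting the exact text I need.
The page is https://securityaffairs.co/wordpress/83620/breaking-news/emotet-targets-chile.html
I need to extract all the text under hr.wp-block-separator, including the line breaks (<br>).
Here is my code: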
library(rvest)
con <- url("https://securityaffairs.co/wordpress/83620/breaking-news/emotet-targets-chile.html", "rb") ## binary connection to URL
content <- read_html(con)
p <- html_nodes(content, "p") %>% html_text()
At elements 25 and 60 of the result, I get the following:
[25] "Threat name: __Denuncia_Activa_CL.PDF.batMD5: 1e541b14b531bcac70e77a012b0f0f7fSHA1: 0ca0cd36fb4c9dfeb3e325a01cfb7b75413d1f81First submission: 2019-03-22 00:39:43"
[60] "Threat name: Integrity.exeMD5: 98172becba685afdd109ac909e3a1085SHA1: cbb0377ec81d8b120382950953d9069424fb100eFirst submission: 2019-03-18 15:10:08"
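The MD5 hash runs straight into the word "SHA1", and the SHA1 hash runs straight into the word "First".
Ideally, I would like to get four lines of text, one attribute per line (threat name, MD5, SHA1, first submission).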
Actual output:
"Threat name: __Denuncia_Activa_CL.PDF.batMD5: 1e541b14b531bcac70e77a012b0f0f7fSHA1: 0ca0cd36fb4c9dfeb3e325a01cfb7b75413d1f81First submission: 2019-03-22 00:39:43"
Expected output:
Threat name: __Denuncia_Activa_CL.PDF.bat
MD5: 1e541b14b531bcac70e77a012b0f0f7f
SHA1: 0ca0cd36fb4c9dfeb3e325a01cfb7b75413d1f81
First submission: 2019-03-22 00:39:43
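
One possible way to get from the actual output to the expected output is sketched below. It assumes the fields inside each paragraph are separated by <br> tags, which html_text() drops, and that rvest 1.0.0 or later is installed so that html_text2() is available. The HTML string is a hand-written stand-in for one paragraph of the page, not markup copied from the site, and the variable names are only illustrative.

```r
library(rvest)  # html_text2() requires rvest >= 1.0.0

# Assumed markup: a single <p> whose fields are separated by <br> tags.
# This is a stand-in for the real page, built from the values in the question.
html <- paste0(
  "<html><body><p>",
  "Threat name: __Denuncia_Activa_CL.PDF.bat<br>",
  "MD5: 1e541b14b531bcac70e77a012b0f0f7f<br>",
  "SHA1: 0ca0cd36fb4c9dfeb3e325a01cfb7b75413d1f81<br>",
  "First submission: 2019-03-22 00:39:43",
  "</p></body></html>"
)

page <- read_html(html)
p_nodes <- html_elements(page, "p")

# html_text() ignores <br>, so the fields run together as in the question.
collapsed <- html_text(p_nodes)

# html_text2() converts each <br> into a newline character.
with_breaks <- html_text2(p_nodes)

# Split the first paragraph on "\n" to get one field per element.
fields <- strsplit(with_breaks[1], "\n", fixed = TRUE)[[1]]
writeLines(fields)
```

On the real page the same idea would be applied to the result of read_html(con), replacing html_text() with html_text2() and then splitting elements 25 and 60 on "\n". Whether that works as-is depends on the page really using <br> between the fields.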