Question

I am trying to extract the abstract from this link. However, I cannot isolate just the abstract's text. This is what I have so far:

url <- "http://www.scielo.br/scielo.php?script=sci_abstract&pid=S1981-38212013000100001&lng=en&nrm=iso&tlng=en"
textList <- readLines(url)
text <- textList[grep("Abstract[^\\:]", textList)] # get the correct element
text1 <- gsub("\\b(.*?)\\bISSN", "" , text)

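At this point I have almost what I want, but I cannot get rid of the rest of the string, which I am not interested in.

I also tried another approach using XPath, without success. I tried something like the code below, but it returned nothing useful.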

library(XML)
arg.xpath <-"//p/@xmlns"
doc <- htmlParse(url)   # parse the URL
linksAux <- xpathSApply(doc, arg.xpath)

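How can I get what I need using regular expressions, XPath, or a combination of both?

PS: My overall goal is to scrape several pages similar to the one I provided. I can already extract the links. Now I only need the abstracts.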

Answer 1

Someone may be able to give you a better answer, but this approach works:

reg=regexpr("<p xmlns=\"\">(.*?)</p>",text1)  
begin=reg[[1]]+12
end=attr(reg,which = "match.length")+begin-17
substr(text1,begin,end)
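
The same extraction can probably be written without the manual offset arithmetic by capturing the paragraph body directly. A minimal sketch in base R, assuming the string contains a `<p xmlns="">...</p>` element; the `sample` string is invented for illustration and is not the real page content:

```r
# Hypothetical stand-in for text1; the real string comes from readLines(url).
sample <- "ISSN 1981-3821<p xmlns=\"\">This article is about X.</p><p xmlns=\"\">Keywords: a; b</p>"

# regexec() returns the full match plus each parenthesised group,
# and regmatches() turns those positions into substrings.
m <- regmatches(sample, regexec("<p xmlns=\"\">(.*?)</p>", sample, perl = TRUE))

# Element 1 is the full match, element 2 is the captured paragraph body.
abstract <- m[[1]][2]
abstract
```

Only the first matching paragraph is returned, so this still depends on the abstract being the first `<p xmlns="">` element in the string.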

Answer 2

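I strongly recommend the XML approach, because running regular expressions over HTML can be a real headache. I think your XPath expression was just slightly off. Try: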

doc <- htmlParse(url)
xpathSApply(doc, "//p[@xmlns]", xmlValue)

which returns (output truncated):

[1] "HOLLANDA,  Cristina Buarque de. Human rights ..."
[2] "This article is dedicated to recounting the main ..."
[3] "Keywords\n\t\t:\n\t\tHuman rights; transitional ..."
[4] ""
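
To pull out only the abstract from that result, one option is to drop the empty entries and pick the element by position. This is a sketch on an invented miniature of the page (the `html` string is hypothetical, and the real markup may differ); it assumes the abstract is the second matching paragraph, as in the output above:

```r
library(XML)  # same third-party package used in the question

# Hypothetical miniature of the page, parsed from a string instead of a URL.
html <- '<html><body>
<p xmlns="">AUTHOR, Name. Title of the article.</p>
<p xmlns="">This article is about X.</p>
<p xmlns="">Keywords: a; b</p>
<p>Unrelated paragraph without the attribute.</p>
</body></html>'

doc <- htmlParse(html, asText = TRUE)
paras <- xpathSApply(doc, "//p[@xmlns]", xmlValue)

# Trim surrounding whitespace and drop empty strings.
paras <- trimws(paras)
paras <- paras[nzchar(paras)]

# Assumption: the abstract is the second paragraph carrying the attribute.
abstract <- paras[2]
abstract
```

For several pages, the same steps could be wrapped in a function and applied over the vector of links with `sapply`, provided every page follows the same layout.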

Answer 3

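Here is another approach. It is clumsy as written, but it shows a technique for keeping the right piece after splitting on the tag delimiters: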

text2 <- sapply(strsplit(x = text1, ">"), "[", 3)
text2
[1] "This article is dedicated to recounting the main initiative of Nelson Mandela's government to manage the social resentment inherited from the segregationist regime. I conducted interviews with South African intellectuals committed to the theme of transitional justice and with key personalities who played a critical role in this process. The Truth and Reconciliation Commission is presented as the primary institutional mechanism envisioned for the delicate exercise of redefining social relations inherited from the apartheid regime in South Africa. Its founders declared grandiose political intentions to the detriment of localized more palpable objectives. Thus, there was a marked disparity between the ambitious mandate and the political discourse about the commission, and its actual achievements.</p"
text3 <- sapply(strsplit(text2, "<"), "[", 1)
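
To see why the indices 3 and 1 are used, it may help to run both splits on a small string. A sketch with an invented `sample` value; the position of the abstract in the real `text1` depends on how many tags precede it:

```r
# Hypothetical stand-in for text1.
sample <- "<div><p xmlns=\"\">This article is about X.</p></div>"

# Splitting on ">" gives four pieces:
# "<div", "<p xmlns=\"\"", "This article is about X.</p", "</div"
pieces <- strsplit(sample, ">", fixed = TRUE)[[1]]

# The third piece holds the text followed by the start of the closing tag.
text2 <- pieces[3]

# Splitting on "<" and keeping the first piece removes the dangling "</p".
text3 <- strsplit(text2, "<", fixed = TRUE)[[1]][1]
text3
```

Because the index is hard-coded, this breaks as soon as a page has a different number of tags before the abstract, which is one reason the XPath approach tends to be more robust.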

Web scraping (possibly) malformed HTML in R using XPath or regex
