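Extracting URLs from an HTML file using regular expressions

Time: 2018-03-27 05:54:15

Tags: r regex

I am trying to extract the unique URLs that start with /~/ and end in .ashx. Working from the full HTML source I have already scraped, I tried the following function to get a list of the href matches.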

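library(XML)  # provides htmlTreeParse() and xpathSApply()

processHTML <- function(html) {
  doc <- htmlTreeParse(html, useInternalNodes = TRUE)
  # Return the href attribute of every <a> element as a named character vector
  xpathSApply(doc, "//a/@href")
}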

From the output below I need to pick out only the path, without the href label and the quotation marks: /~/media/McKinsey/Business Functions/Marketing and Sales/Our Insights/Discussions in digital Whats a marketing ecosystem/Discussions-in-digital-Marketings-ecosystem.ashx

href   "/~/media/McKinsey/Business Functions/Marketing and Sales/Our Insights/Discussions in digital Whats a marketing ecosystem/Discussions-in-digital-Marketings-ecosystem.ashx"

Please help me with a regular expression for the problem above.

1 answer:

Answer 0: (score: 1)

If I have understood the question correctly, this may help:

# Escape the dot and anchor at the end so only strings ending in ".ashx" match
txt[grepl('\\.ashx$', txt)][['href']]

The output is:

[1] "/~/media/McKinsey/Business Functions/Marketing and Sales/Our Insights/Discussions in digital Whats a marketing ecosystem/Discussions-in-digital-Marketings-ecosystem.ashx"
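
Because every element is named href, indexing with [['href']] returns only the first match. If a page contains several .ashx links and the goal is the unique set, a minimal base-R sketch could look like the following. The hrefs vector and its values are hypothetical, standing in for whatever xpathSApply(doc, "//a/@href") returns on the real page.

```r
# Hypothetical hrefs, standing in for the result of xpathSApply(doc, "//a/@href")
hrefs <- c(href = "mailto:someone@example.com",
           href = "/~/media/Some Folder/First-file.ashx",
           href = "/about/contact.ashx.html",
           href = "/~/media/Some Folder/First-file.ashx",
           href = "/~/media/Other/Second-file.ashx")

# "\\.ashx$" matches a literal ".ashx" only at the end of the string
is_ashx <- grepl("\\.ashx$", hrefs)

# Drop the "href" names and remove duplicates
ashx_urls <- unique(unname(hrefs[is_ashx]))
print(ashx_urls)
```

Here unname() strips the href labels, so printing shows only the quoted paths, and unique() keeps one copy of each link.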

Sample data:

txt <- structure(c("mailto:?subject=From%20mckinsey.com%3a%20Discussions%20in%20digital%3a%20What%e2%80%99s%20a%20marketing%20ecosystem%20and%20what%20does%20it%20mean%20for%20marketers%3f&body=I%20recommend%20you%20visit%20mckinsey.com%20to%20read%3a%0d%0a%0d%0aDiscussions%20in%20digital%3a%20What%e2%80%99s%20a%20marketing%20ecosystem%20and%20what%20does%20it%20mean%20for%20marketers%3f%0d%0ahttp%3a%2f%2fwww.mckinsey.com%2fbusiness-functions%2fmarketing-and-sales%2four-insights%2fdiscussions-in-digital-whats-a-marketing-ecosystem%3fcid%3deml-web", 
"/~/media/McKinsey/Business Functions/Marketing and Sales/Our Insights/Discussions in digital Whats a marketing ecosystem/Discussions-in-digital-Marketings-ecosystem.ashx"
), .Names = c("href", "href"))
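
Since the question asks for a regular expression, here is a rough sketch that works on the raw HTML text with base R only, without parsing the document. It assumes every href attribute is written with double quotes and no spaces around the equals sign, and the html string below is made up for illustration. Regular expressions are fragile on real-world HTML, so the XPath approach from the question is probably the more robust first step.

```r
# Made-up HTML fragment for illustration; real pages will be messier
html <- paste0(
  '<a href="/~/media/Some Folder/First-file.ashx">first</a>',
  '<a href="mailto:someone@example.com">mail</a>',
  '<a href="/~/media/Some Folder/First-file.ashx">first again</a>',
  '<a href="/~/media/Other/Second-file.ashx">second</a>'
)

# Match href="..." where the quoted value ends in a literal ".ashx"
pattern <- 'href="[^"]*\\.ashx"'
matches <- regmatches(html, gregexpr(pattern, html))[[1]]

# Strip the href= prefix and the surrounding quotes, then deduplicate
urls <- unique(sub('^href="(.*)"$', '\\1', matches))
print(urls)
```

The character class [^"]* keeps each match inside a single attribute value, which is why the mailto link is skipped rather than swallowed into a longer match.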