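I am trying to extract the unique URLs of the form /~/ ... .ashx. From the complete HTML source that I have already scraped, I tried the following function to get a list of href matches.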
library(XML)

# Parse the scraped HTML and return the href of every <a> element
processHTML <- function(html) {
  doc <- htmlTreeParse(html, useInternalNodes = TRUE)
  xpathSApply(doc, "//a/@href")
}
From the output below, I need to pick out only the path, excluding the href label and the quotation marks, i.e. /~/media/McKinsey/Business Functions/Marketing and Sales/Our Insights/Discussions in digital Whats a marketing ecosystem/Discussions-in-digital-Marketings-ecosystem.ashx:
href "/~/media/McKinsey/Business Functions/Marketing and Sales/Our Insights/Discussions in digital Whats a marketing ecosystem/Discussions-in-digital-Marketings-ecosystem.ashx"
Please help me with a regular expression that solves the problem above.
Answer 0 (score: 1)
If I understand the question correctly, this may help:
# Escape the dot and anchor at the end so only values ending in ".ashx" match
txt[grepl("\\.ashx$", txt)][["href"]]
The output is:
[1] "/~/media/McKinsey/Business Functions/Marketing and Sales/Our Insights/Discussions in digital Whats a marketing ecosystem/Discussions-in-digital-Marketings-ecosystem.ashx"
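Note that the href label in the printed result is the name of the vector element, not part of the string, so no regular expression is needed to strip it or the quotation marks. The following is a minimal sketch of filtering, dropping the names, and de-duplicating in base R. The vector hrefs and its values are hypothetical stand-ins for whatever xpathSApply returns on the real page.

```r
# Hypothetical stand-in for the named vector returned by
# xpathSApply(doc, "//a/@href")
hrefs <- c(href = "mailto:?subject=example",
           href = "/~/media/a/report.ashx",
           href = "/~/media/a/report.ashx",
           href = "/about/index.html")

# Keep only values ending in ".ashx"; the dot is escaped so it matches literally
ashx <- hrefs[grepl("\\.ashx$", hrefs)]

# Drop the "href" names, then remove duplicate URLs
ashx_urls <- unique(unname(ashx))
print(ashx_urls)
```

Unlike indexing with [["href"]], which returns only the first matching element, this keeps every distinct match.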
Sample data:
txt <- structure(c("mailto:?subject=From%20mckinsey.com%3a%20Discussions%20in%20digital%3a%20What%e2%80%99s%20a%20marketing%20ecosystem%20and%20what%20does%20it%20mean%20for%20marketers%3f&body=I%20recommend%20you%20visit%20mckinsey.com%20to%20read%3a%0d%0a%0d%0aDiscussions%20in%20digital%3a%20What%e2%80%99s%20a%20marketing%20ecosystem%20and%20what%20does%20it%20mean%20for%20marketers%3f%0d%0ahttp%3a%2f%2fwww.mckinsey.com%2fbusiness-functions%2fmarketing-and-sales%2four-insights%2fdiscussions-in-digital-whats-a-marketing-ecosystem%3fcid%3deml-web",
"/~/media/McKinsey/Business Functions/Marketing and Sales/Our Insights/Discussions in digital Whats a marketing ecosystem/Discussions-in-digital-Marketings-ecosystem.ashx"
), .Names = c("href", "href"))
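As an alternative, the filtering can be pushed into the XPath query so that only candidate links are returned in the first place. This is a sketch under stated assumptions, not the asker's code: the function name extractAshxLinks and the HTML fragment are hypothetical, and it relies on the XML package already used in the question. XPath 1.0 has no ends-with(), so contains() narrows the set and a regular expression does the final end-of-string check.

```r
library(XML)

# Hypothetical HTML fragment standing in for the scraped page
html <- '<html><body>
  <a href="mailto:someone@example.com">Mail</a>
  <a href="/~/media/docs/report.ashx">Report</a>
  <a href="/~/media/docs/report.ashx">Report again</a>
  <a href="/about.html">About</a>
</body></html>'

# Hypothetical helper: return the unique hrefs that end in ".ashx"
extractAshxLinks <- function(html) {
  doc <- htmlParse(html, asText = TRUE)
  # contains() pre-filters inside the XPath query
  hrefs <- xpathSApply(doc, "//a[contains(@href, '.ashx')]/@href")
  # as.character(unlist(...)) drops the "href" names and copes with no matches
  hrefs <- as.character(unlist(hrefs))
  # Anchor at the end of the string, then de-duplicate
  unique(hrefs[grepl("\\.ashx$", hrefs)])
}

links <- extractAshxLinks(html)
print(links)
```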