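Apologies if this is a duplicate... I couldn't work out from the existing SO questions how to do this particular task.
My goal is to find the file names of zip files in some HTML code. The file names sit inside <a href=...>
HTML blocks, so they should be easy to find.
Here is some code that reproduces what I'm looking at: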
# character vector with two strings from my html file
string.examples <-
c("ANES Time Series Cumulative Data File</b><br /><a href=\"../cdf/cdf.htm\"> Study Page</a> | <a href=\"../cdf/cdf_errata.htm\">Errata</a> | <a href=\"../data/cdf/anes_cdf.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-ascii']);\">Download ascii data files <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | <a href=\"../data/cdf/anes_cdfpor.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-por']);\">Download .por file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | <a href=\"../data/cdf/anes_cdfdta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-dta']);\">Download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | August 25, 2011 version </td></tr>",
"ANES 2012 Time Series Study</b><br /><a href=\"../anes_timeseries_2012/anes_timeseries_2012.htm\">Study Page</a> | <a href=\"../anes_timeseries_2012/anes_timeseries_2012_errata.htm\">Errata</a> | <a href=\"../data/anes_timeseries_2012/anes2012TS.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-ascii']);\">Download ascii data files <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | <a href=\"../data/anes_timeseries_2012/anes2012TS_sav.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-sav']);\">Download .sav file</a> <a href=\"../data/anes_timeseries_2012/anes2012TS_sav.zip\"><img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | <a href=\"../data/anes_timeseries_2012/anes2012TS_dta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-dta']);\">Download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | July 1, 2013 version<br />"
)
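Buried deep in the first string is the text <a href=\"../data/cdf/anes_cdfdta.zip\"
and in the second string is the text <a href=\"../data/anes_timeseries_2012/anes2012TS_dta.zip\"
From these two strings, I would like to extract ../data/cdf/anes_cdfdta.zip
and ../data/anes_timeseries_2012/anes2012TS_dta.zip
because they contain the text dta.zip
and because they start with <a href=\" and end with \"
I would like something along the lines of:
x <- some.regex.function( string.examples )
that produces a character vector of length 2 containing those two paths.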
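Answer 0 (score: 3)
Here I assume the pattern you are looking for starts after a href=\"
and ends with dta.zip
The idea is to use a greedy search that consumes everything up to the last a href=\"
that is still followed by dta.zip
We capture each part in its own group and replace the whole matched string with the capture we want.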
gsub("(.*a href=\\\")(.*dta\\.zip)(.*)$", "\\2", string.examples)
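The leading .*a href=\\\"
is the greedy search pattern (the \ and the " have to be escaped). Then, with .*dta\\.zip
we stop the greedy search from running past the point we care about. This is also the pattern we are interested in, so we make sure to capture it. The third group simply matches the rest of the string. The replacement is the second capture, \\2.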
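As a minimal sketch of the approach above, the following applies the same pattern to a shortened, hypothetical vector (`demo` and `has.dta` are illustrative names, not part of the question's data). It adds a `grepl` guard as a precaution, because `gsub` returns a string unchanged when the pattern does not match, so a line without a dta.zip link would otherwise pass through in full.

```r
# Shortened stand-in for string.examples (hypothetical sample data)
demo <- c(
  "<a href=\"../cdf/cdf.htm\">Study Page</a> | <a href=\"../data/cdf/anes_cdfdta.zip\" onClick=\"x\">Download .dta file</a>",
  "<a href=\"../data/anes_timeseries_2012/anes2012TS_sav.zip\">sav</a> | <a href=\"../data/anes_timeseries_2012/anes2012TS_dta.zip\">dta</a> | July 1, 2013",
  "<a href=\"../cdf/cdf_errata.htm\">Errata</a>"
)

# Same pattern as in the answer: three capture groups, the second is the path
pattern <- "(.*a href=\\\")(.*dta\\.zip)(.*)$"

# Keep only the strings that actually contain a matching link
has.dta <- grepl(pattern, demo)

# Replace each remaining string with its second capture group
x <- gsub(pattern, "\\2", demo[has.dta])
print(x)
```

Note that this returns at most one path per string (the last matching link), which is what the question asks for.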
Answer 1 (score: 2)
This regex will find anchor tags whose href attribute value ends in dta.zip:
<a(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\shref=\\(['"]?)((?:(?!\1(?:\s|\/>|>)).)*dta\.zip)\\)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>.*?<\/a>
Sample text
Note that the first line contains some difficult edge cases.
<a onmouseup="" onmouseover=' href=\"../data/anes_timeseries_2012/DontFindMe_dta.zip\" ; if (6 > x) { funRotate(href); } ' href=\"../data/anes_timeseries_2012/DifficultToFind_dta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-dta']);\">
"ANES Time Series Cumulative Data File</b><br /><a href=\"../cdf/cdf.htm\"> Study Page</a> | <a href=\"../cdf/cdf_errata.htm\">Errata</a> | <a href=\"../data/cdf/anes_cdf.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-ascii']);\">Download ascii data files <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | <a href=\"../data/cdf/anes_cdfpor.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-por']);\">Download .por file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | <a href=\"../data/cdf/anes_cdfdta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-dta']);\">Download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | August 25, 2011 version </td></tr>",
"ANES 2012 Time Series Study</b><br /><a href=\"../anes_timeseries_2012/anes_timeseries_2012.htm\">Study Page</a> | <a href=\"../anes_timeseries_2012/anes_timeseries_2012_errata.htm\">Errata</a> | <a href=\"../data/anes_timeseries_2012/anes2012TS.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-ascii']);\">Download ascii data files <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | <a href=\"../data/anes_timeseries_2012/anes2012TS_sav.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-sav']);\">Download .sav file</a> <a href=\"../data/anes_timeseries_2012/anes2012TS_sav.zip\"><img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | <a href=\"../data/anes_timeseries_2012/anes2012TS_dta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-dta']);\">Download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | July 1, 2013 version<br />ac
Matches
[0][0] = <a onmouseup="" onmouseover=' href=\"../data/anes_timeseries_2012/DontFindMe_dta.zip\" ; if (6 > x) { funRotate(href); } ' href=\"../data/anes_timeseries_2012/DifficultToFind_dta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-dta']);\">
"ANES Time Series Cumulative Data File</b><br /><a href=\"../cdf/cdf.htm\"> Study Page</a>
[0][1] = "
[0][2] = ../data/anes_timeseries_2012/DifficultToFind_dta.zip
[1][0] = <a href=\"../data/cdf/anes_cdfdta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-dta']);\">Download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>
[1][1] = "
[1][2] = ../data/cdf/anes_cdfdta.zip
[2][0] = <a href=\"../data/anes_timeseries_2012/anes2012TS_dta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-dta']);\">Download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>
[2][1] = "
[2][2] = ../data/anes_timeseries_2012/anes2012TS_dta.zip
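
For use from R, here is a rough sketch of a much simpler alternative, not the regex above. It assumes the strings are held in an R character vector, where the quotes are plain " characters (the sample text above was matched with literal backslashes before each quote, which is why that regex contains \\). The names `demo` and `pattern` are hypothetical. This version uses a PCRE lookbehind via `perl = TRUE` and does not handle the edge case of an href buried inside another attribute's value.

```r
# Hypothetical sample; inside R these strings contain plain double quotes
demo <- c(
  "<a href=\"../data/cdf/anes_cdfpor.zip\">por</a> | <a href=\"../data/cdf/anes_cdfdta.zip\">dta</a>",
  "<a href=\"../data/anes_timeseries_2012/anes2012TS_dta.zip\" onClick=\"x\">dta</a>"
)

# Lookbehind for href=", then non-quote characters ending in dta.zip,
# then a lookahead for the closing quote
pattern <- "(?<=href=\")[^\"]*dta\\.zip(?=\")"

# gregexpr() locates every match in each string; regmatches() extracts them
matches <- regmatches(demo, gregexpr(pattern, demo, perl = TRUE))

# Flatten the per-string list into one character vector
x <- unlist(matches)
print(x)
```

Unlike the `gsub` approach, this returns every matching link in a string, and strings without a match contribute nothing to the result.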