How to extract a string that matches a pattern and sits between two other strings

Time: 2013-07-21 17:35:45

Tags: regex r gsub

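Sorry if this is a duplicate... it isn't clear to me from what is already on SO how to perform this particular task.

My goal is to find the file names of zipped files in some HTML code. The file names are inside <a href=...> HTML blocks, so one would think they would be easy to find.

Here is some code that reproduces what I'm looking at: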

# character vector with two strings from my html file
string.examples <-
    c("ANES Time Series Cumulative Data File</b><br /><a href=\"../cdf/cdf.htm\"> Study Page</a>&nbsp; | &nbsp;<a href=\"../cdf/cdf_errata.htm\">Errata</a>&nbsp; | &nbsp;<a href=\"../data/cdf/anes_cdf.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-ascii']);\">Download ascii data files  <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;<a href=\"../data/cdf/anes_cdfpor.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-por']);\">Download .por file  <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;<a href=\"../data/cdf/anes_cdfdta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-dta']);\">Download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;August 25, 2011 version </td></tr>", 
    "ANES 2012 Time Series Study</b><br /><a href=\"../anes_timeseries_2012/anes_timeseries_2012.htm\">Study Page</a>&nbsp; | &nbsp;<a href=\"../anes_timeseries_2012/anes_timeseries_2012_errata.htm\">Errata</a>&nbsp; |  &nbsp;<a href=\"../data/anes_timeseries_2012/anes2012TS.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-ascii']);\">Download ascii data files <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;<a href=\"../data/anes_timeseries_2012/anes2012TS_sav.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-sav']);\">Download .sav file</a> <a href=\"../data/anes_timeseries_2012/anes2012TS_sav.zip\"><img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;<a href=\"../data/anes_timeseries_2012/anes2012TS_dta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-dta']);\">Download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;July 1, 2013 version<br />"
)

Buried deep within the first line is the text <a href=\"../data/cdf/anes_cdfdta.zip\", and in the second line the text <a href=\"../data/anes_timeseries_2012/anes2012TS_dta.zip\".

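From these two lines, I would like to extract ../data/cdf/anes_cdfdta.zip and ../data/anes_timeseries_2012/anes2012TS_dta.zip, because they contain the text dta.zip and because they begin with <a href=\" and end with \".

I would like something that produces a character vector of length 2, using..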
x <- some.regex.function( string.examples )

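2 answers:

Answer 0: (score: 3)

Here I assume that the pattern you're looking for starts after a href=\" and ends with dta.zip. The idea is to use a greedy search that consumes everything, including every earlier a href, up to dta.zip. We also capture each part and replace the whole searched string with the capture we want.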

gsub("(.*a href=\\\")(.*dta\\.zip)(.*)$", "\\2", string.examples)

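The leading .*a href=\\\" is the "greedy" search pattern mentioned above (both \ and " must be escaped). Because it is greedy, it runs up to the last a href=\" that is still followed by dta.zip. Then .*dta\\.zip keeps the search from going beyond the point we care about. It is also the pattern we are interested in, so we make sure to capture it. The rest of the line is matched by the third group. The replacement pattern is the second capture.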
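As a minimal sketch of how the greedy capture behaves, the same call can be tried on two short made-up strings (the vector `toy` and its file names are hypothetical, not taken from the question). One thing worth noticing: `gsub` returns its input unchanged when the pattern does not match, so a line without a dta.zip link comes back whole and may need filtering.

```r
# hypothetical toy input: the first element has a dta.zip link, the second does not
toy <- c(
    "<a href=\"a/one.zip\">x</a> | <a href=\"b/two_dta.zip\" onClick=\"f()\">y</a>",
    "<a href=\"c/three.zip\">z</a>"
)

# same pattern as in the answer: keep only the second capture group
res <- gsub("(.*a href=\\\")(.*dta\\.zip)(.*)$", "\\2", toy)

# res[1] is the captured path; res[2] is the untouched input, since nothing matched
print(res)

# keep only the elements where a dta.zip link was actually found
found <- res[grepl("dta\\.zip$", res)]
print(found)
```

The filter step is an addition for illustration; on the two example strings in the question every line contains a dta.zip link, so it would not change the result there.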

Answer 1: (score: 2)

Description

This regular expression will:

  • find the href value of anchor tags whose value ends in dta.zip
  • avoid problematic edge cases

<a(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\shref=\\(['"]?)((?:(?!\1(?:\s|\/>|>)).)*dta\.zip)\\)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>.*?<\/a>

Example

Sample text

Note that the first line has some difficult edge cases

<a onmouseup="" onmouseover=' href=\"../data/anes_timeseries_2012/DontFindMe_dta.zip\" ; if (6 > x) { funRotate(href); } ' href=\"../data/anes_timeseries_2012/DifficultToFind_dta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-dta']);\">

"ANES Time Series Cumulative Data File</b><br /><a href=\"../cdf/cdf.htm\"> Study Page</a>&nbsp; | &nbsp;<a href=\"../cdf/cdf_errata.htm\">Errata</a>&nbsp; | &nbsp;<a href=\"../data/cdf/anes_cdf.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-ascii']);\">Download ascii data files  <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;<a href=\"../data/cdf/anes_cdfpor.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-por']);\">Download .por file  <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;<a href=\"../data/cdf/anes_cdfdta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-dta']);\">Download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;August 25, 2011 version </td></tr>", 
    "ANES 2012 Time Series Study</b><br /><a href=\"../anes_timeseries_2012/anes_timeseries_2012.htm\">Study Page</a>&nbsp; | &nbsp;<a href=\"../anes_timeseries_2012/anes_timeseries_2012_errata.htm\">Errata</a>&nbsp; |  &nbsp;<a href=\"../data/anes_timeseries_2012/anes2012TS.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-ascii']);\">Download ascii data files <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;<a href=\"../data/anes_timeseries_2012/anes2012TS_sav.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-sav']);\">Download .sav file</a> <a href=\"../data/anes_timeseries_2012/anes2012TS_sav.zip\"><img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;<a href=\"../data/anes_timeseries_2012/anes2012TS_dta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-dta']);\">Download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;July 1, 2013 version<br />ac

Matches

[0][0] = <a onmouseup="" onmouseover=' href=\"../data/anes_timeseries_2012/DontFindMe_dta.zip\" ; if (6 > x) { funRotate(href); } ' href=\"../data/anes_timeseries_2012/DifficultToFind_dta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-dta']);\">

"ANES Time Series Cumulative Data File</b><br /><a href=\"../cdf/cdf.htm\"> Study Page</a>
[0][1] = "
[0][2] = ../data/anes_timeseries_2012/DifficultToFind_dta.zip


[1][0] = <a href=\"../data/cdf/anes_cdfdta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-dta']);\">Download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>
[1][1] = "
[1][2] = ../data/cdf/anes_cdfdta.zip


[2][0] = <a href=\"../data/anes_timeseries_2012/anes2012TS_dta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-dta']);\">Download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>
[2][1] = "
[2][2] = ../data/anes_timeseries_2012/anes2012TS_dta.zip
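
The expression above relies on lookaheads and a backreference, which in R require `perl = TRUE`. As a rough sketch of the same lookaround idea in R, the snippet below uses a much simpler pattern (not the expression from this answer) with `gregexpr` and `regmatches`. The input `toy` and its file names are made up for illustration, and the pattern assumes the href value is wrapped in plain double quotes; it does not guard against the edge case of an href hidden inside another attribute's quoted value.

```r
# hypothetical toy input with two dta.zip links spread over two lines
toy <- c(
    "<a href=\"a/one.zip\">x</a> | <a href=\"b/two_dta.zip\" onClick=\"f()\">y</a>",
    "<a href=\"c/three.zip\">z</a> <a href=\"d/four_dta.zip\">w</a>"
)

# lookbehind anchors on href=", lookahead on the closing quote;
# neither is part of the returned match
pat <- "(?<=href=\")[^\"]*dta\\.zip(?=\")"

# gregexpr finds every match per element; regmatches extracts the matched text
out <- unlist(regmatches(toy, gregexpr(pat, toy, perl = TRUE)))
print(out)
```

Unlike the `gsub` approach, this returns only the matched paths, and an element without a match contributes nothing to the result.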