Question

I have the following HTML, and I am trying to run a regular expression against it with the gregexpr function in R:

<div class=g-unit>
<div class=nwp style=display:inline>
<input type=hidden name=cid value="22144">
<input autocomplete=off class=id-fromdate type=text size=10 name=startdate value="Sep 6, 2013"> -
<input autocomplete=off class=id-todate type=text size=10 name=enddate value="Sep 5, 2014">
<input id=hfs type=submit value=Update style="height:1.9em; margin:0 0 0 0.3em;">
</div>
</div>
</div>
<div id=prices class="gf-table-wrapper sfe-break-bottom-16">
<table class="gf-table historical_price">
<tr class=bb>
<th class="bb lm lft">Date
<th class="rgt bb">Open
<th class="rgt bb">High
<th class="rgt bb">Low
<th class="rgt bb">Close
<th class="rgt bb rm">Volume
<tr>
...
...
</table>
</div>

I am trying to extract the table portion of this HTML with the following regular expression:

<table\\s+class="gf-table historical_price">.+<

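When I run gregexpr with perl = FALSE it works fine and I get a result. However, when I run it with perl = TRUE I get nothing back; the pattern does not seem to match.

Does anyone know why the result changes just by switching perl on and off? Many thanks in advance!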

Answer 1

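It seems that in R's default extended regex mode the dot is able to match newline characters, which is not the case in perl mode. To make the pattern work in perl mode, you need to use the (?s) modifier so that the dot can match newlines: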

> m <- gregexpr('(?s)<table\\s+class="gf-table historical_price">.+</table>', str, perl = TRUE)

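In many regex flavors the dot does not match newlines by default, probably because that makes line-by-line work more convenient.

The s in the inline modifier (?s) stands for "single line". In other words, the whole string is treated as a single line (as far as the dot is concerned), even if it contains newlines.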
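The difference between the two engines can be seen on a much smaller input. The following is a minimal sketch in base R; the two-line string `x` and the variable names are made up for illustration and are not part of the original question.

```r
# Toy input (hypothetical): a short string containing one newline,
# standing in for the multi-line HTML in the question.
x <- "<b>one\ntwo</b>"

# Default engine (perl = FALSE): the dot also matches the newline.
default_match <- grepl("<b>.+</b>", x)

# PCRE (perl = TRUE) without (?s): the dot stops at the newline.
pcre_plain <- grepl("<b>.+</b>", x, perl = TRUE)

# PCRE with the inline (?s) modifier: the dot matches the newline again.
pcre_dotall <- grepl("(?s)<b>.+</b>", x, perl = TRUE)

print(c(default = default_match, pcre = pcre_plain, pcre_dotall = pcre_dotall))
```

Only the middle call should fail to match, which mirrors the behaviour described in the question.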

Answer 2

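You need to use the inline (?s) modifier to force the dot to match all characters, including newlines.

The perl=T argument switches to the PCRE library, which implements Perl-compatible regular expression matching.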

gregexpr('(?s)<table\\s+class="gf-table historical_price">.+</table>', x, perl=T)
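gregexpr only returns match positions, so the matched text still has to be pulled out. A minimal sketch of one way to do that with base R's regmatches follows; the miniature page in `x` and the name `table_html` are assumptions made for illustration.

```r
# Hypothetical miniature of the page: a table spanning several lines.
x <- paste(
  "<div id=prices>",
  "<table class=\"gf-table historical_price\">",
  "<tr><th>Date<th>Open",
  "</table>",
  "</div>",
  sep = "\n"
)

# Same pattern as above, with (?s) so the dot can cross line breaks under PCRE.
pattern <- "(?s)<table\\s+class=\"gf-table historical_price\">.+</table>"
m <- gregexpr(pattern, x, perl = TRUE)

# regmatches() returns a list with one character vector per input string;
# [[1]] picks the matches found in the single string x.
table_html <- regmatches(x, m)[[1]]
cat(table_html, "\n")
```

Because .+ is greedy, an input with several tables would be matched from the first opening tag through to the last </table>.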

However, as noted in the comments, it is recommended to use a parser for this. I would start with the XML library.

library(XML)  # provides htmlParse(), xpathSApply() and xmlValue()
cat(paste(xpathSApply(htmlParse(html), '//table[@class="gf-table historical_price"]', xmlValue), collapse = "\n"))

The gregexpr function in R returns different results depending on whether perl is TRUE or FALSE
