Question

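I'm a beginner with regular expressions, so I'm having trouble.

Given the string below, how can I write a regex that matches "69144"? Matching some of the surrounding text is fine too, as long as I can narrow it down afterwards.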

Citations</a></td><td class="cit-borderleft cit-data">69144</td><td class="cit-borderleft
cit data">22047</td></tr><tr class="cit-borderbottom"><td class="cit-caption"><a href="#"
class="cit-dark-link" onclick="return citToggleIndexDef('h_index_definition')" title='
h-index is the largest number h such that h publications have at least h citations. 
The second column has the &quot;recent&quot; version of this metric which is the largest 
number h such that h publications have at least h new citations in the last 5 years.
 '>h-index</a></td><td class="cit-borderleft cit-data">88</td>

I apologize that the string is hard to read.

Answer 1

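Assuming you are trying to extract the number in the first td cell, it is easier to search for the start and end of the tag and pull out the content with substring than to use a regular expression.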

// text contains the HTML from your question

int tdIndex = text.indexOf("<td");
int endTdIndex = text.indexOf(">", tdIndex + 1);
int endTdTagIndex = text.indexOf("</td>", endTdIndex + 1);

// take everything between the end of the opening <td ...> tag and the closing </td>
String numString = text.substring(endTdIndex + 1, endTdTagIndex);

// numString now contains 69144

If you need the contents of a td cell further into the HTML, you can search for later td tags in a loop using the following:

tdIndex = text.indexOf("<td",tdIndex+1);

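You have to know which tag you are after (e.g., "the third td") and that the same number of td tags will always precede it, but given those two assumptions, this code will get you there with minimal modification.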
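As a rough sketch of that loop, the helper below walks forward to the n-th td and then applies the same substring extraction. The class name TdExtractor, the method nthTdContent, and the shortened sample string are assumptions made for illustration; they are not part of the original code.

```java
public class TdExtractor {

    /**
     * Returns the text inside the n-th (1-based) td cell of html,
     * or null if that cell cannot be found.
     * Hypothetical helper: assumes simple, non-nested td cells.
     */
    static String nthTdContent(String html, int n) {
        int tdIndex = -1;
        // Step forward from one "<td" to the next, n times.
        for (int i = 0; i < n; i++) {
            tdIndex = html.indexOf("<td", tdIndex + 1);
            if (tdIndex < 0) {
                return null; // fewer than n td tags
            }
        }
        // End of the opening <td ...> tag.
        int endTdIndex = html.indexOf(">", tdIndex + 1);
        if (endTdIndex < 0) {
            return null;
        }
        // Start of the matching closing tag.
        int endTdTagIndex = html.indexOf("</td>", endTdIndex + 1);
        if (endTdTagIndex < 0) {
            return null;
        }
        return html.substring(endTdIndex + 1, endTdTagIndex);
    }

    public static void main(String[] args) {
        // Shortened version of the HTML from the question.
        String text = "Citations</a></td><td class=\"cit-borderleft cit-data\">69144</td>"
                + "<td class=\"cit-borderleft cit-data\">22047</td></tr>";
        System.out.println(nthTdContent(text, 1));
        System.out.println(nthTdContent(text, 2));
    }
}
```

Returning null when a tag is missing is one possible choice; throwing an exception would work just as well if a missing cell should be treated as an error.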

If you can't make assumptions about the format of the code, then I second Reimeus' answer: an HTML parser could prove very useful.

Answer 2

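One way to parse HTML is to use XPath, a library that ships with Java. XPath traverses the "tree" of an XML/HTML document and retrieves the values of its nodes (the content inside the tags). The library is easy to use and easy to learn, and there is nothing extra to download. For more on this topic, see the New Think Tank XPath Tutorial.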
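A minimal sketch of the XPath approach, using the javax.xml.xpath classes from the JDK. It assumes the markup is well-formed XML, so the sample here is a cleaned-up, complete table rather than the raw fragment from the question; real-world HTML may need tidying before this parser accepts it. The class name XPathTdExample, the method firstDataCell, and the XPath expression are illustrative choices.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathTdExample {

    /**
     * Returns the text of the first td whose class attribute contains
     * "cit-data". Hypothetical helper: the input must be well-formed XML.
     */
    static String firstDataCell(String xml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Select all matching td nodes, then keep only the first one.
        String expression = "(//td[contains(@class, 'cit-data')])[1]";
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws XPathExpressionException {
        // Well-formed stand-in for the table row in the question.
        String xml = "<table><tr>"
                + "<td class=\"cit-caption\"><a href=\"#\">Citations</a></td>"
                + "<td class=\"cit-borderleft cit-data\">69144</td>"
                + "<td class=\"cit-borderleft cit-data\">22047</td>"
                + "</tr></table>";
        System.out.println(firstDataCell(xml));
    }
}
```

Selecting by class rather than by position means the expression keeps working if extra cells are added before the data cell, as long as the class name stays the same.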
