Don't parse HTML with regex

Question

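I want to use a regular expression to extract the inner text of an anchor tag, along with any HTML tags inside it. I have tried but could not find a pattern that works. A sample of the structure is below.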

My regex is: ( class="related-article"(?:\s|\n))href="(.?)"(>.(*)?)"

I need the regex to match the following HTML content (the a tag):

<a class="related-article" href="10.1182/blood-2017-11-812990">
                 <i>Blood</i> Commentary</a> on this article in this issue.</p>

Answer 1

Don't parse HTML with regex

If you want to extract data from HTML, use XPath.

Using XPath in Java

(The tag on your question says Java. Or did you mean JavaScript?)

For your problem, it would look something like this:

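I am not a Java user; I program in C#. So please treat this code as a pointer in the right direction rather than an example you can copy, paste, and compile.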

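// Assumes `xpath` is a javax.xml.xpath.XPath and `doc` is a parsed org.w3c.dom.Document.
XPathExpression expr = xpath.compile("//p/a[@class='related-article']");
NodeList list = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
for (int i = 0; i < list.getLength(); i++) {
    Element a = (Element) list.item(i);
    String text = a.getTextContent();      // inner text, including text inside child tags such as <i>
    String href = a.getAttribute("href");
}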
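
For reference, the fragment above can be fleshed out into a self-contained sketch using only the JDK's built-in XML and XPath classes. This is an illustration under assumptions, not a drop-in solution: the class name `RelatedArticleXPath` and the method `firstRelatedArticle` are made up for this example, and the sample from the question is wrapped in an opening `<p>` (which the question's snippet omits) so that it parses as well-formed XML. Real-world HTML is often not well-formed, in which case a tolerant HTML parser would probably be needed in front of the XPath step.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RelatedArticleXPath {

    // Hypothetical helper: returns {href, whitespace-normalized inner text}
    // for the first matching anchor, or null if there is none.
    static String[] firstRelatedArticle(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList list = (NodeList) xpath.evaluate(
                    "//p/a[@class='related-article']", doc, XPathConstants.NODESET);
            if (list.getLength() == 0) {
                return null;
            }
            Element a = (Element) list.item(0);
            // getTextContent() concatenates the text of all descendants, e.g. the <i> child.
            String text = a.getTextContent().trim().replaceAll("\\s+", " ");
            return new String[] { a.getAttribute("href"), text };
        } catch (Exception e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalStateException("Could not parse or query the input", e);
        }
    }

    public static void main(String[] args) {
        // Assumption: an opening <p> is added so the fragment is well-formed.
        String xml = "<p><a class=\"related-article\" href=\"10.1182/blood-2017-11-812990\">\n"
                + "                 <i>Blood</i> Commentary</a> on this article in this issue.</p>";
        String[] result = firstRelatedArticle(xml);
        System.out.println(result[0]);
        System.out.println(result[1]);
    }
}
```

With the sample input, this should print the `href` value followed by `Blood Commentary`.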

Answer 2

You may find the following useful:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagNames {
   public static void main(String[] args) {
      String sample = "<!---DOCTYPE><html><body></body></html>";
      // Lazily capture everything between each '<' and the next '>'
      Pattern p = Pattern.compile("<(.*?)>");
      Matcher m = p.matcher(sample);
      while (m.find()) {
         String group = m.group(1);
         // Skip declarations and comments such as <!---DOCTYPE>
         if (group.contains("!")) {
            continue;
         }
         System.out.print(group);
      }
   }
}

Output: htmlbody/body/html
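
The same lazy-capture idea could be adapted to the anchor in the question. The following is only a sketch under assumptions: the class name `RelatedArticleRegex` and the method `extract` are invented for illustration, and the pattern assumes that `class` comes before `href`, that both use double quotes, and that the anchor contains no nested `<a>`. Any deviation from that shape will make it miss, which is the usual argument for a real parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RelatedArticleRegex {

    // DOTALL lets '.' cross the line break inside the anchor;
    // the lazy (.*?) groups stop at the first closing quote and the first </a>.
    static final Pattern ANCHOR = Pattern.compile(
            "<a\\s+class=\"related-article\"\\s+href=\"(.*?)\"\\s*>(.*?)</a>",
            Pattern.DOTALL);

    // Hypothetical helper: returns {href, trimmed inner HTML}, or null if nothing matches.
    static String[] extract(String html) {
        Matcher m = ANCHOR.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2).trim() };
    }

    public static void main(String[] args) {
        String html = "<a class=\"related-article\" href=\"10.1182/blood-2017-11-812990\">\n"
                + "                 <i>Blood</i> Commentary</a> on this article in this issue.</p>";
        String[] result = extract(html);
        System.out.println(result[0]);
        System.out.println(result[1]);
    }
}
```

Group 1 holds the `href` value and group 2 holds the inner HTML, so for the sample the second line printed should be `<i>Blood</i> Commentary`.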

Regex to extract the inner text as well as the inner HTML tags from an anchor tag
