Question

I have been trying to write a regular expression that matches an article tag and captures all the text inside it.

Here is my article tag:

<article id="post-82" class="post-82 post type-post status-publish format-standard hentry category-publishing">
        <div class="entry-content clearfix">        
                         <div class="abh_box abh_box_up abh_box_drop-down"><ul class="abh_tabs"> <li class="abh_about abh_active">
<p>With India playing host,</p>
    <footer class="entry-meta-bar clearfix"><div class="entry-meta clearfix">
               <span class="comments"><a href="http://www.test.com/blog/emerging-markets/#respond">No Comments</a></span>           

      </div></footer>
    </article>

I need everything inside the article tag. So far I have tried the following regular expressions:

<article (.*?)</article>

 (?:<article>)(.*?)(?:</article>)

Neither of them works. Please help.

Answer 1

Don't use regular expressions to parse HTML. Use an HTML parser such as Html Agility Pack.

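using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlContent);

// "//article" searches the whole document; a bare "article" only matches direct children of the root node.
// SelectSingleNode returns null when nothing matches; result.InnerHtml holds the content inside the tag.
var result = doc.DocumentNode.SelectSingleNode("//article");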

Answer 2

You can try this regular expression:

<article\b[^>]*>((.|\n)*?)<\/article>

https://regex101.com/r/oOJ9bt/2
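To make the regex approach concrete, here is a minimal Java sketch of why the attempts in the question fail and how a multi-line match can work. By default `.` does not match line breaks, so `<article (.*?)</article>` cannot span the lines of the sample, and the second attempt requires a literal `<article>` with no attributes. The class and method names (`ArticleRegex`, `articleBodies`) are made up for illustration, and the sketch assumes `<article>` elements are not nested; an HTML parser remains the more robust option.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArticleRegex {
    // "<article", a word boundary, any attributes, then a lazy body up to the first "</article>".
    // DOTALL lets '.' match line terminators, which the attempts in the question lacked.
    static final Pattern ARTICLE = Pattern.compile(
            "<article\\b[^>]*>(.*?)</article>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // The first attempt from the question, compiled with default flags for comparison.
    static final Pattern FIRST_ATTEMPT = Pattern.compile("<article (.*?)</article>");

    /** Returns the inner HTML of every (non-nested) article element found in html. */
    static List<String> articleBodies(String html) {
        List<String> bodies = new ArrayList<>();
        Matcher m = ARTICLE.matcher(html);
        while (m.find()) {
            bodies.add(m.group(1)); // group 1 is everything between the tags
        }
        return bodies;
    }

    public static void main(String[] args) {
        String html = "<article id=\"post-82\" class=\"post-82\">\n"
                + "<p>With India playing host,</p>\n"
                + "</article>";
        // prints false: '.' stops at the first line break
        System.out.println(FIRST_ATTEMPT.matcher(html).find());
        System.out.println(articleBodies(html));
    }
}
```

The same idea carries over to other regex engines: either enable the "dot matches newline" option or use a construct such as `[\s\S]*?` in place of `.*?`.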

Answer 3

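You don't want to use regex for this kind of task, and you don't need to load an XML parser either. Just call .getAttribute("innerHTML") on the element whose HTML you want. (This assumes the page is already loaded in a Selenium WebDriver instance, here called driver.)

For example, this fetches just the article element from the HTML you provided, looked up by its ID.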

System.out.println(driver.findElement(By.id("post-82")).getAttribute("innerHTML"));

This gets the HTML of every article on the page.

for (WebElement article : driver.findElements(By.tagName("article")))
{
    System.out.println(article.getAttribute("innerHTML"));
}

Unable to build a regular expression to match the article tag

3 answers: