Question

I'm looking for a RegEx that returns the first [n] words of a paragraph, or the whole paragraph if it contains fewer than [n] words.

For example, assuming I need at most the first 7 words:

<p>one two <tag>three</tag> four five, six seven eight nine ten.</p><p>ignore</p>

I'd get:

one two <tag>three</tag> four five, six seven

Using the same RegEx on a paragraph containing fewer than the requested number of words:

<p>one two <tag>three</tag> four five.</p><p>ignore</p>

would simply return:

one two <tag>three</tag> four five.

My attempt at this problem resulted in the following RegEx:

^(?:\<p.*?\>)((?:\w+\b.*?){1,7}).*(?:\</p\>)

However, this returns only the first word, "one". It doesn't work. I think the .*? (after the \w+\b) is causing the problem.

Where am I going wrong? Can anyone suggest a RegEx that works?

FYI, I'm using the .NET 3.5 RegEx engine (via C#).

Many thanks

Answer 1

OK, a complete re-edit to acknowledge the new "spec" :)

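I'm pretty sure you can't do this with a single regex. The best tool for the job is definitely an HTML parser. The closest I can get with regexes is a two-step approach.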

First, isolate the contents of each paragraph with:

<p>(.*?)</p>

If paragraphs can span multiple lines, you need to set RegexOptions.Singleline.
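As a rough illustration of this first step, here is a minimal C# sketch. The class and method names (ParagraphExtractor, GetParagraphs) are made up for this example. It assumes bare <p> tags without attributes, exactly as the pattern above does, and it is not a substitute for a real HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParagraphExtractor
{
    // Hypothetical helper: returns the inner content of every <p>...</p> pair.
    public static List<string> GetParagraphs(string html)
    {
        var paragraphs = new List<string>();

        // Singleline makes '.' match newlines too, so a paragraph may span lines.
        foreach (Match m in Regex.Matches(html, "<p>(.*?)</p>", RegexOptions.Singleline))
        {
            // Group 1 is the lazily captured paragraph content.
            paragraphs.Add(m.Groups[1].Value);
        }

        return paragraphs;
    }
}
```

Calling GetParagraphs on the second sample from the question should yield two entries, the first being the content of the paragraph of interest and the second being "ignore".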

Then, in the next step, iterate over your matches and apply the following regex once to each match's Groups[1].Value:

((?:(\S+\s+){1,6})\w+)

This will match the first seven items separated by spaces/tabs/newlines, ignoring any trailing punctuation or non-word characters.

But it will treat a tag that is delimited by whitespace as one of those items, i.e. in

One, two three <br\> four five six seven

it will only match up to six. I guess that, as far as regexes go, there is no way around that.
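The second step could be sketched in C# roughly as follows. FirstWords and FirstSevenItems are hypothetical names, and returning the whole input when the regex does not match (for instance, a one-word paragraph, since the pattern needs at least one whitespace-separated item before the final word) is an assumption added here, not part of the regex above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstWords
{
    // Hypothetical helper: applies the second regex to one paragraph's content.
    public static string FirstSevenItems(string paragraphContent)
    {
        // Up to six "non-whitespace run + whitespace" items, then one run of word characters.
        Match m = Regex.Match(paragraphContent, @"((?:(\S+\s+){1,6})\w+)");

        // Assumption: fall back to the whole content when nothing matches.
        return m.Success ? m.Value : paragraphContent;
    }
}
```

For the first sample paragraph in the question this returns the text up to and including "seven". For the <br\> example it stops at "six", because the space-delimited tag is counted as one of the seven items.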

Answer 2

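1. Use an HTML parser to get the first paragraph, flattening its structure (i.e. remove the decorating HTML tags inside the paragraph).
2. Search for the position of the nth whitespace character.
3. Take the substring from 0 to that position.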
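Steps 2 and 3 could look something like the sketch below, applied to text that has already been flattened by an HTML parser. WordCutter and TakeUntilNthWhitespace are illustrative names. The sketch assumes the text has no leading whitespace and that words are separated by single whitespace characters; otherwise the nth whitespace character does not line up with the nth word.

```csharp
using System;

public static class WordCutter
{
    // Hypothetical helper: returns the text before the n-th whitespace character,
    // or the whole text if it contains fewer than n whitespace characters.
    public static string TakeUntilNthWhitespace(string flatText, int n)
    {
        if (n <= 0) return string.Empty;

        int seen = 0;
        for (int i = 0; i < flatText.Length; i++)
        {
            // Count whitespace characters; cut at the n-th one.
            if (char.IsWhiteSpace(flatText[i]) && ++seen == n)
                return flatText.Substring(0, i);
        }

        // Fewer than n whitespace characters: the text has at most n words.
        return flatText;
    }
}
```

With n = 7 and the flattened first sample ("one two three four five, six seven eight nine ten."), the cut falls right after "seven". The shorter sample comes back unchanged.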

Edit: I removed the regex proposal for steps 2 and 3, since it was wrong (thanks to the commenter). Also, the HTML structure needs to be flattened.

Answer 3

I had the same problem and combined a few Stack Overflow answers into this class. It uses HtmlAgilityPack, which is a better tool for the job. Call:

 Words(string html, int n)

to get the first n words.

using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;


namespace UmbracoUtilities
{
    public class Text
    {
      /// <summary>
      /// Return the first n words in the html
      /// </summary>
      /// <param name="html"></param>
      /// <param name="n"></param>
      /// <returns></returns>
      public static string Words(string html, int n)
      {
        string words = html, n_words;

        words = StripHtml(html);
        n_words = GetNWords(words, n);

        return n_words;
      }


      /// <summary>
      /// Returns the first n words in text
      /// Assumes text is not a html string
      /// http://stackoverflow.com/questions/13368345/get-first-250-words-of-a-string
      /// </summary>
      /// <param name="text"></param>
      /// <param name="n"></param>
      /// <returns></returns>
      public static string GetNWords(string text, int n)
      {
        // Collapse runs of whitespace into single spaces and trim both ends,
        // so that splitting never yields an empty leading entry
        // http://stackoverflow.com/questions/1279859/how-to-replace-multiple-white-spaces-with-one-white-space
        string cleanedString = System.Text.RegularExpressions.Regex.Replace(text, @"\s+", " ").Trim();
        IEnumerable<string> words = cleanedString.Split(' ').Take(n);

        // Join the words with single spaces (ToArray keeps this compatible with .NET 3.5)
        return string.Join(" ", words.ToArray());
      }


      /// <summary>
      /// Returns a string of html with tags removed
      /// </summary>
      /// <param name="html"></param>
      /// <returns></returns>
      public static string StripHtml(string html)
      {
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);

        var root = document.DocumentNode;
        var stringBuilder = new StringBuilder();

        foreach (var node in root.DescendantsAndSelf())
        {
          if (!node.HasChildNodes)
          {
            string text = node.InnerText;
            if (!string.IsNullOrEmpty(text))
              stringBuilder.Append(" " + text.Trim());
          }
        }

        return stringBuilder.ToString();
      }



    }
}

Merry Christmas!

Need a RegEx to return the first paragraph or the first n words
