Question

How can I modify this regular expression so that a word containing a hyphen or an apostrophe is counted as a single word?

@"^(\w+\b.*?){numOfWords}"

Thanks!

Edit: I am trying to get this expression to return the first n words, while treating a word that contains ' or a hyphen as a single word.

string substringWords = Regex.Match(stringWords, @"^(\w+\b.*?){" + numberOfWords + "}").ToString();

Answer 1

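((?:\w+(?:(?:[-']\w+)+|\b)(?:\s+|$)){3})

This matches the first 3 words as a single match, and a word containing any number of hyphens or apostrophes counts as one word. The words may be separated by any amount of whitespace.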
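A minimal sketch of how this pattern might be plugged into the code from the question. The helper name FirstWords, the added ^ anchor, the variable word count, and the sample sentence are assumptions for illustration; they are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

class FirstWordsDemo
{
    // Hypothetical helper: returns the first n words of the input,
    // treating hyphenated and apostrophized words as single words.
    static string FirstWords(string input, int numberOfWords)
    {
        // Same pattern as in the answer, anchored with ^ (an assumption)
        // and with the fixed count 3 replaced by numberOfWords.
        string pattern = @"^(?:\w+(?:(?:[-']\w+)+|\b)(?:\s+|$)){" + numberOfWords + "}";
        // TrimEnd drops the whitespace matched after the last word.
        return Regex.Match(input, pattern).Value.TrimEnd();
    }

    static void Main()
    {
        // Prints: well-known o'brien rock-and-roll
        Console.WriteLine(FirstWords("well-known o'brien rock-and-roll arrives today", 3));
    }
}
```

Note that the pattern expects every word to be followed by whitespace or the end of the input. If one of the first n words is followed directly by punctuation such as a comma, this anchored version finds no match and the helper returns an empty string.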

Answer 2

As alpha bravo stated here

Try this regular expression:

(?=\S*['-])([a-zA-Z'-]+)

Regex Demo

(?=                 # Look-Ahead
  \S                # <not a whitespace character>
  *                 # (zero or more)(greedy)
  ['-]              # Character in ['-] Character Class
)                   # End of Look-Ahead
(                   # Capturing Group (1)
  [a-zA-Z'-]        # Character in [a-zA-Z'-] Character Class
  +                 # (one or more)(greedy)
)                   # End of Capturing Group (1)
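As a rough illustration of what the breakdown above implies, the sketch below runs the pattern over a sample sentence. The sentence and the class name are assumptions. Because of the look-ahead, the pattern only picks out tokens that contain an apostrophe or a hyphen; on its own it does not count the plain words.

```csharp
using System;
using System.Text.RegularExpressions;

class LookaheadDemo
{
    static void Main()
    {
        string pattern = @"(?=\S*['-])([a-zA-Z'-]+)";
        string input = "a well-known fact about o'brien"; // sample text (assumption)

        // Prints the tokens that contain ' or -, one per line:
        // well-known
        // o'brien
        foreach (Match m in Regex.Matches(input, pattern))
        {
            Console.WriteLine(m.Groups[1].Value);
        }
    }
}
```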

Answer 3

For a string like:

look at the new student o'brien, he's from peter's class

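the regular expression from Vignesh Kumar correctly recognizes o'brien as a word, but it also treats he's and peter's the same way.

In this case, I think o'brien should be one word, while the apostrophes in he's and peter's should be removed.

I think this can be solved by using a predefined set to indicate the exceptions.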
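One possible reading of the predefined-set idea, as a sketch only: keep a list of words whose apostrophe belongs to the word, and remove the apostrophe from every other apostrophized token. The set contents, the token pattern, and the names KeepApostrophe and Normalize are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class ApostropheExceptionsDemo
{
    // Hypothetical exception set: words that keep their apostrophe.
    static readonly HashSet<string> KeepApostrophe =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "o'brien" };

    static string Normalize(string input)
    {
        // Visit every token made of letters joined by apostrophes.
        return Regex.Replace(input, @"[a-zA-Z]+(?:'[a-zA-Z]+)+", m =>
            KeepApostrophe.Contains(m.Value)
                ? m.Value                    // listed exception: leave untouched
                : m.Value.Replace("'", "")); // otherwise remove the apostrophe
    }

    static void Main()
    {
        // Prints: look at the new student o'brien, hes from peters class
        Console.WriteLine(Normalize("look at the new student o'brien, he's from peter's class"));
    }
}
```

After this step the text can be passed to a word-counting pattern that treats the remaining apostrophes as part of a word.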

Answer 4

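I came across this because I had the same problem with Italian, where the apostrophe can have different functions depending on its position. It can of course be an apostrophe between two words, where the first ends with a vowel and the second begins with one, but it can also cut off the initial vowel of a word (elision) or its final syllable (apocopation). So, for example,

perch'io 'l giorno e l'ora ch'i' vidi 'l tuo core un po' triste

(roughly: because I, the day and the hour in which I saw your heart a little sad) contains: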

perch'io -> perch[é] io (because I) (apostrophe)
'l giorno -> [i]l giorno (and the day) (elision)
e l'ora -> e l[a] ora (and the hour) (apostrophe)
ch'i' vidi -> ch[e] i[o] vidi (in which I saw) (apostrophe and elision together)
'l tuo core -> [i]l tuo cuore (elision)
un po' -> un po[co] (apocopation)

In this case, the solution I propose is different:

['][a-zA-Z]+|[\S]+['](?=[a-zA-Z]+)|\b\w+[']?

Or better:

['][a-zA-Zàòèéìù]+|[\S]+['](?=[a-zA-Zàòéèìù]+)|[a-zA-Zàòèìéù]+[']?

if accented letters are taken into account.

Here is the demo
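A small sketch of how the accent-aware pattern splits the example sentence. The spacing of the sentence follows the breakdown list in this answer; the class name and the " | " separator are assumptions for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class ItalianTokensDemo
{
    static void Main()
    {
        string pattern =
            @"['][a-zA-Zàòèéìù]+|[\S]+['](?=[a-zA-Zàòéèìù]+)|[a-zA-Zàòèìéù]+[']?";
        string input = "perch'io 'l giorno e l'ora ch'i' vidi 'l tuo core un po' triste";

        // Collect every match as one token.
        string[] tokens = Regex.Matches(input, pattern)
                               .Cast<Match>()
                               .Select(m => m.Value)
                               .ToArray();

        // Prints: perch' | io | 'l | giorno | e | l' | ora | ch' | i' | vidi | 'l | tuo | core | un | po' | triste
        Console.WriteLine(string.Join(" | ", tokens));
        Console.WriteLine(tokens.Length); // Prints: 16
    }
}
```

Each elided or truncated form keeps its apostrophe on the side where letters were dropped, so perch'io counts as two words (perch' and io) rather than one.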

Regex to count words with hyphens and apostrophes
