Question

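I want to find "help" in a sentence. By itself that is easy. However, in some cases the word may be written as heelp or hhelp, basically with more characters than it normally has. Of course, some examples are more realistic than others.

The basic regex for finding "help" (setting capitalization aside; (?i) can cover that) is:

(help)

However, this regex only detects the exact word and does not account for extra characters that may have been added.

Replacing double characters is not an option, because some words normally (<---) contain the same character twice in a row.

So, using regex, is there any way to find words that contain "help" in one form or another?

Test text (with a note on whether the regex should find it)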

heelp (match)
help (match)
help (match)
heeeelp (match)
hhhheeeelllllpppp (match)
heeeklp (match)
hlep (no match)
helper (no match)
helperp (no match) 
hhhheeeeekklllllpppp (match)
hpeepr33erlrpetertp (no match)
heplp (match)
hepl (no match)
heeeeellllllllpppppppppppl (no match)

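Digits should be ignored.

h+e+l+p (word boundaries aside) would exclude the instance heplp.

The number of each character varies from case to case. That is why I cannot simply build a String array of variants.

In case it is relevant, the programming language I am using is Java. Also, case does not matter. If necessary, I can lowercase the input before checking, or add the case-insensitive flag.

TL;DR: the goal is to detect a target word (here "help") when its characters appear in order, possibly repeated and possibly with other characters in between.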

Answer 1

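I will show the steps needed to build a regex for the word help, but the requirements are unclear and the rules are not strict, so there will usually be some drawbacks.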

\bh+[a-z&&[^e]]*e+[a-z&&[^le]]*l+[a-z&&[^ p  l  e ]]*p+\b
           ^             ^^              ^  ^  ^
           |             ||              |  |--|-> [#2]
           |             ||              |-> [#1]
           |             ||-> Previous char(s) [#2]
           |             |-> [#1]
           |-> Next immediate character [#1]

[a-z&&[^lep]] means any character from a to z except l, e, or p.

Copy/paste regex:

\bh+[a-z&&[^e]]*e+[a-z&&[^le]]*l+[a-z&&[^lep]]*p+\b
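To check the pattern against the sample inputs from the question, a small harness like the following can be used. This is only a sketch: the class name HelpRegexCheck and the helper isHelp are made up for illustration, and Pattern.CASE_INSENSITIVE is added on the assumption that case should be ignored, as the question states.

```java
import java.util.regex.Pattern;

// Hypothetical harness; the names are illustrative and not from the answer.
class HelpRegexCheck {
    // The copy/paste regex from the answer. CASE_INSENSITIVE is an added assumption.
    static final Pattern HELP = Pattern.compile(
            "\\bh+[a-z&&[^e]]*e+[a-z&&[^le]]*l+[a-z&&[^lep]]*p+\\b",
            Pattern.CASE_INSENSITIVE);

    // True if the whole word is accepted by the pattern.
    static boolean isHelp(String word) {
        return HELP.matcher(word).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "heelp", "help", "heeeelp", "hhhheeeelllllpppp", "heeeklp", "hlep",
            "helper", "helperp", "hhhheeeeekklllllpppp", "hpeepr33erlrpetertp",
            "heplp", "hepl", "heeeeellllllllpppppppppppl"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + (isHelp(s) ? "match" : "no match"));
        }
    }
}
```

To scan a whole sentence rather than a single word, the same Pattern can be used with Matcher.find() instead of matches().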


Answer 2

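This is not an easy task; you need a good NLP (natural language processing) library.

For Java, that could be the Apache OpenNLP project.

For Perl there are modules such as Lingua::Stem (if you are after stemming), and PHP has soundex (if you are after similar-sounding words).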

Answer 3

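I propose the following (general) solution:

Compress each word so that it contains no repeated letters in a row
Get a dictionary of words to match against
Pick the dictionary word with the smallest Levenshtein distance

The compression should produce this: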

heelp -> help
help -> help
heeeelp -> help
hhhheeeelllllpppp -> help
heeeklp -> heklp
hlep -> hlep
helper -> helper
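The compression step can also be written as a single regex replacement. A minimal sketch, assuming a hypothetical helper class RunCompressor (not part of this answer's code): the backreference \1 matches a repeat of the captured character, so each run collapses to one character.

```java
// Hypothetical helper; the name is illustrative.
class RunCompressor {
    // (.)\1+ matches a character followed by one or more copies of itself;
    // replacing the run with $1 keeps a single copy.
    static String compress(String s) {
        return s.replaceAll("(.)\\1+", "$1");
    }

    public static void main(String[] args) {
        for (String w : new String[] {"heelp", "hhhheeeelllllpppp", "heeeklp", "helper"}) {
            System.out.println(w + " -> " + compress(w));
        }
    }
}
```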

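The Levenshtein distance between two words (LD(word1, word2)) is the minimum number of single-character insertions, deletions, or substitutions needed to turn one word into the other. For example: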

hhhheeeelllllpppp -> help -> LD(help, help) = 0, LD(help, helper) = 2 <- help match
heeeklp -> heklp -> LD(heklp, help) = 1, LD(heklp, helper) = 3 <- help match
hlep -> hlep -> LD(hlep, help) = 2, LD(hlep, helper) = 3 <- help match
helper -> helper -> LD(helper, help) = 2, LD(helper, helper) = 0 <- helper match

Here is my solution:

public class LevenshteinDistance {
    private static int minimum(int a, int b, int c) {
        return Math.min(Math.min(a, b), c);
    }

    // Dynamic-programming edit distance: distance[i][j] is the cost of turning
    // the first i characters of lhs into the first j characters of rhs.
    public static int computeLevenshteinDistance(CharSequence lhs, CharSequence rhs) {
        int[][] distance = new int[lhs.length() + 1][rhs.length() + 1];

        for (int i = 0; i <= lhs.length(); i++)
            distance[i][0] = i;
        for (int j = 1; j <= rhs.length(); j++)
            distance[0][j] = j;

        for (int i = 1; i <= lhs.length(); i++)
            for (int j = 1; j <= rhs.length(); j++)
                distance[i][j] = minimum(
                        distance[i - 1][j] + 1,      // deletion
                        distance[i][j - 1] + 1,      // insertion
                        distance[i - 1][j - 1]       // substitution (free if equal)
                                + ((lhs.charAt(i - 1) == rhs.charAt(j - 1)) ? 0 : 1));

        return distance[lhs.length()][rhs.length()];
    }

    // Collapse every run of a repeated character to a single character.
    public static String compress(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // Compare primitive chars; != on boxed Characters compares references.
            if (i == 0 || c != s.charAt(i - 1)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] argv) {
        String[] strings = {"heelp", "help", "heeeelp", "hhhheeeelllllpppp", "heeeklp", "hlep", "helper"};
        String[] dict = {"help", "helper"};

        for (String s : strings) {
            String c = compress(s);
            String match = "";
            // Start from the distance to the empty string as an upper bound.
            int minDistance = computeLevenshteinDistance(c, "");

            for (String d : dict) {
                int distance = computeLevenshteinDistance(c, d);
                System.out.println("compressed: " + c + " dict: " + d + " distance: " + distance);
                if (distance < minDistance) {
                    match = d;
                    minDistance = distance;
                }
            }

            System.out.println(s + " matches " + match);
        }
    }
}

Here is the output:

compressed: help dict: help distance: 0
compressed: help dict: helper distance: 2
heelp matches help
compressed: help dict: help distance: 0
compressed: help dict: helper distance: 2
help matches help
compressed: help dict: help distance: 0
compressed: help dict: helper distance: 2
heeeelp matches help
compressed: help dict: help distance: 0
compressed: help dict: helper distance: 2
hhhheeeelllllpppp matches help
compressed: heklp dict: help distance: 1
compressed: heklp dict: helper distance: 3
heeeklp matches help
compressed: hlep dict: help distance: 2
compressed: hlep dict: helper distance: 3
hlep matches help
compressed: helper dict: help distance: 2
compressed: helper dict: helper distance: 0
helper matches helper

Answer 4

\bh+\w{0,1}e+\w{0,1}l+\w{0,1}p+\b

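Tested on regex101.com with the JavaScript flavor, this gives the desired results for the sample input. It is "tighter" than using "*": it allows only zero or one stray character. This fits my impression that you allow any number of the correct letters, but only a few wrong letters between two correct ones.

It will match "help" with any number (> 0) of each correct letter, in the correct order. Between every two (groups of) correct letters, zero or one other "word" character (digit, letter, "_") is allowed. The word must start with the first correct letter and end with the last correct letter.

To be more precise about which characters are allowed between the correct letters, you can use [alltheallowedletters] in case you do not like the \w set.

I wrote ? as {0,1} to show the flexibility of that syntax.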
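Since the question targets Java, here is a sketch of the pattern above with Java string escaping. The class and method names are hypothetical. One observation when running it: because at most one stray character is allowed per gap, an input with two stray characters in a row, such as hhhheeeeekklllllpppp, is not accepted.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper; the names are illustrative.
class BoundedStrayCheck {
    // Same regex as in the answer, with backslashes doubled for a Java string literal.
    static final Pattern HELP =
            Pattern.compile("\\bh+\\w{0,1}e+\\w{0,1}l+\\w{0,1}p+\\b");

    // True if the whole word is accepted by the pattern.
    static boolean isHelp(String word) {
        return HELP.matcher(word).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"heeeklp", "heplp", "helper", "hlep", "hhhheeeeekklllllpppp"};
        for (String s : samples) {
            System.out.println(s + " -> " + isHelp(s));
        }
    }
}
```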

Answer 5

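This works. Try it in any online regex tester to make sure it is what you are looking for. Note: this allows any number of unwanted characters; if you want at most one, replace the "\w*" pattern with "\w?" (and do the same in the Java code below).

\bh+\w*e+\w*l+\w*p+\b

Update

Here is Java code that builds such a regex for any word: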

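// Builds \b + (letter "+" "\w*") for every letter but the last, then last letter "+" and \b.
// Assumes a non-empty word made of plain letters (no regex metacharacters).
public static String getRegExForWord(String word) {
    char[] chars = word.toCharArray();
    StringBuilder pattern = new StringBuilder("\\b");
    for (int i = 0; i < chars.length - 1; i++) {
        pattern.append(chars[i]).append("+\\w*");
    }
    return pattern.append(chars[chars.length - 1]).append("+\\b").toString();
}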
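A usage sketch for the method above, wrapped in a hypothetical class WordRegexDemo so that it runs on its own. The method body is copied from the answer; the sample words come from the question. Note that because \w* accepts any run of word characters, helperp is accepted by this pattern even though the question lists it as "no match".

```java
import java.util.regex.Pattern;

// Hypothetical wrapper class; only getRegExForWord comes from the answer.
class WordRegexDemo {
    // Each letter one or more times, with any run of word characters allowed in between.
    // Assumes a non-empty word made of plain letters (no regex metacharacters).
    static String getRegExForWord(String word) {
        char[] chars = word.toCharArray();
        StringBuilder pattern = new StringBuilder("\\b");
        for (int i = 0; i < chars.length - 1; i++) {
            pattern.append(chars[i]).append("+\\w*");
        }
        return pattern.append(chars[chars.length - 1]).append("+\\b").toString();
    }

    public static void main(String[] args) {
        String regex = getRegExForWord("help");
        System.out.println(regex);
        Pattern p = Pattern.compile(regex);
        for (String w : new String[] {"heeeklp", "hlep", "helperp"}) {
            System.out.println(w + " -> " + p.matcher(w).matches());
        }
    }
}
```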

Answer 6

Updated version

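import java.util.regex.Matcher;
import java.util.regex.Pattern;

// h is the input text (a String). Put every whitespace-separated word on its
// own line, so that '.' (which does not match line terminators) cannot span two words.
h = h.trim();
h = h.replaceAll("\\s+", "\n");

// The reluctant .*? allows any characters between the required letters.
// Pattern.MULTILINE only affects ^ and $, which this pattern does not use.
Pattern p = Pattern.compile("(h+.*?e+.*?l+.*?p+)", Pattern.MULTILINE);
Matcher m = p.matcher(h);
while (m.find()) {
    System.out.println(m.group(1));
}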

Answer 7

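After finding help with a normal regex, you need to use "edit distance" to find similar patterns. It is a metric used for spell checking and word suggestions. For example, if you return all words with an edit distance of 1 from help, you get: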

helpp
heelp
hellp
hel
belp
...

Edit distance 2 from help:

heeelp
helppp
hhellp

With NLTK (a Python NLP package), this can be done as follows:

import nltk

my_word = 'help'
corpus = {'w1', 'w2'}  # Set of all words in your corpus

word_distance = {}

for word in corpus:
    # Compute the distance once per word and keep only close matches
    distance = nltk.edit_distance(my_word, word)
    if distance <= 2:
        word_distance[word] = distance

# Sort by distance, largest first (drop reverse=True to list the closest words first)
results = sorted(word_distance, key=word_distance.get, reverse=True)
print(results[:10])

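You can impose extra restrictions with a regex to get better results. For example, accept a word returned by the nltk.edit_distance filter only if it starts with h and ends with p.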
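Because the question is about Java, the combination of an edit-distance threshold with a regex restriction could look roughly like this there. This is a sketch under assumptions: the class EditDistanceFilter, the method similar, and the sample corpus are all hypothetical, and the distance function is a plain Levenshtein implementation rather than NLTK's.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical example class; names and corpus are illustrative.
class EditDistanceFilter {
    // Levenshtein distance using two rolling rows of the DP table.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the rows
        }
        return prev[b.length()];
    }

    // Keep words that start with h, end with p, and are within maxDistance of target.
    static List<String> similar(String target, List<String> corpus, int maxDistance) {
        List<String> out = new ArrayList<>();
        for (String w : corpus) {
            if (w.matches("h.*p") && editDistance(target, w) <= maxDistance) {
                out.add(w);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList("heelp", "hellp", "hel", "belp", "heeelp", "helper");
        System.out.println(similar("help", corpus, 2));
    }
}
```

With this corpus, hel, belp, and helper are within distance 2 of help but are dropped by the h...p restriction.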

Regex: finding a word with extra characters
