Question

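I am trying to check whether an input string contains names. The data I have is every first and last name from Facebook.

What I want my program to do is take the input "johnsmith123" (for example) and return ['john', 'smith', '123']. If 'johns' and 'mith' are also names in the list, I would want it to return ['john', 'smith', '123', 'johns', 'mith']. Basically: every possible combination of words from the list that can make up the input phrase.

I know that regex tries are really fast for lookups. Using a tool called RegexFormat 7, I converted the wordlist into a 50 MB regex trie.

Here is the code I am trying to run, using that trie: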

import io
import re

with io.open('REGEXES.rx.txt', encoding='latin-1') as myfile:
    TempRegex = myfile.read()

regex = re.compile(TempRegex)

while True:
    Password = input("Enter a phrase to be split: ")

    Words = re.findall(regex, Password)

    print(Words)

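The program never reaches the input prompt. I assume that compiling such a large regex trie takes a very long time.

What I need to know is whether there is some way to run this compilation once, save the regex object to disk, and then simply load the precompiled object in the module instead of compiling it every time.

The compilation is what takes up so much time. I know the search itself would be fast. If I could do the compilation just once, I could let it run overnight...

If this is not feasible, what else can I do? The data I have is a 100 MB wordlist of every first and last name from Facebook, plus the regex trie derived from that wordlist.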

Answer 1

I doubt a single massive regex is the best solution. A single hash table of all possible names would probably be faster.
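
On the question of saving the compiled object to disk: as far as I know, pickling a compiled pattern in CPython's `re` module stores only the pattern source and flags, and unpickling compiles it again. A small check you can run yourself, with a toy pattern standing in for the 50 MB trie (the pattern here is made up for illustration):

```python
import pickle
import re

# Toy pattern standing in for the 50 MB trie (hypothetical example data).
pattern = re.compile(r"(?:john|smith)")

payload = pickle.dumps(pattern)

# The payload carries the pattern source text, not a compiled state machine.
contains_source = pattern.pattern.encode("utf-8") in payload

# Unpickling goes back through the re module's compile step.
restored = pickle.loads(payload)

print(contains_source)
print(restored.pattern == pattern.pattern)
```

So the compile cost would most likely be paid on every load either way, which is another reason to prefer plain set lookups.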

import re

all_first_names = {'dan', 'bob', 'danny'}

username = 'dannysmith123'

# Get "extra" parts of the username if they exist
m = re.match(r'^([a-zA-Z]+)(.*)$', username)
name, extra = m.group(1), m.group(2)

# Get a list of possible first/last name splits
# [('d', 'annysmith'), ('da', 'nnysmith'), ...]
name_splits = [(name[:i], name[i:]) for i in range(1, len(name)+1)]

# Check each one of these splits to see if the first name
# is present in the master first name list, if so, add it to
# the list of valid matches.
match_list = []
for ns in name_splits:
    if ns[0] in all_first_names:
        match_list.extend(ns)
        if extra:
            match_list.append(extra)
            extra = None
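
The loop above only checks the first name against the set. If you want every way of splitting the letters into known names, as the question asks, the same set lookup can be applied recursively. This is a rough sketch rather than a tuned implementation: `NAMES`, `segmentations` and `split_username` are names I made up, the sample set is tiny, and I am assuming that anything after the leading letters (such as '123') is kept as a single trailing token.

```python
import re
from functools import lru_cache

# Hypothetical sample; the real set would be loaded from the wordlist file.
NAMES = frozenset({"john", "johns", "smith", "mith"})
MAX_LEN = max(map(len, NAMES))  # longest name, bounds the prefix search


def segmentations(text):
    """Return every way to split `text` into consecutive entries of NAMES."""

    @lru_cache(maxsize=None)
    def split_from(start):
        if start == len(text):
            return [[]]  # exactly one way to split the empty remainder
        results = []
        # Only try prefixes no longer than the longest known name.
        for end in range(start + 1, min(len(text), start + MAX_LEN) + 1):
            word = text[start:end]
            if word in NAMES:
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)


def split_username(username):
    # Separate the leading letters from any trailing "extra" (e.g. digits).
    m = re.match(r"([a-zA-Z]*)(.*)$", username, re.DOTALL)
    letters, extra = m.group(1).lower(), m.group(2)
    tail = [extra] if extra else []
    return [parts + tail for parts in segmentations(letters)]


print(split_username("johnsmith123"))
# [['john', 'smith', '123'], ['johns', 'mith', '123']]
```

Each lookup is a hash check, so there is nothing to compile up front. Be aware that the number of splits can grow quickly for long inputs when the name list contains many very short entries.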

Finding names in a string
