Question

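I have around 15,000 files to parse, any of which may contain one or more strings/numbers from a list I have. I need to separate out the files that contain a matching string.

Given a string such as 3423423987, it can appear on its own as "3423423987", or as "3423423987_1", "3423423987_1a", or "3423423987-1a", but it can also appear as "2133423423987". However, I only want to detect the sequence when it is not part of another number, that is, only when it stands alone or carries some kind of suffix.

So 3423423987_1 is acceptable, but 13423423987 is not.

I'm having trouble with the regex; to be honest, I haven't used regular expressions much.

Simply put, if I simulate this with a list of possible positives and negatives, I should get 7 hits for the given list. I would like to extract the text up to the end of the word so that I can record it later.

Here is my code: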

def check_text_for_string(text_to_parse, string_to_find):
    import re
    matches = []
    pattern = r"%s_?[^0-9,a-z,A-Z]\W"%string_to_find
    return re.findall(pattern, text_to_parse)

if __name__ =="__main__":
    import re
    word_to_match = "3423423987"
    possible_word_list = [
                    "3423423987_1 the cake is a lie", #Match
                    "3423423987sdgg call me Ishmael",  #Not a match
                    "3423423987 please sir, can I have some more?", #Match
                    "3423423987", #Match
                    "3423423987 ", #Match
                    "3423423987\t", #Match
                    "adsgsdzgxdzg adsgsdag\t3423423987\t", #Match
                    "1233423423987", #Not a match
                    "A3423423987", #Not a match
                    "3423423987-1a\t", #Match
                    "3423423987.0", #Not a match
                    "342342398743635645" #Not a match
                    ]

    print("%d words in sample list."%len(possible_word_list))
    print("Only 7 should match.")
    matches = check_text_for_string("\n".join(possible_word_list), word_to_match)
    print("%d matched."%len(matches))
    print(matches)

But clearly, this is wrong. Can someone help me out?

Answer 1

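It seems you just want to make sure the number is not matched as part of, say, a float. In that case you need lookarounds: a lookbehind and a lookahead that disallow a dot with a digit before and after the number.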

(?<!\d\.)(?:\b|_)3423423987(?:\b|_)(?!\.\d)

See the regex demo.

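To also match the "prefixes" (or, better, call them "suffixes" here), you need to add something like \S* (zero or more non-whitespace characters) or (?:[_-]\w+)? (an optional sequence of a - or _ followed by one or more word characters) at the end of the pattern.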
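A minimal sketch of the suffix idea, assuming the \S* variant is appended to the pattern shown above. The helper name extract_tokens and the sample text are illustrative and not part of the original answer; re.escape is added here only as a precaution for search strings that contain regex metacharacters.

```python
import re

def extract_tokens(text, number):
    """Return each standalone occurrence of `number` plus any trailing suffix."""
    # re.escape protects against regex metacharacters in the search string.
    # The trailing \S* extends the match up to the next whitespace character.
    pattern = r"(?<!\d\.)(?:\b|_)%s(?:\b|_)(?!\.\d)\S*" % re.escape(number)
    return re.findall(pattern, text)

# Illustrative sample text (hypothetical, loosely based on the question's list).
samples = "3423423987_1 the cake\n3423423987-1a\tfoo\n3423423987 please\n1233423423987"
print(extract_tokens(samples, "3423423987"))
# ['3423423987_1', '3423423987-1a', '3423423987']
```

Because the pattern has no capturing groups, re.findall returns the whole matched text, which is the number together with its suffix.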

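Details:

(?<!\d\.) - fail the match if there is a digit and a dot right before the current position
(?:\b|_) - either a word boundary or a _ (needed because _ is a word character)
3423423987 - the search string
(?:\b|_) - same as above
(?!\.\d) - fail the match if a dot followed by a digit comes right after the current position.

So, use

pattern = r"(?<!\d\.)(?:\b|_)%s(?:\b|_)(?!\.\d)"%string_to_find

See the Python demo.

If there can be floats like Text with .3423423987 float value, you will also need to add another lookbehind after the first one: (?<!\.)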
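The effect of the extra lookbehind can be checked with a short sketch. The helper build_pattern and its reject_leading_dot flag are hypothetical names introduced for illustration; the regex pieces themselves are the ones given in this answer.

```python
import re

def build_pattern(number, reject_leading_dot=False):
    """Compile the lookaround pattern, optionally adding the (?<!\\.) lookbehind."""
    lookbehind = r"(?<!\d\.)"
    if reject_leading_dot:
        # Also reject a bare dot immediately before the number (e.g. ".3423423987").
        lookbehind += r"(?<!\.)"
    return re.compile(lookbehind + r"(?:\b|_)%s(?:\b|_)(?!\.\d)" % re.escape(number))

text = "Text with .3423423987 float value"
print(bool(build_pattern("3423423987").search(text)))        # True
print(bool(build_pattern("3423423987", True).search(text)))  # False
```

Without the second lookbehind the number still matches, because the character before the dot is a space rather than a digit, so (?<!\d\.) does not block it.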

Answer 2

You can use this pattern:

(?:\b|^)3423423987(?!\.)(?=\b|_|$)

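(?:\b|^) asserts that there is no other digit (or any other word character) immediately to the left

(?!\.) asserts that the number is not followed by a dot

(?=\b|_|$) asserts that the number is followed by a non-word character, an underscore, or nothing at all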
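As a rough check, this pattern can be run against the sample list from the question, testing each string separately. This is only a sketch: the variable names are illustrative, and re.escape is an added precaution rather than part of the answer.

```python
import re

number = "3423423987"
pattern = re.compile(r"(?:\b|^)%s(?!\.)(?=\b|_|$)" % re.escape(number))

# Sample strings copied from the question.
samples = [
    "3423423987_1 the cake is a lie",                # Match
    "3423423987sdgg call me Ishmael",                # Not a match
    "3423423987 please sir, can I have some more?",  # Match
    "3423423987",                                    # Match
    "3423423987 ",                                   # Match
    "3423423987\t",                                  # Match
    "adsgsdzgxdzg adsgsdag\t3423423987\t",           # Match
    "1233423423987",                                 # Not a match
    "A3423423987",                                   # Not a match
    "3423423987-1a\t",                               # Match
    "3423423987.0",                                  # Not a match
    "342342398743635645",                            # Not a match
]

hits = [s for s in samples if pattern.search(s)]
print(len(hits))  # 7
```

Two things are worth keeping in mind. The lookahead does not consume any text, so the match itself is always the bare number, and extracting the suffix would need something extra such as a trailing \S*. Also, (?!\.) rejects any dot after the number, so a number that ends a sentence ("... 3423423987.") is not matched.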

Matching numbers in files with Python
