Python string search regardless of character order

Time: 2018-02-08 15:01:05

Tags: python

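I want to create an application that checks whether a word entered by the user contains a word (or words) from a separate text file, e.g. input = 'teeth' and the separate file contains the word 'eet'. It should return True regardless of the order of the characters.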

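I looked at the thread "matching all characters in any order in regex", which is neat because it works using set(). The problem is that set() discards duplicates, so it cannot account for repeated characters (e.g. eeet, aaat).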
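To make the limitation concrete, here is a minimal sketch of a set-based check. The helper name `contains_by_set` is hypothetical and not taken from the linked thread. Because a set keeps each character only once, the check cannot tell 'eet' from 'eeet'.

```python
def contains_by_set(substring, string):
    # Set-based check: only asks whether each distinct character occurs at all,
    # not how many times it occurs.
    return set(substring) <= set(string)


print(contains_by_set("eet", "teeth"))   # True
print(contains_by_set("eeet", "teeth"))  # True, although "teeth" has only two "e"s
```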

How can I solve this?

2 Answers:

Answer 0: (score: 2)

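I would create a collections.Counter object from each of the two strings to count their characters, then subtract the two counters and test whether the result is empty. An empty result means the string contains every character of the substring at least as many times as the substring needs it.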

import collections

def contains(substring, string):
    c1 = collections.Counter(string)
    c2 = collections.Counter(substring)
    return not(c2-c1)

print(contains("eeh","teeth"))
print(contains("eeh","teth"))

Result:

True
False

Note that your example is not representative, because a plain substring test already succeeds for it:

>>> "eet" in "teeth"
True

That is why I changed it.
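As a rough sketch of how this could be applied to the original question, the snippet below first shows why an empty difference means "contained", and then filters a word list. The list `file_words` is a hypothetical stand-in for the words read from the separate text file, and `user_input` is an assumed variable name.

```python
import collections


def contains(substring, string):
    c1 = collections.Counter(string)
    c2 = collections.Counter(substring)
    # Counter subtraction keeps only positive counts, so an empty result
    # means every character of substring occurs often enough in string.
    return not (c2 - c1)


# Characters that "teth" lacks in order to cover "eeh".
missing = collections.Counter("eeh") - collections.Counter("teth")
print(missing)  # Counter({'e': 1})

# Hypothetical stand-in for the words read from the separate text file.
file_words = ["eet", "eeet", "hte", "xyz"]
user_input = "teeth"
matches = [word for word in file_words if contains(word, user_input)]
print(matches)  # ['eet', 'hte']
```

In a real application the list would presumably be built by reading the file line by line and stripping whitespace from each word.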

Answer 1: (score: 2)

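I know it is unlikely, but if performance really matters for very large inputs, you can avoid creating the second Counter and iterate directly over the characters of the substring. That allows early termination as soon as you run out of a given character. The session below assumes that `import collections` and `from collections import Counter` were run beforehand.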

In [26]: def contains2(string, substring):
    ...:     c = Counter(string)
    ...:     for char in substring:
    ...:         if c[char] > 0:
    ...:             c[char] -= 1
    ...:         else:
    ...:             return False
    ...:     return True
    ...: 

In [27]: contains2("teeth", "eeh")
Out[27]: True

In [28]: contains2("teeth", "ehe")
Out[28]: True

In [29]: contains2("teth", "ehe")
Out[29]: False

In [30]: contains2("teth", "eeh")
Out[30]: False

In [31]: def contains(string, substring):
    ...:     c1 = collections.Counter(string)
    ...:     c2 = collections.Counter(substring)
    ...:     return not(c2-c1)
    ...: 

In [32]: contains("teth", "ehe")
Out[32]: False

In [33]: contains("teeth", "ehe")
Out[33]: True

In [34]: contains("teeth", "eeh")
Out[34]: True

In [35]: %timeit contains("teeth", "eeh")
19.6 µs ± 94.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [36]: %timeit contains2("teeth", "eeh")
9.59 µs ± 29.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [37]: %timeit contains("friday is a good day", "ss a")
22.9 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [38]: %timeit contains2("friday is a good day", "ss a")
9.52 µs ± 10.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
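For readers who want to try this outside IPython, here is a minimal standalone sketch of the two functions with the import included. Note that it follows this answer's argument order, (string, substring), which is the reverse of Answer 0. The `cases` list simply reuses the inputs from the session above, and the variable name `remaining` is an assumption made for readability.

```python
from collections import Counter


def contains2(string, substring):
    # Count the characters of string once, then consume them one at a time.
    remaining = Counter(string)
    for char in substring:
        if remaining[char] > 0:
            remaining[char] -= 1
        else:
            # Ran out of this character: stop early.
            return False
    return True


def contains(string, substring):
    # Counter-subtraction version, kept for comparison.
    return not (Counter(substring) - Counter(string))


cases = [
    ("teeth", "eeh"),
    ("teeth", "ehe"),
    ("teth", "ehe"),
    ("teth", "eeh"),
    ("friday is a good day", "ss a"),
]
for string, substring in cases:
    # Both implementations should agree on every input.
    assert contains(string, substring) == contains2(string, substring)
    print(repr(string), repr(substring), contains2(string, substring))
```

Timings will differ by machine and Python version, so the figures above are best read as a relative comparison only.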