Question

I have a set of about 10 million items that look like this:

1234word:something
4321soup:ohnoes
9cake123:itsokay
[...]

Now I need to quickly check whether an item with a specific start is in the set. For example:

x = "4321soup"
is x+* in a_set:
     print ("something that looks like " +x +"* is in the set!")

How can I do this? I considered using regular expressions, but I don't know whether that is possible in this case.

Answer 1

^4321soup.*$

Yes, it is possible. Try a match: if you get a match object back, the item is there; if the result is None, it is not.

Don't forget to set the m and g flags.

See the demo.

http://regex101.com/r/lS5tT3/28
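The m and g flags are regex101 settings. In Python, m corresponds to re.MULTILINE and there is no g flag (re.search returns the first match, re.finditer all of them). A minimal sketch of the same pattern in Python, assuming the items are joined into one string with one item per line:

```python
import re

# Sample data from the question, joined one item per line (an assumption).
text = "1234word:something\n4321soup:ohnoes\n9cake123:itsokay"

prefix = "4321soup"
# re.escape guards against prefixes that contain regex metacharacters.
# re.MULTILINE makes ^ and $ match at the start and end of every line.
pattern = re.compile(r"^" + re.escape(prefix) + r".*$", re.MULTILINE)

match = pattern.search(text)
found = match.group(0) if match else None
print(found)
```

Here found holds the whole matching line, or None when no item starts with the prefix.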

Answer 2

If you only want to match the beginning of the string, use str.startswith instead of a regular expression, especially given that you have ~10 million lines.

#!/usr/bin/python

s = "1234word:something"  # not named str, which would shadow the built-in type
print(s.startswith('1234'))  # True

In Python, assuming your content is in a file named "mycontentfile":

>>> with open("mycontentfile","r") as  myfile:
...     data=myfile.read()
... 
>>> for item in data.split("\n"):
...     if item.startswith("4321soup"):
...         print(item.strip())
... 
4321soup:ohnoes
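With ~10 million lines, reading the whole file into memory first may not be necessary. A minimal sketch, assuming the same one-item-per-line layout, that streams the file and stops at the first hit. The helper name has_prefix_in_file and the temporary demo file are made up for illustration:

```python
import os
import tempfile


def has_prefix_in_file(path, prefix):
    """Return True as soon as any line in the file starts with prefix."""
    with open(path, "r") as handle:
        # Iterating over the file object reads one line at a time;
        # any() stops at the first matching line.
        return any(line.startswith(prefix) for line in handle)


# Demo file standing in for "mycontentfile" (hypothetical contents).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1234word:something\n4321soup:ohnoes\n9cake123:itsokay\n")
    demo_path = tmp.name

found = has_prefix_in_file(demo_path, "4321soup")
missing = has_prefix_in_file(demo_path, "nosuchprefix")
os.remove(demo_path)
print(found, missing)
```

This is still a linear scan, so every lookup costs up to one pass over the file.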

Answer 3

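A hash set is very good at checking for the existence of a complete element. In your task, you need to check for the existence of a starting part, not a complete element. That is why a tree or a sorted sequence is a better fit than a hashing mechanism (the internal implementation of Python's set).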
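The sorted-sequence idea can be sketched with the standard bisect module. This is one possible illustration, not necessarily what the answer had in mind; the function name has_prefix is made up:

```python
import bisect

# Sample items from the question; sorting is a one-time O(n log n) cost.
sorted_items = sorted([
    "1234word:something",
    "4321soup:ohnoes",
    "9cake123:itsokay",
])


def has_prefix(sorted_seq, prefix):
    """Return True if any string in the sorted sequence starts with prefix."""
    # bisect_left finds the first position whose element is >= prefix.
    # All strings starting with prefix sort contiguously from that position,
    # so checking that single element is enough.
    i = bisect.bisect_left(sorted_seq, prefix)
    return i < len(sorted_seq) and sorted_seq[i].startswith(prefix)


print(has_prefix(sorted_items, "4321soup"))
print(has_prefix(sorted_items, "soup"))
```

Each lookup is a binary search, so it takes O(log n) comparisons rather than a scan over all 10 million items.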

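However, judging by your examples, it looks like you want to check the whole part before ':'. For that you can build a set of these first parts and then check for existence against that set: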

items = set(x.split(':')[0] for x in a_set) # a_set can be any iterable

def is_in_the_set(x):
    return x in items

is_in_the_set("4321soup")  # True

Answer 4

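In this case, the important thing is how to iterate over the set efficiently. Since you have to check each item until you find a match, the best way is to create a generator (a generator expression) and run it until a result is found. To do that, I would use next.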

a_set = {'1234word:something', '4321soup:ohnoes', '9cake123:itsokay'}  # a huge set
prefix = '4321soup'  # prefix you want to search for
# Pass next() a parenthesized generator expression with the match condition;
# it returns the first match, or False once the generator is exhausted.
result = next((x for x in a_set if x.startswith(prefix)), False)

Answer 5

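The most reasonable solution I can see right now would be something like a sorted tree of dicts (key = x and value = y), with the tree sorted by the dicts' keys. - no idea how to do that though - Daedalus Mythos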

No need for a tree... just a dictionary. If you have the key:value pairs stored in a dictionary, say itemdict, you can write

x = "4321soup"
if x in itemdict:
    print ("something that looks like "+x+"* is in the set!")
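The answer does not show how itemdict is built. One possible sketch, assuming every item has the form key:value and that keys are unique (a later duplicate key would overwrite an earlier one):

```python
# Sample items from the question (a stand-in for the full data).
lines = [
    "1234word:something",
    "4321soup:ohnoes",
    "9cake123:itsokay",
]

# Split each item once at the first ':' into a (key, value) pair.
itemdict = dict(line.split(":", 1) for line in lines)

x = "4321soup"
if x in itemdict:
    print("something that looks like " + x + "* is in the set! value: " + itemdict[x])
```

Note that this only finds x when it is the complete part before ':'. A shorter prefix such as "4321" is not a key and will not be found.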

How to find items with a specific starting string in a set

5 answers: