Question

I am trying to check a list of 20,000 strings against certain words/phrases in order to sort each string correctly into one of 3 categories.

Here is a sample list of the strings:

  sample = ["the empty bus behind me", "the facility is close", "my order was canceled", "no empty on site", "no bus for me to move"]

So I want to check whether a string contains:

    "empty" and "bus" and "empty" then emptyCount += 1

    "order canceled" or "canceled" then cancelcount += 1

    "empty" or "site" or "no empty on site" then site += 1

I have code that does this, but I don't think it is very efficient, and it may actually be missing some key points. Any suggestions on how to fix it?

    site = 0
    cancel = 0
    empty = 0
    count = 0
    for i in sample:
        if "empty" and "bus" and "empty" in i:
           emptycount += 1
        elif "order canceled" or "canceled":
           cancelcount += 1
        elif "empty" or "site" or "no empty on site" 
           site += 1

        else:
           count += 1

Answer 1

You don't even need to extract anything.

All you need to do is search each string and increment a counter:

sample = ["the empty bus behind me", "the facility is close", "my order was canceled", "no empty on site", "no bus for me to move"]

empty_counter = 0
for string_item in sample:
    if 'empty' in string_item:
        empty_counter += 1

print(empty_counter)
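
The same search-and-count idea can be extended to all three categories from the question. One thing to watch for: an expression such as `"empty" and "bus" in i` only tests the last word, because a non-empty string literal is always truthy, so each word needs its own `in` check. Below is a minimal sketch; the helper name `classify`, the label names, and the order in which the rules are applied are assumptions, not something stated in the question.

```python
from collections import Counter

def classify(text):
    """Return a category label for one string (rule order is an assumption)."""
    # Each keyword gets its own `in` test; `"empty" and "bus" in text`
    # would only check for "bus".
    if "empty" in text and "bus" in text:
        return "empty"
    if "canceled" in text:  # also covers the phrase "order canceled"
        return "cancel"
    if "empty" in text or "site" in text:  # also covers "no empty on site"
        return "site"
    return "other"

sample = ["the empty bus behind me", "the facility is close", "my order was canceled", "no empty on site", "no bus for me to move"]

# Tally how many strings fall into each category in a single pass
counts = Counter(classify(s) for s in sample)
print(counts)
```

A single pass with plain substring checks is linear in the number of strings, which is usually fast enough for a list of 20,000 items.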

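If you are looking for efficiency, I suggest using pandas. It is a data science package built to process millions of records quickly, and depending on the size of your data it may be up to 100 times faster.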

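# import the pandas package
import pandas as pd

sample = ["the empty bus behind me", "the facility is close", "my order was canceled", "no empty on site", "no bus for me to move"]

# create a pandas Series
sr = pd.Series(sample)

# flag the strings that contain both "empty" and "bus"
# (str.match only matches at the start of a string, and regex has no "&"
# operator, so two str.contains masks are combined instead)
results = sr.str.contains('empty', regex=False) & sr.str.contains('bus', regex=False)

# total number of matching items (shape[0] would count every row)
print(results.sum())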
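
For the full three-way split, the pandas approach can be sketched with one boolean mask per keyword. The mask names and the rule precedence (empty, then cancel, then site) are assumptions made for illustration; the question does not say which category wins when a string matches more than one rule.

```python
import pandas as pd

sample = ["the empty bus behind me", "the facility is close", "my order was canceled", "no empty on site", "no bus for me to move"]
sr = pd.Series(sample)

# One boolean mask per keyword; regex=False does a plain substring search
has_empty = sr.str.contains("empty", regex=False)
has_bus = sr.str.contains("bus", regex=False)
has_cancel = sr.str.contains("canceled", regex=False)
has_site = sr.str.contains("site", regex=False)

# Apply the rules in order so each string lands in exactly one category
is_empty = has_empty & has_bus
is_cancel = ~is_empty & has_cancel
is_site = ~is_empty & ~is_cancel & (has_empty | has_site)
is_other = ~(is_empty | is_cancel | is_site)

summary = {
    "empty": int(is_empty.sum()),
    "cancel": int(is_cancel.sum()),
    "site": int(is_site.sum()),
    "other": int(is_other.sum()),
}
print(summary)
```

Whether this beats a plain Python loop depends on the data; for a list of 20,000 short strings it is worth timing both before committing to either.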
