An efficient way to check a list of strings for certain words

Time: 2020-06-15 16:11:30

Tags: python

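I am trying to check a list of 20,000 strings against certain words/phrases so that each string is correctly sorted into one of 3 categories.

Here is a sample list of strings: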

  sample = ["the empty bus behind me", "the facility is close", "my order was canceled", "no empty on site", "no bus for me to move"]

So I want to check whether a string contains:

    "empty" and "bus" and "empty" then emptyCount += 1

    "order canceled" or "canceled" then cancelcount += 1

    "empty" or "site" or "no empty on site" then site += 1

I have code that does this, but I don't think it is very efficient, and it may actually be missing some key cases. Any suggestions on how to fix it?

    site = 0
    cancel = 0
    empty = 0
    count = 0
    for i in sample:
        if "empty" and "bus" and "empty" in i:
           emptycount += 1
        elif "order canceled" or "canceled":
           cancelcount += 1
        elif "empty" or "site" or "no empty on site" 
           site += 1

        else:
           count += 1

1 Answer:

Answer 0: (score: 0)

You don't even need to extract anything.

All you need to do is search each string and increment a counter:

sample = ["the empty bus behind me", "the facility is close", "my order was canceled", "no empty on site", "no bus for me to move"]

empty_counter = 0
for string_item in sample:
    if 'empty' in string_item:
        empty_counter += 1

print(empty_counter)
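
The same idea extends to all three categories from the question. Note that `"empty" and "bus" in i` only tests the last literal, because a non-empty string is always truthy, and `elif "order canceled" or "canceled":` is always true for the same reason; each word needs its own `in` test. The following is a minimal sketch, not a definitive implementation: the `classify` helper, the category labels, and the priority order of the rules are assumptions based on my reading of the question.

```python
from collections import Counter

def classify(text):
    """Return a category label for one string.

    The rule order (empty bus, then canceled, then site) is an assumption
    taken from the if/elif chain in the question.
    """
    if "empty" in text and "bus" in text:
        return "empty"
    if "canceled" in text:  # also covers phrases such as "order was canceled"
        return "cancel"
    if "empty" in text or "site" in text:
        return "site"
    return "other"

sample = ["the empty bus behind me", "the facility is close",
          "my order was canceled", "no empty on site",
          "no bus for me to move"]

# Tally how many strings fall into each category in a single pass.
counts = Counter(classify(s) for s in sample)
print(counts)
```

A single pass with plain `in` checks is linear in the number of strings, which is usually fine for 20,000 items.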

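If you are looking for efficiency, I suggest using pandas. It is a data science package designed to process millions of records quickly, so depending on the size of your data it could be considerably faster than a hand-written loop (possibly up to 100x).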

#import pandas package.
import pandas as pd

sample = ["the empty bus behind me", "the facility is close", "my order was canceled", "no empty on site", "no bus for me to move"]

# create a pandas series
sr = pd.Series(sample) 

# flag the strings that contain both 'empty' and 'bus'
results = sr.str.contains('empty') & sr.str.contains('bus')

# total number of matching items (True counts as 1)
print(results.sum())
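
To count all three categories from the question with pandas, one option is to build boolean masks with `Series.str.contains` and combine them. This is only a sketch under assumptions: the mask names, the `counts` dictionary, and the rule priority (empty bus first, then canceled, then site) are my own reading of the question, not something the question states explicitly.

```python
import pandas as pd

sample = ["the empty bus behind me", "the facility is close",
          "my order was canceled", "no empty on site",
          "no bus for me to move"]
sr = pd.Series(sample)

# One boolean mask per keyword; regex=False requests plain substring search.
has_empty = sr.str.contains("empty", regex=False)
has_bus = sr.str.contains("bus", regex=False)
has_canceled = sr.str.contains("canceled", regex=False)
has_site = sr.str.contains("site", regex=False)

# Combine the masks so that each string lands in exactly one category
# (the priority order is an assumption, mirroring an if/elif chain).
is_empty = has_empty & has_bus
is_cancel = ~is_empty & has_canceled
is_site = ~is_empty & ~is_cancel & (has_empty | has_site)
is_other = ~(is_empty | is_cancel | is_site)

counts = {
    "empty": int(is_empty.sum()),
    "cancel": int(is_cancel.sum()),
    "site": int(is_site.sum()),
    "other": int(is_other.sum()),
}
print(counts)
```

Whether this beats a plain loop depends on the data; it is worth timing both on the real 20,000 strings before committing to either.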