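I am trying to check a list of 20,000 strings against certain words/phrases so that each string is correctly sorted into one of 3 categories.
Here is a sample list of strings: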
sample = ["the empty bus behind me", "the facility is close", "my order was canceled", "no empty on site", "no bus for me to move"]
So I want to check whether a string contains:
"empty" and "bus" and "empty" then emptyCount += 1
"order canceled" or "canceled" then cancelcount += 1
"empty" or "site" or "no empty on site" then site += 1
I have code that does this, but I don't think it is very efficient, and it may actually be missing some key points. Any suggestions on how to fix it?
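site = 0
cancel = 0
empty = 0
count = 0
for i in sample:
    # Each keyword needs its own `in` test: a bare non-empty string such as
    # "empty" is always truthy, so `"empty" and "bus" in i` only tests "bus".
    if "empty" in i and "bus" in i:
        empty += 1
    elif "order canceled" in i or "canceled" in i:
        cancel += 1
    elif "empty" in i or "site" in i or "no empty on site" in i:
        site += 1
    else:
        # none of the keywords matched
        count += 1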
Answer 0 (score: 0)
You don't even need to extract anything.
All you need to do is search and increment a count:
sample = ["the empty bus behind me", "the facility is close", "my order was canceled", "no empty on site", "no bus for me to move"]
empty_counter = 0
for string_item in sample:
    if 'empty' in string_item:
        empty_counter += 1
print(empty_counter)
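The same search-and-count idea extends to all three categories from the question. Here is a minimal sketch, assuming the rules are checked in order and the first match wins. The function name `classify` and the category labels are made up for illustration and are not part of the original code. Note that every keyword gets its own `in` test, since an expression like `"empty" and "bus" in s` only tests the last string.

```python
from collections import Counter

sample = [
    "the empty bus behind me",
    "the facility is close",
    "my order was canceled",
    "no empty on site",
    "no bus for me to move",
]


def classify(text):
    """Return a category label for one string (hypothetical labels, first match wins)."""
    if "empty" in text and "bus" in text:
        return "empty"
    if "canceled" in text:  # also covers "order canceled"
        return "cancel"
    if "empty" in text or "site" in text:  # also covers "no empty on site"
        return "site"
    return "other"


# Tally how many strings fall into each category.
counts = Counter(classify(s) for s in sample)
print(counts)
```

Because these are plain substring tests, "bus" would also match inside a word such as "business". If whole-word matching matters for your data, the tests would need to be adjusted.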
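If you are after efficiency, I would suggest pandas. It is a data science package designed to process millions of rows quickly, so depending on the size of your data it may be considerably faster than a plain Python loop, possibly by as much as 100 times. It is worth timing both approaches on your own data.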
# import the pandas package
import pandas as pd
sample = ["the empty bus behind me", "the facility is close", "my order was canceled", "no empty on site", "no bus for me to move"]
# create a pandas Series
sr = pd.Series(sample)
# boolean mask: True where the string contains both "empty" and "bus"
# (a regex has no "&" operator, so the two tests are combined with pandas' &)
results = sr.str.contains('empty', regex=False) & sr.str.contains('bus', regex=False)
# total number of matching items (True counts as 1)
print(results.sum())
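To cover all three categories from the question in pandas, one option is to build a boolean mask per keyword with `Series.str.contains` and combine the masks with `&`, `|` and `~`. The following is only a sketch, assuming the rules are applied in order with the first match winning; the mask names and the `totals` dictionary are illustrative, not part of the original answer.

```python
import pandas as pd

sample = [
    "the empty bus behind me",
    "the facility is close",
    "my order was canceled",
    "no empty on site",
    "no bus for me to move",
]
sr = pd.Series(sample)

# One boolean mask per keyword (plain substring search, no regex).
has_empty = sr.str.contains("empty", regex=False)
has_bus = sr.str.contains("bus", regex=False)
has_canceled = sr.str.contains("canceled", regex=False)
has_site = sr.str.contains("site", regex=False)

# Apply the rules in order; later rules exclude rows already claimed.
is_empty = has_empty & has_bus
is_cancel = ~is_empty & has_canceled
is_site = ~is_empty & ~is_cancel & (has_empty | has_site)
is_other = ~(is_empty | is_cancel | is_site)

# Summing a boolean mask counts its True values.
totals = {
    "empty": int(is_empty.sum()),
    "cancel": int(is_cancel.sum()),
    "site": int(is_site.sum()),
    "other": int(is_other.sum()),
}
print(totals)
```

Each mask is computed once over the whole Series, so the Python-level loop over individual strings disappears. Whether that is actually faster for 20,000 short strings is something to measure rather than assume.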