I have a dataset like this (simplified):
foods_dict = {}
foods_dict['fruit'] = ['apple', 'orange', 'plum']
foods_dict['veg'] = ['cabbage', 'potato', 'carrot']
I have a list of items that I want to classify:
items = ['orange', 'potato', 'cabbage', 'plum', 'farmer', 'egg']
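I want to sort the entries of items into smaller collections according to which foods_dict list they appear in. I think these sub-collections should actually be sets, because I don't want any duplicates in them.
My first pass at the code was this: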
fruits = set()
veggies = set()
others = set()
for item in items:
    if item in foods_dict.get('fruit'):
        fruits.add(item)
    elif item in foods_dict.get('veg'):
        veggies.add(item)
    else:
        others.add(item)
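But this seems inefficient and unnecessarily verbose to me. My question is: how can this code be improved? I'm guessing list comprehensions could be useful here, but I'm not sure, given the number of lists.
Answer 0 (score: 5)
For an efficient solution, you want to avoid explicit loops as much as possible: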
items = set(items)
fruits = set(foods_dict['fruit']) & items
veggies = set(foods_dict['veg']) & items
others = items - fruits - veggies
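This is almost certainly faster than using explicit loops. In particular, if the fruit list is long, doing item in foods_dict['fruit'] is very time-consuming, because a membership test on a list scans it element by element.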
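To illustrate the point about membership tests: if you would rather keep the single-pass loop from the question, building the lookup sets once turns each "in" test into a hash lookup. This is only a minimal sketch; the function name classify and the two lookup variables are my own, not from the question.

```python
def classify(items, foods_dict):
    """Split items into (fruits, veggies, others) in a single pass."""
    # Build the lookup sets once, so every `in` test below is a hash
    # lookup rather than a linear scan over a list.
    fruit_lookup = set(foods_dict['fruit'])
    veg_lookup = set(foods_dict['veg'])

    fruits, veggies, others = set(), set(), set()
    for item in items:
        if item in fruit_lookup:
            fruits.add(item)
        elif item in veg_lookup:
            veggies.add(item)
        else:
            others.add(item)
    return fruits, veggies, others


foods_dict = {
    'fruit': ['apple', 'orange', 'plum'],
    'veg': ['cabbage', 'potato', 'carrot'],
}
items = ['orange', 'potato', 'cabbage', 'plum', 'farmer', 'egg']
fruits, veggies, others = classify(items, foods_dict)
```

Whether this beats the pure set operations depends on the input sizes, so it is worth timing on your own data.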
A very simple benchmark of the solutions posted so far:
In [5]: %%timeit
...: items2 = set(items)
...: fruits = set(foods_dict['fruit']) & items2
...: veggies = set(foods_dict['veg']) & items2
...: others = items2 - fruits - veggies
...:
1000000 loops, best of 3: 1.75 us per loop
In [6]: %%timeit
...: fruits = set()
...: veggies = set()
...: others = set()
...: for item in items:
...:     if item in foods_dict.get('fruit'):
...:         fruits.add(item)
...:     elif item in foods_dict.get('veg'):
...:         veggies.add(item)
...:     else:
...:         others.add(item)
...:
100000 loops, best of 3: 2.57 us per loop
In [7]: %%timeit
...: veggies = set(elem for elem in items if elem in foods_dict['veg'])
...: fruits = set(elem for elem in items if elem in foods_dict['fruit'])
...: others = set(items) - veggies - fruits
...:
100000 loops, best of 3: 3.34 us per loop
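Of course, you should run some tests with "real input" before choosing. I don't know how many elements your problem involves, and the timings may change a lot with larger inputs. In any case, my experience is that, at least in CPython, explicit loops tend to be slower than using only built-in operations.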
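If you are not working in IPython, the same kind of comparison can be run with the standard timeit module. The following is a rough sketch under my own assumptions: the input sizes are arbitrary placeholders, and the function names with_loop and with_set_ops are made up for this example. Substitute your real data before drawing conclusions.

```python
import timeit

# Placeholder input; replace with data that resembles your real case.
foods_dict = {
    'fruit': list(range(0, 1000, 2)),
    'veg': list(range(1, 1000, 2)),
}
items = list(range(5, 1500, 13))  # some values are in neither list


def with_loop():
    fruits, veggies, others = set(), set(), set()
    for item in items:
        if item in foods_dict['fruit']:
            fruits.add(item)
        elif item in foods_dict['veg']:
            veggies.add(item)
        else:
            others.add(item)
    return fruits, veggies, others


def with_set_ops():
    items2 = set(items)
    fruits = set(foods_dict['fruit']) & items2
    veggies = set(foods_dict['veg']) & items2
    others = items2 - fruits - veggies
    return fruits, veggies, others


# Check that both versions agree before comparing their speed.
same_result = with_loop() == with_set_ops()

# Keep the best of several repeats, as %timeit does.
timings = {
    func.__name__: min(timeit.repeat(func, number=10, repeat=3))
    for func in (with_loop, with_set_ops)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s for 10 calls")
```

Checking that the candidates return the same result first avoids benchmarking a version that is fast only because it is wrong.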
Edit2: an example with larger input:
In [9]: foods_dict = {}
...: foods_dict['fruit'] = list(range(0, 10000, 2))
...: foods_dict['veg'] = list(range(1, 10000, 2))
In [10]: items = list(range(5, 10000, 13)) #some odd some even
In [11]: %%timeit
...: fruits = set()
...: veggies = set()
...: others = set()
...: for item in items:
...:     if item in foods_dict.get('fruit'):
...:         fruits.add(item)
...:     elif item in foods_dict.get('veg'):
...:         veggies.add(item)
...:     else:
...:         others.add(item)
...:
10 loops, best of 3: 68.8 ms per loop
In [12]: %%timeit
...: veggies = set(elem for elem in items if elem in foods_dict['veg'])
...: fruits = set(elem for elem in items if elem in foods_dict['fruit'])
...: others = set(items) - veggies - fruits
...:
10 loops, best of 3: 99.9 ms per loop
In [13]: %%timeit
...: items2 = set(items)
...: fruits = set(foods_dict['fruit']) & items2
...: veggies = set(foods_dict['veg']) & items2
...: others = items2 - fruits - veggies
...:
1000 loops, best of 3: 445 us per loop
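As you can see, using only built-ins is far faster than the explicit loop here: 445 us versus 68.8 ms, roughly 150 times faster.
Answer 1 (score: 1)
This may do what you want (shown here for the vegetable case):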
veggies = set(elem for elem in items if elem in foods_dict['veg'])
More completely:
veggies = set(elem for elem in items if elem in foods_dict['veg'])
fruits = set(elem for elem in items if elem in foods_dict['fruit'])
others = set(items) - veggies - fruits
Answer 2 (score: 1)
Something like this (using only set operations and avoiding list comprehensions):
fruits = set(items).intersection(set(foods_dict['fruit']))
veggies = set(items).intersection(set(foods_dict['veg']))
others = set(items).difference(veggies.union(fruits))
If you can help it, you could start out with sets to avoid the set() conversions.
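As a small sketch of what starting out with sets could look like (assuming you control how foods_dict is built): the method forms intersection and difference accept any iterable, so items can stay a list.

```python
# Categories stored as sets from the start.
foods_dict = {
    'fruit': {'apple', 'orange', 'plum'},
    'veg': {'cabbage', 'potato', 'carrot'},
}
items = ['orange', 'potato', 'cabbage', 'plum', 'farmer', 'egg']

# The method forms take any iterable, so `items` needs no conversion here.
fruits = foods_dict['fruit'].intersection(items)
veggies = foods_dict['veg'].intersection(items)
# difference() accepts several arguments and removes all of them.
others = set(items).difference(fruits, veggies)
```

Note that the operator forms (&, -) require sets on both sides, which is why the method forms are used here.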
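Hope that helps!
Edit: It seems you care about efficiency or verbosity (and being "pythonic"). If you are concerned with efficiency, remember that between the bytecode compiler and the interpreter, you don't know which optimizations (if any) are being applied. Optimizing things at such a high level is usually hard. It is possible, but it needs some benchmarking first. If you are worried about being pythonic, I would try to be more high-level (can I say declarative here? Or are we not there yet? :)). In other words, instead of looping and telling Python exactly how it should decide which item goes where, I would try to be readable, clear, and concise. I think (since I wrote the code above) that this style tells the reader what you want to do with the list of items.
Hope this helps; all of this is just my opinion and should be taken with a grain of salt.
Answer 3 (score: 1)
Here is something more general, in case you have more categories. (That way there is no separate variable for each category.)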
from collections import defaultdict
foods_dict = {}
foods_dict['fruit'] = set(['apple', 'orange', 'plum'])
foods_dict['veg'] = set(['cabbage', 'potato', 'carrot'])
items = set(['orange', 'potato', 'cabbage', 'plum', 'farmer', 'egg'])
# Every element that belongs to some known category.
dict_items = set.union(*foods_dict.values())
assignments = defaultdict(set)
# 'other' holds the items that belong to no category.
assignments['other'] = items - dict_items
for key in foods_dict:
    assignments[key] = foods_dict[key] & items
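A possible variation on the same idea, in case the number of categories grows large: invert the mapping once (element to category name) and then make a single pass over the items. This is a sketch under my own assumptions; the function name assign, the default parameter, and the category_of dict are all made up for illustration, and it assumes each element belongs to at most one category.

```python
from collections import defaultdict


def assign(items, foods_dict, default='other'):
    """Group items by category; unknown items go under `default`."""
    # Invert the mapping once: element -> category name.
    # Assumes every element belongs to at most one category; if one
    # appears in several, the category iterated last wins.
    category_of = {
        element: category
        for category, elements in foods_dict.items()
        for element in elements
    }
    assignments = defaultdict(set)
    for item in items:
        assignments[category_of.get(item, default)].add(item)
    return dict(assignments)


foods_dict = {
    'fruit': ['apple', 'orange', 'plum'],
    'veg': ['cabbage', 'potato', 'carrot'],
}
items = ['orange', 'potato', 'cabbage', 'plum', 'farmer', 'egg']
assignments = assign(items, foods_dict)
```

Unlike the version above, a category with no matching items does not appear as a key in the result, so use assignments.get(key, set()) if you need every category present.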