I have a list of lists like this:
[
['number_one', 'number_two', 3, 'number_six', 'fruit_apple'],
['number_one', 'fruit_apple', 'number_two', 'number_four'],
['number_two', 'number_two', 'fruit_apple', 'number_three', 'number_four', 4],
['number_three', 'fruit_apple', 'number_two', 'number_three', 'number_four'],
['number_four', 'fruit_apple', 'number_a_two', 'number_a_three', 'number_five', 'number_two', 9, 'fruit_orange'],
...
]
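I want to replace all low-frequency elements (say, those that occur fewer than 2 times across all the lists) with placeholders. For example, an integer element that occurs fewer than 2 times is replaced with integer_placeholder, and a string element starting with number_ that occurs fewer than 2 times is replaced with stringnumber_placeholder. The same goes for fruits.
Expected result (with a threshold of 2):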
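Of course, this can be done by iterating over the lists at least twice with some nested loops. But is there a short, simple, and efficient way to do it?
Edit: my list contains 4881 sublists, and each sublist contains 992 elements on average.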
Answer 0 (score: 2)
I would use collections.Counter() to count the occurrences and then a loop to do the replacement. Something like this:
a = [
['number_one', 'number_two', 3, 'number_six'],
['number_one', 'number_two', 'number_four'],
['number_two', 'number_two', 'number_three', 'number_four', 4],
['number_three', 'number_two', 'number_three', 'number_four'],
['number_four', 'number_a_two', 'number_a_three', 'number_five', 'number_two', 9]
]
from collections import Counter as cC

c = cC(x for y in a for x in y)  # generator expression to flatten the list of lists
replacement = {int: 'integer_placeholder', str: 'string_placeholder'}  # placeholder per element type
thres = 2
for i, sub in enumerate(a):
    # keep elements seen at least thres times; replace the rest according to their type
    a[i] = [x if c[x] >= thres else replacement[type(x)] for x in sub]
If you want to avoid re-creating the sublists, as the list comprehension above does, you can do it @Alfe's way:
thres = 2
for sub in a:
    for j, item in enumerate(sub):
        if c[item] < thres:
            # overwrite the rare element in place, by index
            sub[j] = replacement[type(item)]
The replacement dict picks the placeholder based on the type of the item being replaced.
For the record, the above (with thres = 2) produces:
[['number_one', 'number_two', 'integer_placeholder', 'string_placeholder'],
['number_one', 'number_two', 'number_four'],
['number_two', 'number_two', 'number_three', 'number_four', 'integer_placeholder'],
['number_three', 'number_two', 'number_three', 'number_four'],
...]
Note the placeholders.
If you want more flexibility in assigning the right placeholder, you can use something like this:
def placeholder_selector(something):
    if 'fruits' in something:
        return 'fruit_placeholder'
    elif ...
        return 'string_placeholder'
    ....
and
for i, sub in enumerate(a):
    a[i] = [x if c[x] >= thres else placeholder_selector(x) for x in sub]
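The selector above is only outlined, so here is one way it could be filled in. This is a minimal sketch, not the answerer's actual code: the fruit_ prefix check and the isinstance tests are assumptions based on the sample data in the question, and the small data list is made up for illustration.

```python
from collections import Counter

def placeholder_selector(item):
    """Pick a placeholder from the item's type and prefix (rules assumed from the sample data)."""
    if isinstance(item, int):
        return 'integer_placeholder'
    if isinstance(item, str) and item.startswith('fruit_'):
        return 'fruit_placeholder'
    return 'string_placeholder'

# Hypothetical sample data, shaped like the question's list of lists.
data = [
    ['number_one', 'number_two', 3, 'fruit_apple'],
    ['number_one', 'number_two', 'fruit_orange', 'number_six'],
]
thres = 2
counts = Counter(x for sub in data for x in sub)
result = [[x if counts[x] >= thres else placeholder_selector(x) for x in sub]
          for sub in data]
print(result)
```

Checking the type first avoids a TypeError that a bare substring test would raise on integer elements.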
Answer 1 (score: 1)
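Do it in three steps. Step 1: iterate through all the lists and count how often each element occurs. Step 2: collect all the elements with a low count. Step 3: go through all the lists again and replace the low-count elements.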
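from collections import defaultdict

# Step 1: count how often each element occurs across all sublists
counter = defaultdict(int)
for sublist in sublists:
    for element in sublist:
        counter[element] += 1

# Step 2: collect the elements whose count is below the threshold
low_count_elements = {element
                      for element, count in counter.iteritems()
                      if count < 5}

# Step 3: replace the low-count elements in place
for sublist in sublists:
    for i in range(len(sublist)):
        if sublist[i] in low_count_elements:
            sublist[i] = placeholder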
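I would suggest replacing the variable name sublist with something more fitting, but lacking context I cannot come up with a better name.
If you are using Python 3, use items() instead of iteritems().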
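For readers on Python 3, the three steps could be wrapped up roughly as follows. This is a sketch under assumptions: the function name replace_rare and its threshold and placeholder parameters are made up for illustration, and collections.Counter is swapped in for the defaultdict used above.

```python
from collections import Counter

def replace_rare(sublists, threshold, placeholder):
    """Replace, in place, every element seen fewer than `threshold` times overall."""
    # Step 1: count every element across all sublists.
    counter = Counter(element for sublist in sublists for element in sublist)
    # Step 2: collect the low-count elements in a set for fast membership tests.
    low_count_elements = {element
                          for element, count in counter.items()
                          if count < threshold}
    # Step 3: overwrite the low-count elements by index.
    for sublist in sublists:
        for i, element in enumerate(sublist):
            if element in low_count_elements:
                sublist[i] = placeholder
    return sublists

# Hypothetical sample data.
sublists = [['a', 'b', 1], ['a', 'c', 1], ['a', 2]]
replace_rare(sublists, threshold=2, placeholder='placeholder')
print(sublists)
```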
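Regarding your concern for short, simple, and efficient: with Python, I dare you to find a shorter or simpler solution. Returning a copy instead of replacing the elements in place will not be any more efficient. The task you are asking about inherently requires iterating twice, and I see no way to optimize that away with a cleverer algorithm. If we are talking about a huge number of elements, you could use a subsample of your data to find out which elements are frequent. That can be a practical optimization step, though of course it introduces a small error. And of course you can be more efficient if you use a specialized library like numpy, which can perform many of the steps in optimized C instead of explicit Python.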
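The subsampling idea could look roughly like this. It is only a sketch: the function name estimate_frequent, the seed parameter, and the way the threshold is scaled to the sampled fraction are all assumptions made for illustration, and the result is an estimate that can misclassify elements near the threshold.

```python
import random
from collections import Counter

def estimate_frequent(sublists, threshold, sample_size, seed=0):
    """Estimate which elements reach `threshold` by counting a random sample of sublists."""
    if not sublists:
        return set()
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sample = rng.sample(sublists, min(sample_size, len(sublists)))
    counts = Counter(x for sub in sample for x in sub)
    # Scale the threshold down to the sampled fraction (never below 1 occurrence).
    scaled = max(1, threshold * len(sample) / len(sublists))
    return {x for x, n in counts.items() if n >= scaled}

# Hypothetical sample data; sampling everything reproduces the exact answer.
data = [['a', 'b'], ['a', 'c'], ['a', 'b']]
frequent = estimate_frequent(data, threshold=2, sample_size=len(data))
print(frequent)
```

Everything not in the returned set would then be replaced in a single pass over the full data.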
Answer 2 (score: 1)
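If you have only a few low-frequency elements, remembering their positions can be an optimization over iterating through everything twice. The second pass then only visits those few elements.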
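from collections import defaultdict

low_counter = defaultdict(set)  # element -> set of (i, j) positions seen so far
high_set = set()                # elements known to have reached the threshold
for i, sublist in enumerate(sublists):
    for j, element in enumerate(sublist):
        if element not in high_set:
            c = len(low_counter[element])
            if c + 1 < 5:
                low_counter[element].add((i, j))
            elif c + 1 == 5:
                # threshold reached: stop tracking positions for this element
                del low_counter[element]
                high_set.add(element)
# only the remembered positions of low-count elements are visited here
for positions in low_counter.values():
    for i, j in positions:
        sublists[i][j] = placeholder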
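But of course, beyond a certain number of such elements (for example, if almost every element has to be replaced), this becomes less efficient.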
Answer 3 (score: 0)
A solution using collections.Counter (a dict subclass):
import collections, itertools

l = [
['number_one', 'number_two', 3, 'number_six', 'fruit_apple'],
['number_one', 'fruit_apple', 'number_two', 'number_four'],
['number_two', 'number_two', 'fruit_apple', 'number_three', 'number_four', 4],
['number_three', 'fruit_apple', 'number_two', 'number_three', 'number_four'],
['number_four', 'fruit_apple', 'number_a_two', 'number_a_three', 'number_five', 'number_two', 9, 'fruit_orange'],
]
limit = 2
counts = collections.Counter(itertools.chain.from_iterable(l))
for sub_l in l:
    for k, i in enumerate(sub_l):
        # keep elements seen at least `limit` times; otherwise pick a placeholder by type
        sub_l[k] = i if counts[i] >= limit else ('integer_placeholder' if isinstance(i, int) else 'stringnumber_placeholder')
print(l)
Output:
[['number_one', 'number_two', 'integer_placeholder', 'stringnumber_placeholder', 'fruit_apple'], ['number_one', 'fruit_apple', 'number_two', 'number_four'], ['number_two', 'number_two', 'fruit_apple', 'number_three', 'number_four', 'integer_placeholder'], ['number_three', 'fruit_apple', 'number_two', 'number_three', 'number_four'], ['number_four', 'fruit_apple', 'stringnumber_placeholder', 'stringnumber_placeholder', 'stringnumber_placeholder', 'number_two', 'integer_placeholder', 'stringnumber_placeholder']]