Efficiently replace low-frequency elements in a list

Time: 2017-07-19 08:41:43

Tags: python

I have a list that looks like this:

[
 ['number_one', 'number_two', 3, 'number_six', 'fruit_apple'],
 ['number_one', 'fruit_apple', 'number_two', 'number_four'],
 ['number_two', 'number_two', 'fruit_apple', 'number_three', 'number_four', 4],
 ['number_three', 'fruit_apple', 'number_two', 'number_three', 'number_four'],
 ['number_four', 'fruit_apple', 'number_a_two', 'number_a_three', 'number_five', 'number_two', 9, 'fruit_orange'],
 ...
]

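I want to replace all low-frequency elements (say, those that occur fewer than 2 times across all the lists) with placeholders. For example, if an integer element occurs fewer than 2 times, it is replaced with integer_placeholder, and if a string element starting with number_ occurs fewer than 2 times, it is replaced with stringnumber_placeholder. The same goes for fruits.

Expected result (with a threshold of 2):

[expected output missing from this copy]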

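Of course, this can be done by iterating over the lists at least twice with some nested loops. But is there a short, simple, and efficient way to do it?

Edit: my list contains 4881 sublists, and each sublist contains 992 elements on average.

4 Answers:

Answer 0: (score: 2)

I would use collections.Counter() to count the occurrences and then a loop to do the replacement. Something like this: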

a = [
 ['number_one', 'number_two', 3, 'number_six'],
 ['number_one', 'number_two', 'number_four'],
 ['number_two', 'number_two', 'number_three', 'number_four', 4],
 ['number_three', 'number_two', 'number_three', 'number_four'],
 ['number_four', 'number_a_two', 'number_a_three', 'number_five', 'number_two', 9]
]

from collections import Counter as cC

c = cC(x for y in a for x in y)  # generator expression to flatten the list of lists

replacement = {int: 'integer_placeholder', str: 'string_placeholder'}

thres = 2
for i, sub in enumerate(a):
    a[i] = [x if c[x] >= thres else replacement[type(x)] for x in sub]

If you want to avoid recreating the sublists, as the list comprehension above does, you can do it @Alfe's way and replace the elements in place:

thres = 2
for sub in a:
    for j, item in enumerate(sub):
        if c[item] < thres:  # c is the Counter built above
            sub[j] = replacement[type(item)]  # overwrite in place

The replacement is looked up in the replacement dict, keyed on the type of the item being replaced.

For the record, the above (with thres = 2) produces

[['number_one', 'number_two', 'integer_placeholder', 'string_placeholder'],
 ['number_one', 'number_two', 'number_four'],
 ['number_two', 'number_two', 'number_three', 'number_four', 'integer_placeholder'],
 ['number_three', 'number_two', 'number_three', 'number_four'],
 ...]

Note the placeholders.

If you want more flexibility in assigning the right placeholder, you can use something like this:

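def placeholder_selector(something):
    # handle ints first: they do not support string tests
    if isinstance(something, int):
        return 'integer_placeholder'
    elif something.startswith('fruit_'):
        return 'fruit_placeholder'
    # add further elif branches here for other categories
    else:
        return 'string_placeholder'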

for i, sub in enumerate(a):
    a[i] = [x if c[x] >= thres else placeholder_selector(x) for x in sub]
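
Putting the pieces of this answer together, a minimal end-to-end sketch could look like the following. The names replace_rare and placeholder_for are made up for illustration, and the prefix rules are an assumption based on the number_ and fruit_ naming in the question.

```python
from collections import Counter


def placeholder_for(item):
    """Pick a placeholder by category (hypothetical rules, see lead-in)."""
    if isinstance(item, int):
        return 'integer_placeholder'
    if item.startswith('fruit_'):
        return 'fruit_placeholder'
    return 'stringnumber_placeholder'


def replace_rare(lists, thres=2):
    """Return new sublists with elements seen fewer than `thres` times replaced."""
    # First pass: count every element across all sublists.
    counts = Counter(x for sub in lists for x in sub)
    # Second pass: keep frequent elements, swap the rest for a placeholder.
    return [[x if counts[x] >= thres else placeholder_for(x) for x in sub]
            for sub in lists]


data = [
    ['number_one', 'number_two', 3, 'fruit_apple'],
    ['number_one', 'fruit_apple', 'number_two', 'fruit_orange'],
]
result = replace_rare(data)
print(result)
```

Here 3 and 'fruit_orange' each occur once, so they are the only elements replaced; the input list itself is left untouched because the function builds new sublists.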

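Answer 1: (score: 1)

Do it in three steps. Step 1: iterate over all the lists and count how often each element occurs. Step 2: filter out all the elements with a low count. Step 3: iterate over all the lists again and replace the low-count elements.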

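from collections import defaultdict

# Step 1: count how often each element occurs
counter = defaultdict(int)
for sublist in sublists:
  for element in sublist:
    counter[element] += 1
# Step 2: collect the elements with a low count
# (iteritems() is Python 2; use items() on Python 3)
low_count_elements = { element
  for element, count in counter.iteritems()
  if count < 5 }
# Step 3: replace the low-count elements in place
for sublist in sublists:
  for i in range(len(sublist)):
    if sublist[i] in low_count_elements:
      sublist[i] = placeholder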

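I would suggest replacing the variable name sublist with something more fitting, but without more context I cannot come up with a better name.

If you are using Python 3, use items() instead of iteritems().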

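Regarding your concern about short, simple, and efficient: in Python, I dare you to find a shorter and simpler solution. Returning copies instead of replacing the elements in place will not be more efficient. The task you are asking for inherently requires iterating twice, and I cannot see a way to optimize that away with a cleverer algorithm. If we are talking about a massive number of elements, you could use a subsample of your data to find out which elements are frequent. That can be a practical optimization step, at the cost of a small error. And of course, if you use a specialized library like numpy, you can be more efficient, because it performs many of the steps in optimized C instead of explicit Python.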
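
As a rough sketch of the subsampling idea, assuming you are willing to accept some misclassification: count only a random subset of the sublists and scale the threshold by the sampling fraction. The function names, the fraction and seed parameters, and the scaling rule are all assumptions made for illustration.

```python
import random
from collections import Counter


def frequent_from_sample(sublists, thres, fraction, seed=0):
    """Estimate the set of frequent elements from a random subset of sublists.

    Hypothetical helper: the threshold is scaled by the sampling fraction,
    so elements close to the threshold may be misclassified.
    """
    if not sublists:
        return set()
    k = max(1, int(len(sublists) * fraction))
    sample = random.Random(seed).sample(sublists, k)
    counts = Counter(x for sub in sample for x in sub)
    scaled = thres * k / len(sublists)
    return {x for x, n in counts.items() if n >= scaled}


def replace_rare_approx(sublists, thres, fraction, placeholder='placeholder'):
    """Replace, in place, every element not estimated to be frequent."""
    frequent = frequent_from_sample(sublists, thres, fraction)
    for sub in sublists:
        for j, item in enumerate(sub):
            if item not in frequent:
                sub[j] = placeholder


data = [['a', 'b'], ['a', 'c'], ['a', 'b']]
replace_rare_approx(data, thres=2, fraction=1.0)
print(data)
```

With fraction=1.0 every sublist is counted and the result is exact; smaller fractions reduce the counting work but can replace elements that are actually frequent. The replacement pass still visits every element either way.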

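Answer 2: (score: 1)

If you have only a few low-frequency elements, remembering their positions can be an optimization compared with iterating over everything twice. The second pass then visits only those few elements.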

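from collections import defaultdict

low_counter = defaultdict(set)  # element -> positions (i, j) seen so far
high_set = set()                # elements that have reached the threshold
for i, sublist in enumerate(sublists):
  for j, element in enumerate(sublist):
    if element not in high_set:
      c = len(low_counter[element])
      if c + 1 < 5:
        low_counter[element].add((i, j))
      elif c + 1 == 5:
        del low_counter[element]
        high_set.add(element)
# revisit only the remembered positions of the low-frequency elements
for positions in low_counter.values():
  for i, j in positions:
    sublists[i][j] = placeholder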

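But of course, beyond a certain number of such elements (for example, if nearly every element has to be replaced), this becomes less efficient.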
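
For a quick check of the position-tracking idea, here is a small parameterized sketch with a usage example. The function name replace_rare_by_position and its thres and placeholder parameters are hypothetical; the answer above hard-codes the threshold as 5.

```python
from collections import defaultdict


def replace_rare_by_position(sublists, thres, placeholder):
    """Replace in place every element that occurs fewer than `thres` times."""
    low_positions = defaultdict(list)  # element -> positions seen so far
    frequent = set()                   # elements known to reach the threshold
    for i, sublist in enumerate(sublists):
        for j, element in enumerate(sublist):
            if element in frequent:
                continue
            low_positions[element].append((i, j))
            if len(low_positions[element]) >= thres:
                del low_positions[element]  # no longer a candidate
                frequent.add(element)
    # Only the remembered positions of rare elements are revisited.
    for positions in low_positions.values():
        for i, j in positions:
            sublists[i][j] = placeholder


data = [['a', 'b', 1], ['a', 'c'], ['a', 'b']]
replace_rare_by_position(data, thres=2, placeholder='placeholder')
print(data)
```

Only 1 and 'c' occur fewer than twice here, so the second loop touches exactly two positions instead of all seven elements.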

Answer 3: (score: 0)

A solution using the collections.Counter subclass:

import collections, itertools

l = [
 ['number_one', 'number_two', 3, 'number_six', 'fruit_apple'],
 ['number_one', 'fruit_apple', 'number_two', 'number_four'],
 ['number_two', 'number_two', 'fruit_apple', 'number_three', 'number_four', 4],
 ['number_three', 'fruit_apple', 'number_two', 'number_three', 'number_four'],
 ['number_four', 'fruit_apple', 'number_a_two', 'number_a_three', 'number_five', 'number_two', 9, 'fruit_orange'],
]

limit = 2
counts = collections.Counter(itertools.chain.from_iterable(l))
for sub_l in l:
    for k, i in enumerate(sub_l):
        # keep elements seen at least `limit` times; otherwise pick a placeholder by type
        sub_l[k] = i if counts[i] >= limit else ('integer_placeholder' if isinstance(i, int) else 'stringnumber_placeholder')

print(l)

Output:

[['number_one', 'number_two', 'integer_placeholder', 'stringnumber_placeholder', 'fruit_apple'], ['number_one', 'fruit_apple', 'number_two', 'number_four'], ['number_two', 'number_two', 'fruit_apple', 'number_three', 'number_four', 'integer_placeholder'], ['number_three', 'fruit_apple', 'number_two', 'number_three', 'number_four'], ['number_four', 'fruit_apple', 'stringnumber_placeholder', 'stringnumber_placeholder', 'stringnumber_placeholder', 'number_two', 'integer_placeholder', 'stringnumber_placeholder']]