Question

I am trying to measure the redundancy rate of a list.

For example:

L = [a, a, a, a] => redundancy rate = 1

L = [a, b, c, d] => redundancy rate = 0

L = [a, a, b, b] => redundancy rate = 0.5

I have not been able to find a meaningful way to do this.

Answer 1

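Define redundancy as 1 - num_unique_elements / num_total_elements. I am assuming you meant that the redundancy of a list is never exactly 1: a non-empty list always contains at least one unique element, so this measure stays below 1. For example: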

lsts = [[1, 1, 1, 1], [1, 1, 2, 2], [1, 2, 3, 4]]
for lst in lsts:
    redundancy = 1 - len(set(lst)) / len(lst)
    print(redundancy)

# 0.75
# 0.5
# 0.0
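The same formula can be wrapped in a small function. This is only a sketch: the name `redundancy` and the choice to return 0.0 for an empty list are assumptions, not part of the answer above. It guards against the division by zero that `len(lst)` would otherwise cause on an empty list.

```python
def redundancy(items):
    """Return 1 - (unique count / total count) for a list of hashable items."""
    if not items:
        # Assumption: an empty list is treated as having no redundancy.
        return 0.0
    return 1 - len(set(items)) / len(items)


rates = [redundancy(lst) for lst in ([1, 1, 1, 1], [1, 1, 2, 2], [1, 2, 3, 4], [])]
print(rates)  # [0.75, 0.5, 0.0, 0.0]
```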

Answer 2

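Thanks to Timur Shtatland's comment, I came up with a procedure that matches the concept given and then optimized it. One thing to mention: it gives a redundancy of 0.75 for your first test case, because only 75% of the list is redundant. That seems to be what you meant, but let me know if it is not.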

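L = ['a', 'a', 'a', 'a']  # example input: the first test case from the question

# Collect each distinct item once, in first-seen order
unique = []

for item in L:
    if item not in unique:
        unique.append(item)

redundancy = 1 - len(unique) / len(L)  # 0.75 for this input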

Edit: as shown in Timur's answer, using a set to define unique is cleaner than writing a for loop.
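One case where the loop version may still be useful is a list whose items are unhashable (for example, a list of lists), since `set()` raises a `TypeError` on those. The sketch below illustrates that; the function name `redundancy_any` and the empty-list fallback are assumptions for illustration.

```python
def redundancy_any(items):
    """Loop-based redundancy; only needs ==, so unhashable items also work."""
    unique = []
    for item in items:
        if item not in unique:
            unique.append(item)
    # Assumption: an empty list is treated as having no redundancy.
    return 1 - len(unique) / len(items) if items else 0.0


rows = [[1, 2], [1, 2], [3, 4], [3, 4]]  # set(rows) would raise TypeError
print(redundancy_any(rows))  # 0.5
```

The trade-off is speed: `item not in unique` scans a list, so the loop is quadratic in the worst case, while the set version is roughly linear.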

Answer 3

Although the output matches the values in the question, I am not sure this is a valid measure. Perhaps min would be better than mean.

import pandas as pd
l1 = ['a', 'a', 'a', 'a']
l2 = ['a', 'b', 'c', 'd']
l3 = ['a', 'a', 'b', 'b']

def f(l):
    s = pd.Series(l)
    ratio = s.value_counts() / len(l)
    redundantContent = s[s.duplicated(keep='first')]
    if not redundantContent.empty:
        return redundantContent.map(ratio).mean()
    else:
        return 0

print("redundancy rate of l1: {}".format(f(l1)))
print("redundancy rate of l2: {}".format(f(l2)))
print("redundancy rate of l3: {}".format(f(l3)))

Output

redundancy rate of l1: 1.0
redundancy rate of l2: 0
redundancy rate of l3: 0.5
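To see how mean and min differ, here is a minimal sketch of the same idea without pandas. It assumes the measure is "one frequency ratio per repeated occurrence, then aggregate", which is how the function above behaves; the name `redundancy_by_ratio` and its `aggregate` parameter are hypothetical.

```python
from collections import Counter
from statistics import mean


def redundancy_by_ratio(items, aggregate=mean):
    """Aggregate count/len ratios over every occurrence after the first."""
    counts = Counter(items)
    n = len(items)
    # One ratio per repeated occurrence, mirroring duplicated(keep='first')
    ratios = [count / n for count in counts.values() for _ in range(count - 1)]
    return aggregate(ratios) if ratios else 0


sample = ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'd']
print(redundancy_by_ratio(sample))       # 0.4375
print(redundancy_by_ratio(sample, min))  # 0.25
```

With mean, the heavily repeated 'a' dominates the result; with min, the least frequent repeated item ('b') determines it. Both give the same values on the three lists from the question.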

Python: computing the redundancy rate of a list
