Question

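I'm writing a program that identifies duplicate values in a particular column (named "StrId") of an Excel spreadsheet. Besides finding the duplicates, I also need to know how many times each value occurs.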

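The Excel data is processed into a list of dictionaries (one dictionary per row), with the headers as keys and the cell data as values, e.g. [{'StrId': 1, 'ProjId': 358}] [{'StrId': 2, 'ProjId': 984 ......}] and so on.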

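My plan is to first pick out the 'StrId' key in each dictionary and put its values in a list, then build another dictionary from that list that maps each value to its count, so I can separate out the values that occur more than once.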

Here is my code. Right now it raises a KeyError on the first value and then stops.

I'd appreciate any help. Thanks in advance.

from openpyxl import load_workbook
workbook = load_workbook('./fullallreadyconversionxmlclean4.xlsx')
sheet = workbook['Full-All']
headers = ["StrId", "ProjectId", "TweetText", "Label"]

excel_data = []
for row_num, row in enumerate(sheet):
    if row_num is 0:
        continue
    row_data = {}
    for col_num, cell in enumerate(row):
        if col_num > len(headers) - 1:
            continue
        key = headers[col_num]
        value = cell.value
        row_data[key] = value
    excel_data.append(row_data)    


for row in excel_data:
    for key in row:    
        if key is 'StrId':
            value = row[key]
            list_ids = []
            list_ids.append(value)

            dup_dic = {}           
            for  value in list_ids:
                if value in list_ids:
                    dup_dic[value] +=1
                else:
                    dup_dic[value] =1                

                print dup_dic

Answer 1

You can use Python's Counter. I'm assuming excel_data is structured as a list of lists, each containing one dictionary; let me know if that's not the case.

from collections import Counter

excel_data = [
    [{'StrId': 1, 'ProjId': 358}],
    [{'StrId': 2, 'ProjId': 984}],
    [{'StrId': 2, 'ProjId': 984}],
    [{'StrId': 2, 'ProjId': 984}],
]

# create a list of all values
flattened_values = [list_dict[0]['StrId'] for list_dict in excel_data]

# pass them to counter to get a dict of value to count
counter = Counter(flattened_values)  # Counter({2: 3, 1: 1})

# use dictionary comprehension to create a dict from this counter with only
# values with count > 1 to find duplicates
repetitions = {
    val: count for val, count in counter.items() if count > 1
}  # {2: 3}
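
If excel_data is instead a flat list of dicts, which is what the loader in the question appears to build, the same idea can be sketched as below. The sample rows are made up for illustration, and the names str_id_counts and duplicates are not from the original post.

```python
from collections import Counter

# Hypothetical sample rows, shaped like the question's excel_data
excel_data = [
    {'StrId': 1, 'ProjectId': 358},
    {'StrId': 2, 'ProjectId': 984},
    {'StrId': 2, 'ProjectId': 984},
    {'StrId': 2, 'ProjectId': 984},
]

# Count every StrId in one pass over the rows
str_id_counts = Counter(row['StrId'] for row in excel_data)

# Keep only the values that occur more than once
duplicates = {val: n for val, n in str_id_counts.items() if n > 1}
print(duplicates)  # {2: 3}
```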

Answer 2

If the sublists can contain more than one dict, you can use itertools.chain to flatten them:

from collections import Counter
excel_data = [
    [{'StrId': 1, 'ProjId': 358},{'StrId': 5, 'ProjId': 358}],
    [{'StrId': 2, 'ProjId': 984},{'StrId': 3, 'ProjId': 358}],
    [{'StrId': 2, 'ProjId': 984}],
    [{'StrId': 2, 'ProjId': 984}],
]

from itertools import chain
from operator import itemgetter
print(Counter(map(itemgetter("StrId"), chain(*excel_data))))

But you seem to have a list of dicts, so you can drop the chain:

from collections import Counter
from operator import itemgetter

print(Counter(map(itemgetter("StrId"), excel_data)))

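Never use is to compare strings: is checks object identity. Use == instead, i.e. if key == 'StrId'. It would make even more sense to do a direct lookup, i.e. value = row["StrId"]. Also give your variables better names; row is not a very good name for a dict.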
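
A minimal sketch of the direct lookup, assuming a flat list of dicts. The sample data, the row without a StrId, and the names records and str_ids are hypothetical. Using dict.get is one way to tolerate rows where the key is missing.

```python
# Hypothetical rows; the second one has no 'StrId' key
records = [
    {'StrId': 1, 'ProjectId': 358},
    {'ProjectId': 984},
    {'StrId': 2, 'ProjectId': 984},
]

str_ids = []
for record in records:
    # Look the key up directly instead of looping over every key
    str_id = record.get('StrId')  # returns None when the key is absent
    if str_id is not None:
        str_ids.append(str_id)

# Equality (==) compares contents, which is what a key check needs
built_key = ''.join(['Str', 'Id'])
keys_match = built_key == 'StrId'

print(str_ids)  # [1, 2]
print(keys_match)  # True
```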

Answer 3

Here is one possible solution:

from collections import defaultdict

excel_data = [
    {'StrId': 2, 'ProjId': 984},
    {'StrId': 2, 'ProjId': 984},
    {'StrId': 2, 'ProjId': 984},
    {'StrId': 2, 'ProjId': 984},
    {'StrId': 1, 'ProjId': 358},
    {'StrId': 1, 'ProjId': 358},
    {'StrId': 1, 'ProjId': 358},
    {'StrId': 2, 'ProjId': 984},
    {'StrId': 1, 'ProjId': 358},
]

output = defaultdict(int)

for row in excel_data:
    if 'StrId' in row:
        output[row['StrId']] += 1

print(output)

If you have questions about the code above, take a look at collections.defaultdict.
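
For context, here is a small sketch of why defaultdict(int) helps. Incrementing a missing key in a plain dict raises KeyError, which is likely the error in the question, whereas defaultdict(int) starts missing keys at 0. The sample values and variable names are illustrative only.

```python
from collections import defaultdict

# A plain dict raises KeyError when a missing key is incremented
plain = {}
try:
    plain['a'] += 1
except KeyError:
    raised = True
else:
    raised = False

# defaultdict(int) creates missing keys with the value 0 on first access
counts = defaultdict(int)
for value in [2, 2, 1, 2]:
    counts[value] += 1

# Keep only the values seen more than once
duplicates = {val: n for val, n in counts.items() if n > 1}

print(raised)        # True
print(dict(counts))  # {2: 3, 1: 1}
print(duplicates)    # {2: 3}
```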

Counting duplicates in a list using a dictionary [duplicate]
