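I'm writing a program that identifies duplicate values in a particular column (named "StrId") of an Excel spreadsheet, along with their counts. Besides finding the duplicates, I also need to know how many times each value is repeated.
The Excel data is processed into a list of dicts (one dict per row), with the headers as keys and the cell data as values, such as [{'StrId':1,'ProjId':358}] [{'StrId': 2,'ProjId':984 ......}] and so on.
My plan is to first pick out the 'StrId' key in each dict and collect its values in a list, then build another dict from that list that maps each value to its count, so I can single out the values that occur more than once.
Here is my code. Right now it raises a KeyError on the first value and then stops.
I'd appreciate any help. Thanks in advance.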
    from openpyxl import load_workbook

    workbook = load_workbook('./fullallreadyconversionxmlclean4.xlsx')
    sheet = workbook['Full-All']
    headers = ["StrId", "ProjectId", "TweetText", "Label"]
    excel_data = []
    for row_num, row in enumerate(sheet):
        if row_num is 0:
            continue
        row_data = {}
        for col_num, cell in enumerate(row):
            if col_num > len(headers) - 1:
                continue
            key = headers[col_num]
            value = cell.value
            row_data[key] = value
        excel_data.append(row_data)

    for row in excel_data:
        for key in row:
            if key is 'StrId':
                value = row[key]
                list_ids = []
                list_ids.append(value)

    dup_dic = {}
    for value in list_ids:
        if value in list_ids:
            dup_dic[value] +=1
        else:
            dup_dic[value] =1
    print dup_dic
Answer 0 (score: 1)
You can use Python's Counter. I'm assuming your excel_data is structured as a list of lists, each containing a single dict; if that's not the case, let me know.
    from collections import Counter

    excel_data = [
        [{'StrId': 1, 'ProjId': 358}],
        [{'StrId': 2, 'ProjId': 984}],
        [{'StrId': 2, 'ProjId': 984}],
        [{'StrId': 2, 'ProjId': 984}],
    ]

    # create a list of all values
    flattened_values = [list_dict[0]['StrId'] for list_dict in excel_data]

    # pass them to Counter to get a dict of value to count
    counter = Counter(flattened_values)  # Counter({2: 3, 1: 1})

    # use a dict comprehension to create a dict from this counter with only
    # values with count > 1 to find duplicates
    # (.items() works on both Python 2 and 3; .iteritems() is Python 2 only)
    repetitions = {
        val: count for val, count in counter.items() if count > 1
    }  # {2: 3}
Answer 1 (score: 1)
If the sublists can contain more than one dict, you can use itertools.chain to flatten them:
    from collections import Counter
    from itertools import chain
    from operator import itemgetter

    excel_data = [
        [{'StrId': 1, 'ProjId': 358}, {'StrId': 5, 'ProjId': 358}],
        [{'StrId': 2, 'ProjId': 984}, {'StrId': 3, 'ProjId': 358}],
        [{'StrId': 2, 'ProjId': 984}],
        [{'StrId': 2, 'ProjId': 984}],
    ]

    # chain(*excel_data) yields every dict from every sublist;
    # itemgetter("StrId") pulls the StrId value out of each one
    print(Counter(map(itemgetter("StrId"), chain(*excel_data))))
But you seem to have a flat list of dicts, so you can drop the chain:
    from collections import Counter
    from operator import itemgetter

    # excel_data here is a flat list of dicts, one per row
    print(Counter(map(itemgetter("StrId"), excel_data)))
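When comparing strings, never use is, which checks object identity; use ==, i.e. if key == 'StrId'.
But it would make more sense to just do a lookup, i.e. value = row["StrId"]. Also give your variables better names; row is not a very good name for a dict.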
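A minimal sketch of both points, assuming a flat list of dicts like the one the question builds. The sample rows and the names records, str_ids, and counts are made up for illustration. It also shows one way around the KeyError: in the question's loop, `if value in list_ids` is always true for a value taken from list_ids, so `dup_dic[value] += 1` runs on a key that does not exist yet.

```python
# Sample rows are hypothetical; real data would come from the spreadsheet.
records = [
    {'StrId': 1, 'ProjId': 358},
    {'StrId': 2, 'ProjId': 984},
    {'StrId': 2, 'ProjId': 984},
]

# Look the key up directly instead of looping over every key
# and comparing each one with `is`.
str_ids = [record['StrId'] for record in records]

# Count with a plain dict; .get() supplies 0 for values not seen yet,
# so incrementing never hits a missing key.
counts = {}
for str_id in str_ids:
    counts[str_id] = counts.get(str_id, 0) + 1

# Keep only the values that occur more than once.
duplicates = {str_id: n for str_id, n in counts.items() if n > 1}
print(duplicates)
```

Note that the list is built once, outside any per-row reset, so every row's StrId ends up in it.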
Answer 2 (score: 0)
Here is one possible solution:
    from collections import defaultdict

    excel_data = [
        {'StrId': 2, 'ProjId': 984},
        {'StrId': 2, 'ProjId': 984},
        {'StrId': 2, 'ProjId': 984},
        {'StrId': 2, 'ProjId': 984},
        {'StrId': 1, 'ProjId': 358},
        {'StrId': 1, 'ProjId': 358},
        {'StrId': 1, 'ProjId': 358},
        {'StrId': 2, 'ProjId': 984},
        {'StrId': 1, 'ProjId': 358},
    ]

    # defaultdict(int) starts every missing key at 0
    output = defaultdict(int)
    for row in excel_data:
        if 'StrId' in row:
            output[row['StrId']] += 1

    print(output)
If you have questions about the code above, take a look at collections.defaultdict.
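To illustrate what defaultdict does here, a small sketch (the sample values and variable names are hypothetical): a plain dict raises KeyError when you increment a missing key, while defaultdict(int) first creates the entry with int(), i.e. 0. Filtering at the end keeps only the values that actually repeat.

```python
from collections import defaultdict

str_ids = [2, 2, 1, 3, 2, 1]  # hypothetical StrId values

# A plain dict has no entry to increment yet, so += raises KeyError.
plain_counts = {}
try:
    plain_counts[str_ids[0]] += 1
except KeyError as err:
    print('plain dict failed on key:', err)

# defaultdict(int) calls int() for a missing key, so counting starts at 0.
counts = defaultdict(int)
for str_id in str_ids:
    counts[str_id] += 1

# Keep only the values that occur more than once.
duplicates = {str_id: n for str_id, n in counts.items() if n > 1}
print(duplicates)
```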