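I am currently facing a problem: I need to merge all the data shown in the image below into a single row.
So I tried to write a parsing script with Python and Openpyxl that reads the rows and copies a value into a new workbook only if it is non-null and not a duplicate.
I get an out-of-range error, and the code keeps more than just the data I want. I have already spent several hours on this, so I thought I would ask here rather than stay stuck.
I have read some of the Openpyxl documentation and some material about creating lists in Python, and I tried a few videos on YouTube, but none of them did quite what I need.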
import openpyxl
from openpyxl import Workbook
path = "sample.xlsx"
wb = openpyxl.load_workbook(path)
ws = wb.active
path2 = "output.xlsx"
wb2 = Workbook()
ws2 = wb2.active
listab = []
rows = ws.max_row
columns = ws.max_column
for i in range (1, rows+1):
    listab.append([])
cellValue = " "
prevCell = " "
for c in range (1, rows+1):
    for r in range(1, columns+1):
        cellValue = ws.cell(row=r, column=c).value
        if cellValue == prevCell:
            listab[r-1].append(prevCell)
        elif cellValue == "NULL":
            listab[r-1].append(prevCell)
        elif cellValue != prevCell:
            listab[r-1].append(cellValue)
            prevCell = cellValue
for r in range(1, rows+1):
    for c in range (1, columns+1):
        j = ws2.cell(row = r, column=c)
        j.value = listab[r-1][c-1]
print(listab)
wb2.save("output.xlsx")
The result should be a single row containing the following information:
ods_service_id | service_name | service_plan_name | CPU | RAM | NIC | DRIVE |
Answer 0: (score: 1)
Personally, I would go with pandas.
import pandas as pd
#Loading into pandas
df_data = pd.read_excel('sample.xlsx')
df_data = df_data.fillna("NO DATA")  # Replace NaN values with "NO DATA"
# Note: the header in the question reads ods_service_id; adjust the column name to match your sheet
unique_ids = df_data.ods_service_ids.unique()
#Storing the DataFrame as a list of row dicts
records_list = df_data.to_dict('records')
keys_to_check = ['service_name', 'service_plan_name', 'CPU','RAM','NIC','DRIVE']
processed = {}
#Go through unique ids
for key in unique_ids:
    processed[key] = {}
    #Get related records
    matching_records = [y for y in records_list if y['ods_service_ids'] == key]
    #Loop through records
    for record in matching_records:
        #For each key to check, save in dict if non null
        processed[key]['ods_service_ids'] = key
        for detail_key in keys_to_check:
            if record[detail_key] != "NO DATA":
                processed[key][detail_key] = record[detail_key]
                ##Note : doesn't handle duplicate values for different keys so far
#Records are put back in list
output_data = [processed[x] for x in processed.keys()]
# -> to Pandas
df = pd.DataFrame(output_data)[['ods_service_ids','service_name', 'service_plan_name', 'CPU','RAM','NIC','DRIVE']]
#Export to Excel
df.to_excel("output.xlsx",sheet_name='Sheet_name_1', index=False)
The above should work, but I am not sure how you want duplicate records for the same ID to be saved. Would you like them stored as DRIVE_0, DRIVE_1, DRIVE_2?
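If numbered columns are what is wanted, one way to get them is sketched below. This is only an illustration under assumptions: the function name `merge_records`, the sample records, and the rule "a column gets a numeric suffix only when one ID has more than one distinct value for it" are all made up for this example and are not part of the code above.

```python
def merge_records(records, id_key="ods_service_ids", placeholder="NO DATA"):
    """Collapse records sharing an id; number repeated distinct values.

    Hypothetical helper: name, defaults and suffix rule are assumptions.
    """
    seen = {}  # id -> {column -> distinct non-placeholder values, in order}
    for record in records:
        columns = seen.setdefault(record[id_key], {})
        for column, value in record.items():
            if column == id_key or value == placeholder:
                continue
            values = columns.setdefault(column, [])
            if value not in values:
                values.append(value)

    merged = []
    for key, columns in seen.items():
        row = {id_key: key}
        for column, values in columns.items():
            if len(values) == 1:
                row[column] = values[0]
            else:
                # Several distinct values: DRIVE_0, DRIVE_1, ...
                for n, value in enumerate(values):
                    row[f"{column}_{n}"] = value
        merged.append(row)
    return merged


# Made-up sample data in the shape produced by df.to_dict('records')
records = [
    {"ods_service_ids": 1, "service_name": "web", "CPU": "4", "DRIVE": "NO DATA"},
    {"ods_service_ids": 1, "service_name": "web", "CPU": "NO DATA", "DRIVE": "100GB"},
    {"ods_service_ids": 1, "service_name": "web", "CPU": "NO DATA", "DRIVE": "500GB"},
]
result = merge_records(records)
print(result)
```

The resulting list of dicts can be passed to `pd.DataFrame(result)`; IDs that lack a given numbered column simply get NaN there.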
The df can also be exported in a different way. Replace the line below #Export to Excel with the following:
df.to_excel("output.xlsx",sheet_name='Sheet_name_1')
Without the input data it was hard to spot any flaws. I have corrected the code above using fake data.
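Answer 1: (score: 1)
To be honest, I think you have got confused about the data structure and come up with something far more complicated than you need.
A suitable approach is to use one Python dictionary per service and update it row by row.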
from openpyxl import Workbook, load_workbook
wb = load_workbook("sample.xlsx")
ws = wb.active
objs = {}
headers = next(ws.iter_rows(min_row=1, max_row=1, values_only=True))
for row in ws.iter_rows(min_row=2, values_only=True):
    if row[0] not in objs:
        # first row for this service: keep every column
        objs[row[0]] = dict(zip(headers, row))
    else:
        # later rows: update the stored dict with non-empty values only
        extra = {key: value for key, value in zip(headers[3:], row[3:]) if value not in (None, "NULL")}
        objs[row[0]].update(extra)
# write to new workbook
wb2 = Workbook()
ws2 = wb2.active
ws2.append(headers)
for obj in objs.values():  # do they need sorting?
    ws2.append([obj[key] for key in headers])
wb2.save("output.xlsx")
Note how all of this is done without using any counters.
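Since `iter_rows(values_only=True)` yields plain tuples, the merging step can also be tried without any workbook. The sketch below is an illustration only: the function name `merge_rows`, the sample rows, and the choice to treat both `None` and the string "NULL" as empty are assumptions made for this example.

```python
def merge_rows(headers, rows, null_markers=(None, "NULL")):
    """Merge rows that share the same first column into one row each.

    Hypothetical helper: the name and null_markers default are assumptions.
    """
    merged = {}
    for row in rows:
        key = row[0]
        if key not in merged:
            # First row for this key: keep every column as-is
            merged[key] = dict(zip(headers, row))
        else:
            # Later rows: only overwrite with values that are not empty
            for name, value in zip(headers, row):
                if value not in null_markers:
                    merged[key][name] = value
    return [[obj[name] for name in headers] for obj in merged.values()]


# Made-up sample data standing in for the worksheet contents
headers = ("ods_service_id", "service_name", "service_plan_name", "CPU", "RAM")
rows = [
    (1, "web", "basic", "4", "NULL"),
    (1, "web", "basic", "NULL", "16GB"),
    (2, "db", "pro", "NULL", "8GB"),
]
result = merge_rows(headers, rows)
for line in result:
    print(line)
```

A column that is never filled in for a given service keeps whatever marker its first row had, so the CPU of service 2 stays "NULL" here.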
Answer 2: (score: -1)
I would suggest using the pandas library for this; with it you can easily do any kind of transformation.
import pandas as pd
exceldata = pd.read_excel('tmp.xlsx', index_col=0)
print(exceldata)
You can easily drop null/NA values, or replace them, and then export the result to Excel format.
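As a rough sketch of what that could look like for the question's case: the code below assumes empty cells arrive as the literal string "NULL" (as the question's code suggests), uses made-up sample rows instead of reading a file, and relies on `GroupBy.first()` returning the first non-null value of each column per group. The function name `collapse_rows` is hypothetical.

```python
import pandas as pd


def collapse_rows(df, id_column="ods_service_id", null_marker="NULL"):
    """Return one row per id, taking the first non-null value per column.

    Hypothetical helper: the name and defaults are assumptions.
    """
    # Turn the textual marker into real missing values
    cleaned = df.mask(df.eq(null_marker))
    # sort=False keeps the ids in order of first appearance
    return cleaned.groupby(id_column, sort=False, as_index=False).first()


# Made-up sample data; in practice this would come from pd.read_excel(...)
df = pd.DataFrame(
    {
        "ods_service_id": [1, 1, 1, 2],
        "service_name": ["web", "web", "web", "db"],
        "CPU": ["4", "NULL", "NULL", "NULL"],
        "RAM": ["NULL", "16GB", "NULL", "8GB"],
    }
)
merged = collapse_rows(df)
print(merged)
# merged.to_excel("output.xlsx", index=False)  # export step, needs an Excel writer such as openpyxl
```

If one ID has several different non-null values in the same column, only the first is kept, so this fits only when each service has at most one value per column.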
Reference for further help: