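I am having trouble appending data for missing rows in a csv file. For each customer I read the rows from the csv file and append each row's data to a list. Every customer needs to have the same set of IDs, highlighted in green in the example image. If the next customer does not have rows for all of the required IDs, I still need to append 0 values to the list for those missing rows. So the customer highlighted in yellow needs the same number of values added to the data list as the one in green.
I am trying to read each row and compare its ID against a list of all possible IDs that I created, but I always get stuck on the first ID. I am also not sure that re-reading the previous row until its ID equals the expected ID from the list is the right approach (I do this so that the missing rows get added to the list). Do you have any suggestions?
Note: Looking only at the ID column, for these two customers I want the list to look like the `list_with_ids` shown at the end of this question. So I am looking for a way, once I reach row 409 (shown in yellow), to first append the required ID 410, and only then 409, and so on. Likewise, the two missing IDs should be appended at the end: 403, 402.
Code: def write_data(workbook): [...]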
list_with_ids = [410, 409, 408, 407, 406, 405, 403, 402, **410, 409, 408, 407, 406, 405, 403, 402**]
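Answer 0: (score: 0)
Using the input below, which contains columns of random data, consider the following data wrangling with list comprehensions and loops:
Input data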
# Cust ID Data1 Data2 Data3 Data4 Data5
# 2011 62,404 0.269101238 KPT 0.438881697 UAX 0.963170513
# 2011 62,405 0.142397746 XYD 0.51668728 PTQ 0.761695425
# 2011 62,406 0.782342616 QCN 0.259141256 FNX 0.870971924
# 2011 62,407 0.221750017 EIU 0.358439487 MAN 0.13633062
# 2011 62,408 0.097509568 CRU 0.410058705 BFK 0.680228327
# 2011 62,409 0.322871333 LAC 0.489425167 GUX 0.449476844
# 919 62,403 0.371461633 PUR 0.626146074 KWX 0.525711736
# 919 62,404 0.384859932 AJZ 0.223408599 JSU 0.914916663
# 919 62,405 0.020630503 SFY 0.260778598 VUU 0.213559498
# 919 62,406 0.952425138 EBI 0.59595738 ZYU 0.283794413
# 919 62,407 0.410368534 BTT 0.252698401 FFY 0.41080646
# 919 62,408 0.553390336 GMA 0.846309022 BIN 0.049852419
# 919 62,409 0.193437955 NBB 0.877311494 XQX 0.080656637
Python code
import csv

i = 0
data = []

# READ CSV AND CAPTURE HEADERS AND DATA
with open('Input.csv', 'r') as f:
    rdr = csv.reader(f)
    for line in rdr:
        if i == 0:
            headers = line
        else:
            # STRIP THOUSANDS SEPARATOR SO IDS COMPARE AS INTEGERS
            line[1] = int(line[1].replace(',',''))
            data.append(line)
        i += 1

# CREATE NEEDED LISTS
cust_list = list(set([i[0] for i in data]))
id_list = [62402,62403,62404,62405,62406,62407,62408,62409,62410]

# CAPTURE MISSING IDS BY CUSTOMER
for c in cust_list:
    currlist = [d[1] for d in data if d[0] == c]
    missingids = [i for i in id_list if i not in currlist]
    for m in missingids:
        data.append([c, m,'','','','',''])

# WRITE DATA TO NEW CSV IN SORTED ORDER
with open('Output.csv', 'w') as f:
    wtr = csv.writer(f, lineterminator='\n')
    wtr.writerow(headers)
    for c in cust_list:
        for i in sorted(id_list, reverse=True):
            for d in data:
                if d[0] == c and d[1] == i:
                    wtr.writerow(d)
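As a variation on the loops above, the rows can be indexed by (customer, ID) so that each expected pair is looked up once instead of rescanning `data`. This is only a minimal sketch: the function name `fill_missing_ids`, its parameters, and the small in-memory sample are made up for illustration, and it pads with 0 (as the question asks) rather than with empty strings.

```python
import csv
import io

def fill_missing_ids(rows, id_list, n_data_cols, fill=0):
    """Return one row per (customer, id) pair, padding absent pairs with `fill`."""
    # Index the existing rows by (customer, id) for direct lookup
    by_key = {(r[0], r[1]): r for r in rows}
    # Keep customers in the order they first appear in the file
    customers = list(dict.fromkeys(r[0] for r in rows))
    out = []
    for cust in customers:
        for id_ in sorted(id_list, reverse=True):
            padded = [cust, id_] + [fill] * n_data_cols
            out.append(by_key.get((cust, id_), padded))
    return out

# Hypothetical sample data, not the input shown in the answer
sample = io.StringIO(
    'Cust,ID,Data1\n'
    '2011,"62,404",0.5\n'
    '919,"62,403",0.25\n'
    '919,"62,404",0.75\n'
)
rdr = csv.reader(sample)
headers = next(rdr)
# Strip the thousands separator so IDs compare as integers
rows = [[r[0], int(r[1].replace(',', ''))] + r[2:] for r in rdr]
filled = fill_missing_ids(rows, id_list=[62403, 62404], n_data_cols=1)
```

Here customer 2011 has no row for 62403, so `filled` gains a padded row for that pair while the existing rows pass through unchanged.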
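Output data
Answer 1: (score: 0)
Also consider third-party Python modules such as the data analysis package pandas, or even an SQL solution using pyodbc, since the Jet/ACE SQL engine built into Windows can query CSV files directly.
You will notice that both the solutions below and the previous one need a fair amount of processing to remove the thousands comma separator from the ID column, because the modules initially read those values as strings. If you remove the commas from the original csv file, you can cut down the lines of code.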
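If editing the csv file is not an option, pandas can also strip the separator while reading, through the `thousands` parameter of `read_csv`. The sketch below assumes the ID values are quoted in the file (for example "62,404") so that the comma is not taken as a field delimiter; the in-memory sample is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample with quoted, comma-separated IDs
sample = io.StringIO(
    'Cust,ID,Data1\n'
    '2011,"62,404",0.5\n'
    '919,"62,403",0.25\n'
)

# thousands=',' tells the parser to drop the separator in numeric fields
df = pd.read_csv(sample, thousands=',')
ids = df['ID'].tolist()
```

With this, the ID column arrives as integers and the later `str.replace` step is not needed.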
Pandas (left merge of two data frames)
import pandas as pd

df = pd.read_csv('Input.csv')

cust_list = df['Cust'].unique()
id_list = [62402,62403,62404,62405,62406,62407,62408,62409,62410]

# BUILD EVERY CUSTOMER/ID COMBINATION
ids = pd.DataFrame({'Cust': [int(c) for i in id_list for c in cust_list],
                    'ID': [int(i) for i in id_list for c in cust_list]})

# REMOVE THOUSANDS SEPARATOR BEFORE MERGING ON ID
df['ID'] = df['ID'].str.replace(',','').astype(int)

df = ids.merge(df, on=['Cust', 'ID'], how='left').\
         sort_values(['Cust', 'ID'], ascending=[True, False])
df.to_csv('Output_pandas.csv', index=False)
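The left merge leaves missing values (NaN) in the data columns of the added rows. Since the question asks for 0 values, one possible alternative is to reindex against the full set of customer/ID pairs with `fill_value=0`. This is a sketch with made-up sample data, and it assumes each (Cust, ID) pair occurs at most once, which `reindex` requires.

```python
import pandas as pd

# Hypothetical sample: customer 2011 lacks ID 62403
df = pd.DataFrame({'Cust': [2011, 919, 919],
                   'ID': [62404, 62403, 62404],
                   'Data1': [0.5, 0.25, 0.75]})
id_list = [62403, 62404]

# Every customer paired with every required ID
full_index = pd.MultiIndex.from_product([df['Cust'].unique(), id_list],
                                        names=['Cust', 'ID'])

filled = (df.set_index(['Cust', 'ID'])
            .reindex(full_index, fill_value=0)   # absent pairs get 0 in every column
            .reset_index()
            .sort_values(['Cust', 'ID'], ascending=[True, False])
            .reset_index(drop=True))
```

Note that a single `fill_value` applies to every column, so text columns such as Data2 would also receive 0; a per-column `fillna` after the merge may fit better if those should stay blank.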
PyODBC (Windows machines only; uses a left join across two csv files)
import csv
import pyodbc

# RAW STRINGS KEEP THE BACKSLASHES IN THE FOLDER PATH INTACT
conn = pyodbc.connect(r'Driver={Microsoft Access Text Driver (*.txt, *.csv)};'
                      r'DBQ=C:\Path\To\CSV\Files;Extensions=asc,csv,tab,txt;',
                      autocommit=True)
cur = conn.cursor()
cust_list = [i[0] for i in cur.execute("SELECT DISTINCT c.Cust FROM Input.csv c")]
id_list = [62402,62403,62404,62405,62406,62407,62408,62409,62410]
cur.close()

# WRITE EVERY CUSTOMER/ID COMBINATION TO A HELPER CSV
with open('ID_list.csv', 'w') as f:
    wtr = csv.writer(f, lineterminator='\n')
    wtr.writerow(['Cust', 'ID'])
    for item in [[int(c),int(i)] for c in cust_list for i in id_list]:
        wtr.writerow(item)

# COPY INPUT WITH THOUSANDS SEPARATORS REMOVED FROM ID (HEADER ROW UNCHANGED)
i = 0
with open('Input.csv', 'r') as f1, open('Input_without_commas.csv', 'w') as f2:
    rdr = csv.reader(f1)
    wtr = csv.writer(f2, lineterminator='\n')
    for line in rdr:
        if i > 0:
            line[1] = int(line[1].replace(',',''))
        wtr.writerow(line)
        i += 1

strSQL = "SELECT i.Cust, i.ID, c.Data1, c.Data2, c.Data3, c.Data4, c.Data5 " +\
         " FROM ID_list.csv i" +\
         " LEFT JOIN Input_without_commas.csv c" +\
         " ON i.Cust = c.Cust AND i.ID = c.ID" +\
         " ORDER BY i.Cust, i.ID DESC"

cur = conn.cursor()
with open('Output_sql.csv', 'w') as f:
    wtr = csv.writer(f, lineterminator='\n')
    wtr.writerow(['Cust', 'ID', 'Data1', 'Data2', 'Data3', 'Data4', 'Data5'])
    for i in cur.execute(strSQL):
        wtr.writerow(i)
cur.close()
conn.close()
Output (for both solutions above)