Python: Reading the same rows from a csv file - logic

Asked: 2017-02-20 23:07:49

Tags: python

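I'm having trouble appending data for missing rows in a csv file: I read the rows from the csv file for each customer and append the data each row holds to a list. Every customer needs to have the same IDs, which are highlighted in green in the example image. If the next customer does not have rows for all the required IDs, I still need to append 0 values to the list for those missing rows. So the customer highlighted in yellow needs the same number of values added to the data list as the green one.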

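I'm trying to read each row and compare its ID against a list I created of all possible IDs, but I always get stuck on the first ID. I'm also not sure whether re-reading the previous row until its ID equals the current ID in the list of possible IDs is the right approach (I do this so I can add the missing rows to the list). Please let me know if you have any suggestions.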

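Note: considering only the ID column, for these two customers I want the list to look like the `list_with_ids` shown below. So I'm looking for a way, once I'm at row 409 (shown in yellow), to first append the first required ID, 410, and only then 409, and so on. Likewise, the two missing IDs, 403 and 402, should be appended at the end.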

Code: def write_data(workbook):     [...]

list_with_ids = [410, 409, 408, 407, 406, 405, 403, 402, **410, 409, 408, 407, 406, 405, 403, 402**]
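The padding described in the question can be sketched as follows, looking only at the ID column. This is a minimal illustration, not the asker's code: `REQUIRED_IDS` and `pad_ids` are hypothetical names, and the two customers' IDs are made up so that the second one lacks 410, 403 and 402, as in the note above. Instead of re-reading rows, it walks the required IDs once per customer and marks which ones were found.

```python
REQUIRED_IDS = [410, 409, 408, 407, 406, 405, 403, 402]  # order taken from the question


def pad_ids(found_ids, required_ids=REQUIRED_IDS):
    """Return (id, present) pairs in required order; present is False for padded rows."""
    found = set(found_ids)
    return [(rid, rid in found) for rid in required_ids]


# Hypothetical customers: the first has every ID, the second lacks 410, 403 and 402.
customers = [
    [410, 409, 408, 407, 406, 405, 403, 402],
    [409, 408, 407, 406, 405],
]

list_with_ids = []
values = []  # one illustrative value per row: 1 if the row existed, 0 if it was padded
for found_ids in customers:
    for rid, present in pad_ids(found_ids):
        list_with_ids.append(rid)
        values.append(1 if present else 0)

print(list_with_ids)
print(values)
```

In real use the `1` would be replaced by the data read from the row, while the `0` stays as the filler value for a missing row.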

2 Answers:

Answer 0 (score: 0)

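Given the following input data, which includes columns of random data, consider the following data wrangling with list comprehensions and loops:

Input data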

# Cust  ID      Data1        Data2  Data3        Data4  Data5
# 2011  62,404  0.269101238  KPT    0.438881697  UAX    0.963170513
# 2011  62,405  0.142397746  XYD    0.51668728   PTQ    0.761695425
# 2011  62,406  0.782342616  QCN    0.259141256  FNX    0.870971924
# 2011  62,407  0.221750017  EIU    0.358439487  MAN    0.13633062
# 2011  62,408  0.097509568  CRU    0.410058705  BFK    0.680228327
# 2011  62,409  0.322871333  LAC    0.489425167  GUX    0.449476844
# 919   62,403  0.371461633  PUR    0.626146074  KWX    0.525711736
# 919   62,404  0.384859932  AJZ    0.223408599  JSU    0.914916663
# 919   62,405  0.020630503  SFY    0.260778598  VUU    0.213559498
# 919   62,406  0.952425138  EBI    0.59595738   ZYU    0.283794413
# 919   62,407  0.410368534  BTT    0.252698401  FFY    0.41080646
# 919   62,408  0.553390336  GMA    0.846309022  BIN    0.049852419
# 919   62,409  0.193437955  NBB    0.877311494  XQX    0.080656637

Python code

import csv

data = []
# READ CSV AND CAPTURE HEADERS AND DATA
with open('Input.csv', 'r', newline='') as f:
    rdr = csv.reader(f)
    headers = next(rdr)                            # first row holds the column names
    for line in rdr:
        line[1] = int(line[1].replace(',', ''))    # strip thousands separator: '62,404' -> 62404
        data.append(line)

# CREATE NEEDED LISTS
cust_list = list(set([i[0] for i in data]))
id_list = [62402,62403,62404,62405,62406,62407,62408,62409,62410]

# CAPTURE MISSING IDS BY CUSTOMER
for c in cust_list:
    currlist = [d[1] for d in data if d[0] == c]
    missingids = [i for i in id_list if i not in currlist]
    for m in missingids:
        data.append([c, m,'','','','',''])

# WRITE DATA TO NEW CSV IN SORTED ORDER
with open('Output.csv', 'w') as f:
    wtr = csv.writer(f, lineterminator='\n')
    wtr.writerow(headers)
    for c in cust_list:
        for i in sorted(id_list, reverse=True):
            for d in data:  
                if d[0] == c and d[1] == i:
                    wtr.writerow(d)

Output data

[image: Output Data]
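As a variation on the answer above, the rows can be keyed by `(customer, ID)` in a dictionary, so that writing the output is a single lookup per pair instead of a scan over all rows. This is only a sketch under stated assumptions: the three-row sample, the shortened `id_list`, and the in-memory `io.StringIO` buffers are made up for illustration and stand in for `Input.csv` and `Output.csv`.

```python
import csv
import io

# Hypothetical three-row sample standing in for Input.csv
sample = io.StringIO(
    'Cust,ID,Data1\n'
    '2011,"62,404",0.27\n'
    '919,"62,403",0.37\n'
    '919,"62,404",0.38\n'
)
id_list = [62403, 62404, 62405]  # shortened list of required IDs

rdr = csv.reader(sample)
headers = next(rdr)
rows = {}    # (customer, id) -> full row
custs = []   # customers in order of first appearance
for line in rdr:
    line[1] = int(line[1].replace(',', ''))
    rows[(line[0], line[1])] = line
    if line[0] not in custs:
        custs.append(line[0])

out = io.StringIO()
wtr = csv.writer(out, lineterminator='\n')
wtr.writerow(headers)
blank = [''] * (len(headers) - 2)   # empty cells for a missing row
for c in custs:
    for i in sorted(id_list, reverse=True):
        # Fall back to a blank row when the customer has no row for this ID
        wtr.writerow(rows.get((c, i), [c, i] + blank))

print(out.getvalue())
```

Keeping `custs` as a list also preserves the customer order of the input file, which a `set` does not guarantee.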

Answer 1 (score: 0)

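Also consider third-party Python modules such as the data analysis package pandas, or even an SQL solution using pyodbc, since the Jet/ACE SQL engine built into Windows can query CSV files directly.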

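You will notice that the solutions below, like the previous one, need a fair amount of processing to remove the thousands comma separator from the ID column, because the modules initially treat those values as strings. If you remove such commas from the original csv file, you can cut down the lines of code.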

Pandas (left merge of two data frames)

import pandas as pd

df = pd.read_csv('Input.csv')

cust_list = df['Cust'].unique()
id_list = [62402,62403,62404,62405,62406,62407,62408,62409,62410]

ids = pd.DataFrame({'Cust': [int(c) for i in id_list for c in cust_list],
                    'ID': [int(i) for i in id_list for c in cust_list]})

df['ID']  = df['ID'].str.replace(',','').astype(int)

df = ids.merge(df, on=['Cust', 'ID'], how='left').\
               sort_values(['Cust', 'ID'], ascending=[True, False])

df.to_csv('Output_pandas.csv', index=False)

PyODBC (Windows machines only, using a left join on two csv files)

import csv
import pyodbc

# Raw strings keep the backslashes in the DBQ path literal
conn = pyodbc.connect(r'Driver=Microsoft Access Text Driver (*.txt, *.csv);'
                      r'DBQ=C:\Path\To\CSV\Files;Extensions=asc,csv,tab,txt;',
                      autocommit=True)
cur = conn.cursor()
cust_list = [i[0] for i in cur.execute("SELECT DISTINCT c.Cust FROM Input.csv c")]
id_list = [62402,62403,62404,62405,62406,62407,62408,62409,62410]
cur.close()

with open('ID_list.csv', 'w') as f:
    wtr = csv.writer(f, lineterminator='\n')
    wtr.writerow(['Cust', 'ID'])
    for item in [[int(c),int(i)] for c in cust_list for i in id_list]:
        wtr.writerow(item)    
i = 0
with open('Input.csv', 'r') as f1, open('Input_without_commas.csv', 'w') as f2:
    rdr = csv.reader(f1); wtr = csv.writer(f2, lineterminator='\n')    
    for line in rdr:
        if i > 0:
            line[1] = int(line[1].replace(',',''))
        wtr.writerow(line)
        i += 1 

strSQL = "SELECT i.Cust, i.ID, c.Data1, c.Data2, c.Data3, c.Data4, c.Data5 " +\
         " FROM ID_list.csv i" +\
         " LEFT JOIN Input_without_commas.csv c" +\
         " ON  i.Cust = c.Cust AND i.ID = c.ID" +\
         " ORDER BY i.Cust, i.ID DESC"

cur = conn.cursor()
with open('Output_sql.csv', 'w') as f:
    wtr = csv.writer(f, lineterminator='\n')
    wtr.writerow(['Cust', 'ID', 'Data1', 'Data2', 'Data3', 'Data4', 'Data5'])    
    for i in cur.execute(strSQL):        
        wtr.writerow(i)

cur.close()
conn.close()

Output (for both solutions above)

[image: CSV Output]
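The same left-join idea can be tried on any platform with the standard-library `sqlite3` module in place of the Jet/ACE engine. This is a swapped-in technique, not part of the answer above: the table names `input_rows` and `id_list`, the single `Data1` column, and the sample rows are hypothetical, and the IDs are assumed to be already stripped of commas.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE input_rows (Cust INTEGER, ID INTEGER, Data1 REAL)')
conn.execute('CREATE TABLE id_list (Cust INTEGER, ID INTEGER)')

# Hypothetical sample rows (IDs already stripped of commas)
conn.executemany('INSERT INTO input_rows VALUES (?, ?, ?)',
                 [(2011, 62404, 0.27), (919, 62403, 0.37), (919, 62404, 0.38)])

# Every (customer, required ID) pair, playing the role of ID_list.csv
required = [62403, 62404, 62405]
custs = [r[0] for r in conn.execute(
    'SELECT DISTINCT Cust FROM input_rows ORDER BY Cust')]
conn.executemany('INSERT INTO id_list VALUES (?, ?)',
                 [(c, i) for c in custs for i in required])

# The left join keeps every required pair; missing rows come back as NULL (None)
result = conn.execute(
    'SELECT i.Cust, i.ID, c.Data1 '
    'FROM id_list i '
    'LEFT JOIN input_rows c ON i.Cust = c.Cust AND i.ID = c.ID '
    'ORDER BY i.Cust, i.ID DESC'
).fetchall()
conn.close()

for row in result:
    print(row)
```

Rows whose `Data1` is `None` are the padded ones; they could be written out as empty cells or as 0, depending on what the downstream list needs.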