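Novice Python programmer here. I know there are many SO posts related to this, but none of the solutions I've reviewed seem to fit my problem.
I have a variable number of csv files, all with the same number of columns. The header of the fourth column changes with each csv file (it's a Julian date). Incidentally, that fourth column stores surface temperatures from a satellite sensor. For example: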
UID,Latitude,Longitude,001
1,-151.01,45.20,13121
2,-151.13,45.16,15009
3,-151.02,45.09,10067
4,-151.33,45.03,14010
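I want to keep the first four columns (preferably from the first csv file in my list of files), and then join/merge the fourth column from all of the remaining csv files onto that first table. The final table would look something like this: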
UID,Latitude,Longitude,001,007,015,023,...
1,-151.01,45.20,13121,13129,13340,12995
2,-151.13,45.16,15009,15001,14997,15103
3,-151.02,45.09,10067,11036,10074,10921
4,-151.33,45.03,14010,14005,14102,14339
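I know the Pandas package would probably make this easier, but I'd rather not require a third-party package for this tool (which would force users to use easy_install, PIP, etc.). I also realize this would be simpler in an RDBMS, but again, I don't want that to be a requirement. So I'm using only the csv module.
I think I know how to go about this, and I assume I should write the merged rows to a new csv file. I've grabbed the header from the first csv file, and then I loop through each subsequent csv file to add the new column name to the header row. What I'm stuck on is how to write the values from the fourth column alongside the rows from the first csv file. All of the csv files have a UID column, and these should match.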
import csv

def build_table(acq_date_list, mosaic_io_array, input_dir, dir_list):
    acq_year = mosaic_io_array[0][0]
    out_dir = '%s\\%s\\' % (input_dir, dir_list[1])
    out_file = '%s%s_%s.%s' % (out_dir, 'LST_final', acq_year, 'csv')
    # get first csv file in the list of files
    first_file = acq_date_list[0][1]
    # open and read the first csv file
    with open(first_file, 'rb') as first_csv:
        r1 = csv.reader(first_csv, delimiter=',')
        header1 = next(r1)
        allrows1 = []
        row1 = next(r1)
        allrows1.append(row1)
    # open and write to the new csv
    with open(out_file, 'wb') as out_csv:
        w = csv.writer(out_csv, delimiter=',')
        # loop through the list of remaining csv files
        for acq_date in acq_date_list[1:]:  # skip the first csv file
            # open and read other csv files
            with open(acq_date[1], 'rb') as other_csv:
                rX = csv.reader(other_csv, delimiter=',')
                headerX = next(rX)
                header_row = '%s,%s' % (header1, headerX)
                # write header and subsequent merged rows to new csv file?
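Maybe after:
headerX = next(rX)
I could split the header row into a list and pull out the fourth item? The same would apply to the "other" csv files. Or is this the wrong approach altogether?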
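For reference, csv.reader already yields every row, including the header, as a list, so no manual splitting should be needed. A minimal sketch, assuming Python 3; the in-memory `sample` and the names `date_column` and `value` are illustrative stand-ins, not part of the original code:

```python
import csv
import io

# Hypothetical in-memory stand-in for one of the csv files.
sample = io.StringIO(
    "UID,Latitude,Longitude,001\n"
    "1,-151.01,45.20,13121\n"
)

reader = csv.reader(sample)
header = next(reader)      # the header row, already split into a list
date_column = header[3]    # fourth item: the Julian-date column name
first_row = next(reader)   # the first data row, also a list of strings
value = first_row[3]       # fourth value of that row
```

Every cell comes back as a string, so the temperature values would need an explicit int() conversion if they are to be used numerically.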
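Update 2/26/2016: I got the solution from Gijs partly working. It iterates and adds the header columns, but not the values for the rest of the rows. I'm still not sure how to fill the empty cells with values from the remaining csv files.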
Latitude,001,UID,Longitude,009,017,025,033,041
795670.198,13506,0,-1717516.429,,,,,
795670.198,13173,1,-1716125.286,,,,,
795670.198,13502,2,-1714734.143,,,,,
Answer 0 (score: 1)
Loop over the files, keep track of which keys exist, and write out all of the records using csv.DictWriter and csv.DictReader.
import csv

records = list()
all_keys = set()
for fn in ["table_1.csv", "table_2.csv"]:
    with open(fn) as f:
        reader = csv.DictReader(f)
        # remember every column name seen in any file
        all_keys.update(set(reader.fieldnames))
        for r in reader:
            records.append(r)

# "wb" is the Python 2 idiom; on Python 3 use open("table_merged.csv", "w", newline="")
with open("table_merged.csv", "wb") as f:
    writer = csv.DictWriter(f, fieldnames = all_keys)
    writer.writeheader()
    for r in records:
        writer.writerow(r)
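This writes an empty 'cell' for records that do not have a given column.
Using your file as both the first and the second .csv, with the last column renamed to 002 instead of 001 in the second one, you get the following result: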
UID,Longitude,002,001,Latitude
1,45.20,,13121,-151.01
2,45.16,,15009,-151.13
3,45.09,,10067,-151.02
4,45.03,,14010,-151.33
1,45.20,13121,,-151.01
2,45.16,15009,,-151.13
3,45.09,10067,,-151.02
4,45.03,14010,,-151.33
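In the result above the records are stacked, so each UID appears once per input file. To get one row per UID, the rows can instead be keyed on the UID column. The following is only a sketch under some assumptions: Python 3.7+ (plain dicts keep insertion order), every file has the date column in fourth position, and the function name `merge_on_uid` and its parameters are hypothetical.

```python
import csv

def merge_on_uid(paths, out_path, key="UID"):
    """Join csv files on `key`, appending each later file's fourth column."""
    merged = {}   # UID -> list of cell values, in first-file row order
    header = []
    for i, path in enumerate(paths):
        with open(path, newline="") as f:
            reader = csv.reader(f)
            file_header = next(reader)
            key_idx = file_header.index(key)
            if i == 0:
                # The first file supplies all of its columns.
                header = list(file_header)
                for row in reader:
                    merged[row[key_idx]] = list(row)
            else:
                # Later files contribute only their fourth column.
                header.append(file_header[3])
                for row in reader:
                    cells = merged.get(row[key_idx])
                    if cells is not None and len(cells) < len(header):
                        cells.append(row[3])
                # Pad UIDs that were missing from this file.
                for cells in merged.values():
                    if len(cells) < len(header):
                        cells.append("")
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(merged.values())
    return header, list(merged.values())
```

UIDs that occur only in a later file are ignored in this sketch, since the first file is treated as the master table.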
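If you want to keep the columns in a particular order, you have to make all_keys a list, and then add only those columns from each new file that are not already in all_keys.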
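The ordered variant could look roughly like this. It is a sketch assuming Python 3, and the helper name `collect_fieldnames` is hypothetical; the resulting list can be passed as fieldnames to csv.DictWriter.

```python
import csv

def collect_fieldnames(paths):
    """Return column names in first-seen order across all files."""
    all_keys = []   # a list, so the order of first appearance is preserved
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            # fieldnames is None for an empty file, hence the fallback
            for name in reader.fieldnames or []:
                if name not in all_keys:
                    all_keys.append(name)
    return all_keys
```

With the two example files this would keep UID, Latitude and Longitude first, followed by 001 and 002, instead of the arbitrary order a set produces.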
Answer 1 (score: 0)
Try the pandas approach:
import pandas as pd

file_list = ['1.csv','2.csv','3.csv']
df = pd.read_csv(file_list[0])
for f in file_list[1:]:
    # use only 1-st and 4-th columns ...
    tmp = pd.read_csv(f, usecols=[0, 3])
    df = pd.merge(df, tmp, on='UID')
df.to_csv('output.csv', index=False)
print(df)
Output:
UID Latitude Longitude 001 007 015
0 1 -151.01 45.20 13121 11111 11
1 2 -151.13 45.16 15009 22222 12
2 3 -151.02 45.09 10067 33333 13
3 4 -151.33 45.03 14010 44444 14
output.csv
UID,Latitude,Longitude,001,007,015
1,-151.01,45.2,13121,11111,11
2,-151.13,45.16,15009,22222,12
3,-151.02,45.09,10067,33333,13
4,-151.33,45.03,14010,44444,14