I am new to programming and want to write a program that transfers columns between file1.csv and file2.csv.
Input:
file1.csv looks like this:
ID,Nickname,Gender,SubjectPrefix,SubjectFirstName,Whatever1A,Whaterver2A,SubjectLastName
1,J.,M,Dr.,Jason,,,Allan
2,B.,M,Mr.,Brian,,,Welch
file2.csv looks like this:
nickname,gender,city,id,prefix_name,first_name,Whatever1B,last_name,Whatever2B,Whatever3B,Whatever4B
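Question:
How can I compare the headers of file1.csv and file2.csv to identify and then transfer the "common" columns between them? "Common" columns are those with similar naming conventions (e.g. ID and id, Nickname and nickname), or those that do not necessarily share a naming convention but store the same data (e.g. SubjectPrefix and prefix_name, SubjectFirstName and first_name).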
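Output:
The output should look like this.
Note: the transferred columns "id", "nickname", and "gender" are the ones with similar names in the file1.csv and file2.csv headers. The columns "prefix_name" and "first_name" correspond to "SubjectPrefix" and "SubjectFirstName", respectively.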
id,nickname,gender,prefix_name,first_name,last_name
1,J.,M,Dr.,Jason,Allan
2,B.,M,Mr.,Brian,Welch
I tried this code:
import csv
import collections

csv_file1 = "file1.csv"
csv_file2 = "file2.csv"
data1 = list(csv.reader(open(csv_file1, 'r')))
data2 = list(csv.reader(open(csv_file2, 'r')))
file1_header = data1[0][:]  # get the header from file1
file2_header = data2[0][:]  # get the header from file2
lowered_file1_header = [item.lower() for item in file1_header]  # lowercase file1 header
lowered_file2_header = [item.lower() for item in file2_header]  # lowercase file2 header anyways
col_index_dict = {}
for column in lowered_file1_header:
    if column == "subjectprefix":  # identify "subjectprefix" column in file1.csv
        col_index_dict[column] = lowered_file1_header.index(column)
    elif column == "subjectfirstname":  # identify "subjectfirstname" column in file1.csv
        col_index_dict[column] = lowered_file1_header.index(column)
    elif column in lowered_file2_header:  # identify the columns with same naming
        col_index_dict[column] = lowered_file1_header.index(column)
    else:
        col_index_dict[column] = -1  # mark the not matching columns
# Build header
output = [list(col_index_dict.keys())]
is_header = True
for row in data1:
    if is_header is False:
        rowData = []
        for column in col_index_dict:
            column_index = col_index_dict[column]
            if column_index != -1:
                rowData.append(row[column_index])
            else:
                rowData.append('')
        output.append(rowData)
    else:
        is_header = False
print(output)
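Any idea how to solve this problem?
Answer 0 (score: 1)
Welcome to programming. Let me introduce you to the amazing pandas library.
Off the top of my head, here is something that could solve your problem. (I am not saying it is efficient, so it could be an issue for large datasets.)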
import pandas as pd
df = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
df_columns = set(list(df.columns))
df2_columns = set(list(df2.columns))
common_columns = list(df_columns.intersection(df2_columns))
common_df = df[common_columns]
common_df2 = df2[common_columns]
## At this point you have the common columns for both CSV's. if you want
## to make them into one, just use df concatenate / append. else, you can save both of them like this:
common_df.to_csv('common1.csv')
common_df2.to_csv('common2.csv')
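Note that a plain set intersection compares header names exactly, so with the sample files `ID` and `id` would not match. A minimal sketch of a case-insensitive comparison follows, using only the standard library and the header rows from the question; the helper name `common_headers` is hypothetical.

```python
def common_headers(header1, header2):
    """Return the header2 names that also appear in header1, ignoring case."""
    seen = {name.strip().lower() for name in header1}
    return [name for name in header2 if name.strip().lower() in seen]


# Header rows copied from the question's sample files.
file1_header = ["ID", "Nickname", "Gender", "SubjectPrefix", "SubjectFirstName",
                "Whatever1A", "Whaterver2A", "SubjectLastName"]
file2_header = ["nickname", "gender", "city", "id", "prefix_name", "first_name",
                "Whatever1B", "last_name", "Whatever2B", "Whatever3B", "Whatever4B"]

print(common_headers(file1_header, file2_header))
# → ['nickname', 'gender', 'id']
```

This only catches names that differ in case or surrounding whitespace; pairs such as SubjectPrefix and prefix_name still need an explicit mapping.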
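Answer 1 (score: -1)
Thank you, Wboy, for your contribution; your input was very useful.
I was able to find a solution to the problem using the pandas library. Here is the code: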
import pandas as pd
# read the csv files
df = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
# lowercase the headers
df.columns = df.columns.str.lower()
df2.columns = df2.columns.str.lower()
df_columns = set(list(df.columns))
df2_columns = set(list(df2.columns))
Identify and transfer the "common" columns:
df3 = []  # names of the df2 columns that received data from df
for col in list(df_columns):
    for col2 in list(df2_columns):
        if col == "subjectprefix" and col2 == "prefix_name":
            # copy the data from df["subjectprefix"] column to df2["prefix_name"] column in df2 dataframe
            df2["prefix_name"] = df["subjectprefix"]
            df3.append(col2)
        elif col == "subjectfirstname" and col2 == "first_name":
            # copy the data from "subjectfirstname" column to "first_name" column
            df2["first_name"] = df["subjectfirstname"]
            df3.append(col2)
        elif col == "subjectlastname" and col2 == "last_name":
            # copy the data from "subjectlastname" column to "last_name" column
            df2["last_name"] = df["subjectlastname"]
            df3.append(col2)
        elif col == col2:
            # copy the exactly matching columns to df2
            df2[col2] = df[col]
            df3.append(col2)
Remove the "uncommon" columns from the dataframe df2:
for col2 in list(df2_columns):
    if col2 not in df3:
        del df2[col2]
# print the output
df2.set_index("id",inplace=True)
print(df2)
Save the output as a .csv file:
df2.to_csv('output.csv')
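I am sure this is not the best solution, and I hope the code can be improved in how it identifies and transfers the "common" columns. My code is already full of if/elif statements, and I believe there must be a better way to do this.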
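One way to avoid a growing if/elif chain is to keep the name correspondences in a dictionary and look each header up in it. The following is a minimal sketch using only the standard csv module, not a definitive implementation; the `ALIASES` table, the function name `transfer_common_columns`, and the in-memory copies of the sample data are assumptions made for illustration.

```python
import csv
import io

# Hypothetical alias table: lowercased file1 header -> file2 header.
ALIASES = {
    "subjectprefix": "prefix_name",
    "subjectfirstname": "first_name",
    "subjectlastname": "last_name",
}


def transfer_common_columns(file1, file2_header):
    """Return (header, rows) holding file1's data under file2's column names."""
    reader = csv.DictReader(file1)
    # Map each file1 column to its file2 name: alias first, else lowercase.
    target = {}
    for name in reader.fieldnames:
        key = name.lower()
        target[name] = ALIASES.get(key, key)
    wanted = set(file2_header)
    # Keep only the columns that exist in file2, in file1's order.
    kept = [name for name in reader.fieldnames if target[name] in wanted]
    header = [target[name] for name in kept]
    rows = [[row[name] for name in kept] for row in reader]
    return header, rows


# In-memory stand-ins for the question's sample files.
file1 = io.StringIO(
    "ID,Nickname,Gender,SubjectPrefix,SubjectFirstName,Whatever1A,Whaterver2A,SubjectLastName\n"
    "1,J.,M,Dr.,Jason,,,Allan\n"
    "2,B.,M,Mr.,Brian,,,Welch\n"
)
file2_header = ("nickname,gender,city,id,prefix_name,first_name,Whatever1B,"
                "last_name,Whatever2B,Whatever3B,Whatever4B").split(",")

header, rows = transfer_common_columns(file1, file2_header)
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
print(out.getvalue())
```

Adding another correspondence then means adding one entry to `ALIASES` rather than another elif branch. With pandas, the same dictionary could be passed to `DataFrame.rename(columns=...)` after lowercasing the headers.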