Input:
I have two CSV files (file1.csv and file2.csv).
file1 looks like:
ID,Name,Gender
1,Smith,M
2,John,M
file2 looks like:
name,gender,city,id
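Problem:
I want to compare file1's header with file2's header and copy the data for the matching columns. The header names in file1 need to be lowercased before looking for the matching columns in file2.
Output:
The output should look like this: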
name,gender,city,id # name,gender,and id are the only matching columns btw file1 and file2
Smith,M, ,1 # the data copied for name, gender, and id columns
John,M, ,2
Here is the code I have tried so far:
import csv
file1 = csv.DictReader(open("file1.csv")) #reading file1.csv
file1_Dict = {} # the dictionary of lists that will store the keys and values as list
for row in file1:
    for column, value in row.iteritems():
        file1_Dict.setdefault(column,[]).append(value)
for key in file1_Dict: # converting the keys of the dictionary to lowercase
    file1_Dict[key.lower()] = file1_Dict.pop(key)
file2 = open("file2.csv") #reading file2.csv
file2_Dict ={} # store the keys into a dictionary with empty values
for row2 in file2:
    row2 = row2.split(",")
    for i in row2:
        file2_Dict[i] = ""
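Any idea how to solve this?
Answer 0: (score: 1)
You don't need Python for this. This is a task for SQL.
SQLite Browser supports CSV import. Follow these steps to get the desired output:
Now you can decide how to match the data sets. If you only want to match the files on ID, you can do the following: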
select *
from file1 f1
inner join file2 f2
on f1.id = f2.id
If you want to match on every column, you can do the following:
select *
from file1 f1
inner join file2 f2
on f1.id = f2.id and f1.name = f2.name and f1.gender = f2.gender
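Finally, just export the query results back to CSV.
I have spent a lot of time trying to do this kind of task in scripting languages. The benefit of SQL is that you only state what you want to match and let the database do the optimization for you. It usually ends up doing the matching faster than any code I would write.
If you are interested, Python also ships with a SQLite module (sqlite3) out of the box. For the reasons above, I tend to use it as the data source in my Python scripts; I simply import the required CSVs in SQLite Browser before running the script.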
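For readers who would rather stay inside Python, the ID join above can be sketched with the standard sqlite3 module and an in-memory database. This is only an illustration, not the workflow described in this answer: the helper names load_csv and join_on_id are made up, the CSV text is inlined instead of being imported through SQLite Browser, and the data row in file2 is invented so that the join has something to match.

```python
import csv
import io
import sqlite3

# Sample data. file1 comes from the question; the data row in file2 is
# hypothetical, added only so the inner join returns a result.
FILE1_CSV = "ID,Name,Gender\n1,Smith,M\n2,John,M\n"
FILE2_CSV = "name,gender,city,id\nSmith,M,London,1\n"


def load_csv(conn, table, text):
    """Create `table` from CSV text, using the lowercased header as column names."""
    rows = list(csv.reader(io.StringIO(text)))
    header = [name.strip().lower() for name in rows[0]]
    columns = ", ".join('"{}"'.format(name) for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), rows[1:]
    )


def join_on_id(file1_text, file2_text):
    """Run the ID join and return the result as CSV text."""
    conn = sqlite3.connect(":memory:")
    try:
        load_csv(conn, "file1", file1_text)
        load_csv(conn, "file2", file2_text)
        cursor = conn.execute(
            "SELECT f1.id AS id, f1.name AS name, f1.gender AS gender, "
            "f2.city AS city "
            "FROM file1 f1 INNER JOIN file2 f2 ON f1.id = f2.id"
        )
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        # cursor.description holds one tuple per result column; item 0 is the name.
        writer.writerow([col[0] for col in cursor.description])
        writer.writerows(cursor)
        return out.getvalue()
    finally:
        conn.close()


if __name__ == "__main__":
    print(join_on_id(FILE1_CSV, FILE2_CSV), end="")
```

Because an inner join only returns rows present in both tables, a file2 that contains nothing but a header row (as in the question) would produce an empty result with this query.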
Answer 1: (score: 1)
I solved this in Python, without considering performance. It took me quite a while, phew!
Here is my solution.
import csv
csv_data1_filepath = './file1.csv'
csv_data2_filepath = './file2.csv'
def main():
    # load both CSV files into memory
    with open(csv_data1_filepath, 'r') as f1:
        data1 = list(csv.reader(f1))
    with open(csv_data2_filepath, 'r') as f2:
        data2 = list(csv.reader(f2))
    file1_header = data1[0][:] # Get file1 header
    file2_header = data2[0][:] # Get file2 header
    lowered_file1_header = [item.lower() for item in file1_header] # lowercase it
    lowered_file2_header = [item.lower() for item in file2_header] # do it for header 2 anyway
    col_index_dict = {}
    for column in lowered_file1_header:
        if column in lowered_file2_header:
            col_index_dict[column] = lowered_file1_header.index(column)
        else:
            col_index_dict[column] = -1 # mark as column that will not be worked on later
    for column in lowered_file2_header:
        if column not in lowered_file1_header:
            col_index_dict[column] = -1 # mark as column that will not be worked on later
    # build header
    output = [list(col_index_dict.keys())]
    is_header = True
    for row in data1:
        if is_header is False:
            rowData = []
            for column in col_index_dict:
                column_index = col_index_dict[column]
                if column_index != -1:
                    rowData.append(row[column_index])
                else:
                    rowData.append('')
            output.append(rowData)
        else:
            is_header = False
    print(output)
if __name__ == '__main__':
    main()
This gives you the following output:
[
['gender', 'city', 'id', 'name'],
['M', '', '1', 'Smith'],
['M', '', '2', 'John']
]
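Note that the output loses file2's column order, because the columns follow the dictionary's key order, but this should be fixable by using an ordered dictionary.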
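One way to keep file2's column order is to drive the output from file2's header instead of from dictionary keys. The following is a minimal Python 3 sketch of that idea, not the code from this answer: the function name copy_matching_columns is hypothetical, the CSV content is passed in as text rather than read from disk, and file2's header is assumed to be lowercase already, as in the question.

```python
import csv
import io


def copy_matching_columns(file1_text, file2_text):
    """Return CSV text with file2's header and file1's data in matching columns."""
    # file2 supplies the target column order.
    target_header = next(csv.reader(io.StringIO(file2_text)))

    rows1 = csv.reader(io.StringIO(file1_text))
    # Lowercase file1's header before comparing it with file2's.
    header1 = [name.strip().lower() for name in next(rows1)]

    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=target_header,
        restval="",             # columns missing from file1 stay empty
        extrasaction="ignore",  # file1 columns absent from file2 are dropped
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows1:
        writer.writerow(dict(zip(header1, row)))
    return out.getvalue()


if __name__ == "__main__":
    file1 = "ID,Name,Gender\n1,Smith,M\n2,John,M\n"
    file2 = "name,gender,city,id\n"
    print(copy_matching_columns(file1, file2), end="")
```

With the sample data from the question, the city column is left empty because file1 has no matching column.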
Hope this helps.