How can I remove duplicates from a CSV file while applying some checks?

Asked: 2018-11-19 16:20:16

Tags: python

I have a CSV file like this:

col-1(ID)       col-2(val-List)

1               [1]
1               [1,2,3]
2               [1,2]
2               [1]
3               [10]
3               [10]

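I want to remove the duplicates from this file so that, for each ID, only the row with the longest list remains, for example:

Edit:

If rows have the same ID and their lists are the same length, I want to keep just one of them.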

col-1(ID)       col-2(val-List)

1               [1,2,3]
2               [1,2]
3               [10]

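I have tried a lot without luck. I am trying to use the csv module, but I don't know how to keep track of the previous val-List length and compare it with the next row that has the same ID.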

import csv 
list_1 = []
with open('test123.csv', 'r', encoding='latin-1') as file:
    csvReader = csv.reader(file, delimiter=',')

    for row in csvReader:
        key = (row[0])
        # but how should I use this id to get my desired results?

1 Answer:

Answer 0 (score: 3)

Why not let pandas do the work?

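import ast

import pandas

# Read in the CSV
df = pandas.read_csv('test123.csv', encoding='latin-1')

# Compute the list lengths. Each cell is read as a string such as "[1,2,3]",
# so parse it into a real list first; len(list(x)) would count characters instead.
df['lst_len'] = df['col-2(val-List)'].map(lambda x: len(ast.literal_eval(x)))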

# Sort in reverse order by list lengths
df = df.sort_values('lst_len', ascending=False)

# Drop duplicates, preserving first (longest) list by ID,
# then restore the original row order
df = df.drop_duplicates(subset='col-1(ID)').sort_index()

# Remove extra column that we introduced, write to file
df = df.drop('lst_len', axis=1)
df.to_csv('clean_test123.csv', index=False)
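
If you would rather stay with the csv module, as in the question, one way to remember the best row seen so far for each ID is a dictionary keyed by ID. The following is only a minimal sketch under some assumptions: the sample data in SAMPLE and the helper name keep_longest are made up for illustration, and the list column is assumed to be quoted in the file so that the commas inside a list are not treated as field separators. It reads from an in-memory string so it runs on its own; for a real file you would pass the opened file object to csv.reader instead.

```python
import ast
import csv
import io

# Hypothetical sample data. Assumption: the list column is quoted in the file,
# so commas inside a list do not split the field.
SAMPLE = '''col-1(ID),col-2(val-List)
1,[1]
1,"[1,2,3]"
2,"[1,2]"
2,[1]
3,[10]
3,[10]
'''


def keep_longest(rows):
    """Return one row per ID, keeping the row whose list is longest.

    On a tie, the first row seen for that ID is kept.
    """
    best = {}  # ID -> (list length, row); dicts preserve insertion order
    for row in rows:
        key, raw_list = row[0], row[1]
        # Parse the text "[1,2,3]" into a real list before measuring it.
        length = len(ast.literal_eval(raw_list))
        if key not in best or length > best[key][0]:
            best[key] = (length, row)
    return [row for _, row in best.values()]


reader = csv.reader(io.StringIO(SAMPLE))
header = next(reader)  # set the header row aside
result = keep_longest(reader)

# Write the header and the surviving rows back out as CSV text.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(header)
writer.writerows(result)
print(out.getvalue())
```

Because the dictionary keeps the order in which IDs were first seen, the output rows come out in the same ID order as the input, and the strict comparison (length > best length) means a later row replaces an earlier one only when its list is strictly longer.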