How to remove duplicates without pandas?

Date: 2019-02-24 03:16:13

Tags: python

This is the data:

row1| sbkjd nsdnak ABC 
row2| vknfe edcmmi ABC
row3| fjnfn msmsle XYZ
row4| sdkmm tuiepd XYZ
row5| adjck rulsdl LMN

I have already done this using pandas, with help from Stack Overflow, but I want to be able to remove the duplicates without using pandas or any library at all. Only one of the rows containing "ABC" must be chosen, only one of the rows containing "XYZ" must be chosen, and the last row is unique, so it should be chosen. How do I do this? My final output should contain this:

[ row1 or row2 + row3 or row4 + row5 ]

2 answers:

Answer 0 (score: 0)

This should select only the unique rows from the original table. If two or more rows share duplicate data, it picks the first one.

data = [["sbkjd", "nsdnak", "ABC"],
        ["vknfe", "edcmmi", "ABC"],
        ["fjnfn", "msmsle", "XYZ"],
        ["sdkmm", "tuiepd", "XYZ"],
        ["adjck", "rulsdl", "LMN"]]

def check_list_uniqueness(candidate_row, unique_rows):
    # Reject the candidate if any of its elements already appears
    # in a row that has been kept; otherwise accept it.
    for element in candidate_row:
        for unique_row in unique_rows:
            if element in unique_row:
                return False
    return True

final_rows = []
for row in data:
    # Keep only the first row seen for each group of duplicates.
    if check_list_uniqueness(row, final_rows):
        final_rows.append(row)

print(final_rows)
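
Note that this check treats a candidate row as a duplicate if any of its elements, not just the third column, already appears in a kept row; that happens to work for this data because the first two columns never repeat. If you only want to key on the third column, a minimal set-based sketch (my addition, not part of the answer above) would be:

seen_keys = set()
deduped = []
for row in data:  # reuses `data` from the snippet above
    key = row[-1]  # assumes the last element (e.g. "ABC") is the deduplication key
    if key not in seen_keys:
        seen_keys.add(key)
        deduped.append(row)

print(deduped)  # [['sbkjd', 'nsdnak', 'ABC'], ['fjnfn', 'msmsle', 'XYZ'], ['adjck', 'rulsdl', 'LMN']]

This keeps the first row seen for each key, matching the behavior described above.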

Answer 1 (score: 0)

This Bash command will do it (assuming your data is in a file named test, and the values in column 4 do not appear in any other column):

cut -d ' ' -f 4 test | tr '\n' ' ' | sed 's/\([a-zA-Z][a-zA-Z]*[ ]\)\1/\1/g' | tr ' ' '\n' | while read str; do grep -m 1 "$str" test; done

cut -d ' ' -f 4 test selects the data in the fourth column
tr '\n' ' ' turns the column into a single row (converts newlines to spaces)
sed 's/\([a-zA-Z][a-zA-Z]*[ ]\)\1/\1/g' removes adjacent duplicates
tr ' ' '\n' turns the row of unique values back into a column
while read str; do grep -m 1 "$str" test; done reads each unique word and prints the first line from test that matches it
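
For the sample data this should print the row1, row3, and row5 lines. For anyone who would rather stay in Python, the same first-match-per-key idea can be sketched like this (assuming the same file name test as in the answer above):

# Keep the first line in `test` for each distinct value in column 4,
# mirroring the grep pipeline above.
seen = set()
with open("test") as f:
    for line in f:
        key = line.split()[3]  # fourth space-separated field, e.g. "ABC"
        if key not in seen:
            seen.add(key)
            print(line, end="")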