This is the data:
row1| sbkjd nsdnak ABC
row2| vknfe edcmmi ABC
row3| fjnfn msmsle XYZ
row4| sdkmm tuiepd XYZ
row5| adjck rulsdl LMN
I have already tried this using pandas and got help on Stack Overflow, but I want to be able to remove the duplicates without using pandas or any library at all. Only one of the rows containing "ABC" should be chosen, only one of the rows containing "XYZ" should be chosen, and the last row is unique, so it should be kept. How do I do this?
So, my final output should contain this:
[ row1 or row2 + row3 or row4 + row5 ]
Answer 0 (score: 0)
This selects only the unique rows from the original table. If two or more rows share duplicate data, it picks the first one.
data = [["sbkjd", "nsdnak", "ABC"],
        ["vknfe", "edcmmi", "ABC"],
        ["fjnfn", "msmsle", "XYZ"],
        ["sdkmm", "tuiepd", "XYZ"],
        ["adjck", "rulsdl", "LMN"]]

def check_list_uniqueness(candidate_row, unique_rows):
    # Reject the candidate if any of its elements already
    # appears in a row that has been kept.
    for element in candidate_row:
        for unique_row in unique_rows:
            if element in unique_row:
                return False
    return True

final_rows = []
for row in data:
    if check_list_uniqueness(row, final_rows):
        final_rows.append(row)

print(final_rows)
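If the third column is the only thing that decides whether two rows are duplicates (an assumption, but it matches the question's "ABC"/"XYZ" description), a shorter sketch keeps the first row per key with a set:

```python
data = [["sbkjd", "nsdnak", "ABC"],
        ["vknfe", "edcmmi", "ABC"],
        ["fjnfn", "msmsle", "XYZ"],
        ["sdkmm", "tuiepd", "XYZ"],
        ["adjck", "rulsdl", "LMN"]]

seen = set()       # third-column values already kept
final_rows = []
for row in data:
    key = row[-1]  # assume the last field is the duplicate key
    if key not in seen:
        seen.add(key)
        final_rows.append(row)

print(final_rows)
```

Unlike the function above, this only compares the key column, so it still works if two distinct rows happen to share a value in one of the other columns.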
Answer 1 (score: 0)
This Bash command does it (assuming your data is in a file named test, and that the column-4 values do not appear in any other column):
cut -d ' ' -f 4 test | tr '\n' ' ' | sed 's/\([a-zA-Z][a-zA-Z]*[ ]\)\1/\1/g' | tr ' ' '\n' | while read str; do grep -m 1 $str test; done
cut -d ' ' -f 4 test
selects the data in the fourth column
tr '\n' ' '
joins the column into a single line (converts newlines to spaces)
sed 's/\([a-zA-Z][a-zA-Z]*[ ]\)\1/\1/g'
removes adjacent duplicate words
tr ' ' '\n'
turns the line of unique values back into a column
while read str; do grep -m 1 $str test; done
reads each unique word and prints the first line of test that matches it
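The whole pipeline can also be sketched as a single awk invocation that prints a line only the first time its key field is seen (assuming, as with the cut command above, that the key is the fourth whitespace-separated field):

```shell
# sample data in a file named "test", as in the answer above
printf '%s\n' \
  'row1| sbkjd nsdnak ABC' \
  'row2| vknfe edcmmi ABC' \
  'row3| fjnfn msmsle XYZ' \
  'row4| sdkmm tuiepd XYZ' \
  'row5| adjck rulsdl LMN' > test

# seen[$4]++ is 0 (false) only on the first occurrence of each
# fourth-field value, so the line is printed exactly once per key
awk '!seen[$4]++' test
```

This avoids the sed regex, which only collapses pairs of adjacent repeats and so relies on the duplicate keys being next to each other.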