Finding duplicates in a CSV, and uniqueness within those duplicates

Time: 2019-12-03 19:41:23

Tags: python linux bash csv awk

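I need to create a script that loads a CSV (sometimes labeled .inf) into memory and evaluates the data for a particular type of duplicate. The CSV itself will always have different information in each field, but the columns will be the same. There are about 100 columns or so. In my example I've narrowed it down to 10 columns for readability.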

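The "type" of duplicate I'm looking for is a bit odd. I first need to find all the duplicates in column 2. Then I need to look at that group of duplicates and check column 8 (in my actual CSV it will be column 84). Looking at column 8, I only need to output data that is:

A. Duplicated in column 2

B. Unique in column 8

Column 2 might have only 2 duplicates whose column 8 values are identical. I don't need to see those. If column 2 has 3 duplicates and, of their column 8 values, 2 are identical and 1 is unique, I need to see all 3 FULL rows.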

Desired input
m,123veh,john;doe,10/1/2019,ryzen,split,32929,38757ace,turn,left
m,123veh,john;doe,10/1/2019,ryzen,split,32929,495842,turn,left
m,837iec,john;doe,10/1/2019,ryzen,split,32929,12345,turn,left
m,837iec,john;doe,10/1/2019,ryzen,split,32929,12345,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,4978d87,turn,left

This data will change constantly, and even the number of characters in column 8 may vary.

Desired output
m,123veh,john;doe,10/1/2019,ryzen,split,32929,38757ace,turn,left
m,123veh,john;doe,10/1/2019,ryzen,split,32929,495842,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,4978d87,turn,left 

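As you can see from the desired output, I don't need to see the rows with 837iec, because although their column 2 is duplicated, their column 8 values match each other. For something like 382ork, 2 of the column 8 values match and one is unique, so I need to see all 3.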

I will be using this on a Unix system, and I'd like to run it by typing "./scriptname filename.csv". The output can go to standard output or, if needed, to a log file.

I haven't been able to find a way to do this, because the need to compare column 8 is what confuses me. Any help would be greatly appreciated.

I found this in another thread, and it at least gets me the full rows of the column 2 duplicates, though I don't fully understand how it works.

#!/usr/bin/awk -f
{
    lines[$1][NR] = $0;
}
END {
    for (vehid in lines) {
        if (length(lines[vehid]) > 1) {
            for (lineno in lines[vehid]) {
                # Print duplicate line for decision purposes
                print lines[vehid][lineno];
                # Alternative: print line number and line
                #print lineno, lines[vehid][lineno];
            }
        }
    }
}

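The problem I'm facing is that it doesn't take the next column into account. It also doesn't handle blank columns well. My CSV will have 100 columns, 50 of which may be completely blank.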

3 answers:

Answer 0: (score: 4)

Could you please try the following.

awk '
BEGIN{
  FS=","
}
FNR==NR{
  a[$2]++
  b[$2,$8]++
  c[$2]=(c[$2]?c[$2] ORS:"")$0
  next
}
a[$2]>1 && b[$2,$8]==1{
  print c[$2]
  delete a[$2]
}' <(sort -t',' -k2 Input_file) <(sort -t',' -k2 Input_file)

The output for the sample shown is as follows.

m,123veh,john;doe,10/1/2019,ryzen,split,32929,38757ace,turn,left
m,123veh,john;doe,10/1/2019,ryzen,split,32929,495842,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,4978d87,turn,left

Explanation: a detailed explanation of the code above.

awk '                                                     ##Starting awk program from here.
BEGIN{                                                    ##Starting BEGIN section from here.
  FS=","                                                  ##Setting FS as comma here.
}                                                         ##Closing BEGIN section here.
FNR==NR{                                                  ##Checking condition FNR==NR which will be TRUE when first time Input_file is being read.
  a[$2]++                                                 ##Creating an array named a whose index is $2 and increment its value with 1 each time it comes here.
  b[$2,$8]++                                              ##Creating an array named b whose index is $2,$8 and increment its value with 1 each time it comes here.
  c[$2]=(c[$2]?c[$2] ORS:"")$0                            ##Appending the whole current line to array c under index $2, so c[$2] holds every line sharing this $2, separated by ORS.
  next                                                    ##next will skip all further statements from here.
}                                                         ##Closing BLOCK for FNR==NR condition here.
a[$2]>1 && b[$2,$8]==1{                                   ##Checking condition if array a with index $2 value is greater than 1 AND array b with index $2,$8 value is 1.
  print c[$2]                                             ##Then print array c value with $2 here.
  delete a[$2]                                            ##Deleting array a value with $2 here which will make sure NO DUPLICATE lines are getting printed.
}' <(sort -t',' -k2 Input_file) <(sort -t',' -k2 Input_file)  ##Passing Input_file twice, sorted from the 2nd field onward, so equal values come together; the first copy is read for counting, the second for printing.
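For readers more comfortable with Python, the counting idea explained above can be sketched roughly as follows. This is only an illustration under assumptions: the function name `rows_with_unique_col8` and the `SAMPLE` constant are made up for this sketch, and the three counters stand in for the awk arrays `a`, `b` and `c`. A group is kept when its column 2 value occurs more than once and at least one of its column 8 values occurs exactly once within that group.

```python
import csv
import io
from collections import Counter, defaultdict

# Sample rows taken from the question (10 columns, comma separated).
SAMPLE = """\
m,123veh,john;doe,10/1/2019,ryzen,split,32929,38757ace,turn,left
m,123veh,john;doe,10/1/2019,ryzen,split,32929,495842,turn,left
m,837iec,john;doe,10/1/2019,ryzen,split,32929,12345,turn,left
m,837iec,john;doe,10/1/2019,ryzen,split,32929,12345,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,4978d87,turn,left
"""


def rows_with_unique_col8(lines, key_col=1, val_col=7):
    """Hypothetical helper: return the full rows of each key group that has
    more than one row and at least one val_col value seen exactly once."""
    rows = [r for r in csv.reader(lines) if r]            # skip blank lines
    key_count = Counter(r[key_col] for r in rows)          # plays the role of a[$2]
    pair_count = Counter((r[key_col], r[val_col]) for r in rows)  # b[$2,$8]
    groups = defaultdict(list)                              # c[$2], kept as lists
    for r in rows:
        groups[r[key_col]].append(r)

    out = []
    for key, members in groups.items():
        has_unique = any(pair_count[key, r[val_col]] == 1 for r in members)
        if key_count[key] > 1 and has_unique:
            out.extend(members)
    return out


if __name__ == "__main__":
    for row in rows_with_unique_col8(io.StringIO(SAMPLE)):
        print(",".join(row))
```

Because the rows are grouped in a dictionary, no sorting is needed here, and groups come out in the order their column 2 value first appears.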

Answer 1: (score: 2)

This can be solved with Python (here I use _id and qty for the two captured fields):

import csv
from collections import defaultdict

f = open('f1.txt', 'r')
d = defaultdict(lambda: defaultdict(list))

csv_reader = csv.reader(f)

for row in csv_reader:
    _id = row[1]
    qty = row[7]
    d[_id][qty].append(row)

f.close()

for _id in d:
    for qty in d[_id]:
        # if there are more than 1 'qty'
        # OR there is only 1 'qty' and there is only 1 line in the values
        # for the array (row) (allows a record with only 1 line)
        if len(d[_id]) > 1 or len(d[_id][qty]) == 1:
            for row in d[_id][qty]:
                print(','.join(row))

Prints:

m,123veh,john;doe,10/1/2019,ryzen,split,32929,38757ace,turn,left
m,123veh,john;doe,10/1/2019,ryzen,split,32929,495842,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,4978d87,turn,left
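Since the question asks for something that can be run as "./scriptname filename.csv", the same nested-dictionary idea might be wrapped into a small script along these lines. Treat it as a sketch under assumptions: `select_rows` and `main` are illustrative names, and the rule used here (keep a column 2 group only when it contains at least two distinct column 8 values) is one reading of the requirement. Unlike the condition in the code above, it does not print a column 2 value that appears on only one line.

```python
#!/usr/bin/env python3
import csv
import sys
from collections import defaultdict


def select_rows(rows, id_col=1, qty_col=7):
    """Hypothetical helper: keep every row whose id_col group contains
    at least two distinct qty_col values."""
    d = defaultdict(lambda: defaultdict(list))
    for row in rows:
        # Skip blank or short lines instead of raising IndexError.
        if len(row) > max(id_col, qty_col):
            d[row[id_col]][row[qty_col]].append(row)

    selected = []
    for by_qty in d.values():
        # Two or more distinct qty values implies two or more rows.
        if len(by_qty) > 1:
            for group in by_qty.values():
                selected.extend(group)
    return selected


def main(path):
    # newline="" is what the csv module documentation recommends for files.
    with open(path, newline="") as f:
        writer = csv.writer(sys.stdout, lineterminator="\n")
        writer.writerows(select_rows(csv.reader(f)))


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("usage: ./scriptname filename.csv", file=sys.stderr)
```

For the real file with column 84, the call would presumably become `select_rows(reader, id_col=1, qty_col=83)`. Rows within a group are emitted grouped by their column 8 value, which may differ from the original line order. Empty fields are ordinary empty strings to the csv module, so blank columns need no special handling.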

Answer 2: (score: 1)

If you can use pandas, you can do it like this:

import pandas as pd
e = pd.read_csv('out16.txt', header=None)
e.columns = list(range(1,11))
e.drop_duplicates(subset=[2,8]).set_index(1).to_csv('out_test.txt', header=False) 
e = e.drop_duplicates(subset=[2,8]).set_index(1)
e

Output:

        2         3          4      5      6      7         8     9    10
1
m  123veh  john;doe  10/1/2019  ryzen  split  32929  38757ace  turn  left
m  123veh  john;doe  10/1/2019  ryzen  split  32929    495842  turn  left
m  837iec  john;doe  10/1/2019  ryzen  split  32929     12345  turn  left
m  382ork  john;doe  10/1/2019  ryzen  split  32929     38757  turn  left
m  382ork  john;doe  10/1/2019  ryzen  split  32929   4978d87  turn  left
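Note that the output above still contains an 837iec row and only two of the three 382ork rows, because drop_duplicates only removes repeated (2, 8) pairs. If the goal is the desired output from the question, one possible pandas sketch is to count the distinct column 8 values per column 2 group and keep the groups with more than one. The helper name `filter_groups` and the inline `SAMPLE` are assumptions made for this illustration.

```python
import io

import pandas as pd

# Sample rows taken from the question.
SAMPLE = """\
m,123veh,john;doe,10/1/2019,ryzen,split,32929,38757ace,turn,left
m,123veh,john;doe,10/1/2019,ryzen,split,32929,495842,turn,left
m,837iec,john;doe,10/1/2019,ryzen,split,32929,12345,turn,left
m,837iec,john;doe,10/1/2019,ryzen,split,32929,12345,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,4978d87,turn,left
"""


def filter_groups(df, key=2, val=8):
    """Hypothetical helper: keep all rows of each `key` group that has
    more than one distinct value in column `val`."""
    distinct = df.groupby(key)[val].transform("nunique")
    return df[distinct > 1]


# dtype=str keeps values such as 12345 as text; keep_default_na=False keeps
# blank fields as empty strings instead of NaN.
e = pd.read_csv(io.StringIO(SAMPLE), header=None, dtype=str,
                keep_default_na=False)
e.columns = list(range(1, 11))  # 1-based column labels, as in the answer

result = filter_groups(e)
print(result.to_csv(header=False, index=False), end="")
```

Rows keep their original order, since the filter is a boolean mask over the frame.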