How do I remove the rows that two CSVs do not have in common?

Time: 2018-05-26 20:33:29

Tags: python csv

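I have 2 CSVs imported from an Android device (accelerometer and gyroscope). The problem is that the accelerometer CSV contains some extra timestamps (which is weird in itself) that are not in the gyroscope CSV. Is there a way to find and remove these uncommon rows (timestamps)?

Reason for this question: I am doing activity recognition with sensor data from an Android device, so it is very important that the gyroscope and accelerometer timestamps correspond one-to-one.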

1 answer:

Answer 0 (score: 2)

Create demo data:

# bad data: the rows with "weird" timestamps are the ones to remove

data = """
ts1,d001,d002,d003
ts2,d001,d002,d003
ts3,d001,d002,d003
weird1,d001,d002,d003
weird2,d001,d002,d003
ts4,d001,d002,d003 
"""

# the good data

other = """
ts1,f001,f002,f003
ts2,f001,f002,f003
ts3,f001,f002,f003
ts4,f001,f002,f003 
"""

# create demo files
fn1 = "d1.csv"
fn2 = "d2.csv"
with open(fn1,"w") as f:
    f.write(data)
with open(fn2,"w") as f:
    f.write(other)

Now parse the files:

import csv

def readFile(name):
    """Return a list of dicts (one per row) for data with 4 columns."""
    result = []
    with open(name,"r") as f:
        k = csv.DictReader(f,fieldnames=["ts","dp1","dp2","dp3"])
        for l in k:
            result.append(l)
    return result

badData = readFile(fn1)
goodData = readFile(fn2)

print(badData)
print(goodData)

Output (the key order depends on the Python version):

 # weird data
 [{'dp3': 'd003', 'ts': 'ts1', 'dp1': 'd001', 'dp2': 'd002'}, 
  {'dp3': 'd003', 'ts': 'ts2', 'dp1': 'd001', 'dp2': 'd002'}, 
  {'dp3': 'd003', 'ts': 'ts3', 'dp1': 'd001', 'dp2': 'd002'}, 
  {'dp3': 'd003', 'ts': 'weird1', 'dp1': 'd001', 'dp2': 'd002'}, 
  {'dp3': 'd003', 'ts': 'weird2', 'dp1': 'd001', 'dp2': 'd002'}, 
  {'dp3': 'd003 ', 'ts': 'ts4', 'dp1': 'd001', 'dp2': 'd002'}]

# good data
[{'dp3': 'f003', 'ts': 'ts1', 'dp1': 'f001', 'dp2': 'f002'}, 
 {'dp3': 'f003', 'ts': 'ts2', 'dp1': 'f001', 'dp2': 'f002'}, 
 {'dp3': 'f003', 'ts': 'ts3', 'dp1': 'f001', 'dp2': 'f002'}, 
 {'dp3': 'f003 ', 'ts': 'ts4', 'dp1': 'f001', 'dp2': 'f002'}]

Now eliminate the bad data points:

# get all the "good" ts 
goodTs = set( oneDict["ts"] for oneDict in goodData)

# clean the bad data, only keep those "ts" that are in goodTs
cleanedData = [x for x in badData if x["ts"] in goodTs]

print(cleanedData)

Output:

# filtered weird data
[{'dp3': 'd003', 'ts': 'ts1', 'dp1': 'd001', 'dp2': 'd002'}, 
 {'dp3': 'd003', 'ts': 'ts2', 'dp1': 'd001', 'dp2': 'd002'}, 
 {'dp3': 'd003', 'ts': 'ts3', 'dp1': 'd001', 'dp2': 'd002'}, 
 {'dp3': 'd003 ', 'ts': 'ts4', 'dp1': 'd001', 'dp2': 'd002'}]
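
The filter above only cleans one file against the other. If both files might contain timestamps the other lacks, a symmetric variant keeps only the timestamps present in both. The following is a minimal sketch under assumptions: the sample rows, the `read_rows` helper and the `FIELDS` column names are made up for illustration, and the data is parsed from in-memory strings instead of files.

```python
import csv
import io

# Assumed column layout, same as in the demo: timestamp plus three data points
FIELDS = ["ts", "dp1", "dp2", "dp3"]

def read_rows(text):
    """Parse CSV text into a list of dicts; strip whitespace around the timestamp."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), fieldnames=FIELDS):
        row["ts"] = row["ts"].strip()
        rows.append(row)
    return rows

# Hypothetical sample data: each side has one timestamp the other lacks
accel_text = "ts1,a1,a2,a3\nts2,a1,a2,a3\nweird1,a1,a2,a3\nts3,a1,a2,a3\n"
gyro_text = "ts1,g1,g2,g3\nts2,g1,g2,g3\nts3,g1,g2,g3\nts9,g1,g2,g3\n"

accel = read_rows(accel_text)
gyro = read_rows(gyro_text)

# Timestamps that occur in both files
common = {r["ts"] for r in accel} & {r["ts"] for r in gyro}

# Keep only the common rows; the original row order is preserved
accel_clean = [r for r in accel if r["ts"] in common]
gyro_clean = [r for r in gyro if r["ts"] in common]

print([r["ts"] for r in accel_clean])
print([r["ts"] for r in gyro_clean])
```

After this step both lists contain the same timestamps in the same relative order, provided each timestamp occurs at most once per file.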

Done.
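
Since the question asks about removing rows from the CSV itself, the cleaned rows probably need to be written back to disk. One way to do that is `csv.DictWriter`. This is a sketch, not part of the original answer: the output file name `d1_cleaned.csv`, the temporary directory and the sample `cleaned` rows are assumptions chosen for illustration.

```python
import csv
import os
import tempfile

# Assumed column layout, same as in the demo
FIELDS = ["ts", "dp1", "dp2", "dp3"]

# Hypothetical cleaned rows, shaped like the filtered list of dicts above
cleaned = [
    {"ts": "ts1", "dp1": "d001", "dp2": "d002", "dp3": "d003"},
    {"ts": "ts2", "dp1": "d001", "dp2": "d002", "dp3": "d003"},
]

# Hypothetical output path inside a fresh temporary directory
out_path = os.path.join(tempfile.mkdtemp(), "d1_cleaned.csv")

# newline="" lets the csv module control line endings itself
with open(out_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writerows(cleaned)  # no header row, matching the input files

# Read the file back to check what was written
with open(out_path, newline="") as f:
    written = list(csv.reader(f))

print(written)
```

Writing to a new file instead of overwriting the original keeps the raw sensor data available in case the filtering has to be redone.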