Parsing data that looks like CSV but isn't?

Asked: 2016-11-18 16:43:21

Tags: python pandas

I'm trying to parse a file that looks like a CSV file but isn't. It is comma-separated, but every comma is followed by a space. There is also no header row, and the rows have different lengths.

Here is an example; if I open the file as a .txt I get something like this:

FUD, speed, time, heading, offsets
MUD, speed, time, heading, offsets, error
CLA, head, time, speed, offset, error, errorfix
MUD, speed, time, heading, offsets, error
MUD, speed, time, heading, offsets, error
FUD, speed, time, heading, offsets
CLA, head, time, speed, offset, error, errorfix
CLA, head, time, speed, offset, error, errorfix
(Note: head, time, offset, and everything after the first column are actually values.)

This is what I tried:

import pandas as pd

df = pd.read_csv('data.csv', header=None)
MUD = df[df[0] == 'MUD'].values.tolist()

However, I got this error:

CParserError: Error tokenizing data. C error: Expected 10 fields in line 3, saw 18

When I googled the error, it was suggested that I use

error_bad_lines=False

However, this gives me an error:

expected 10 fields, saw 15.

I'm trying to build a pandas list of every instance of MUD I see, so that later on I can do something like this:

newMUD = MUD[4]/100

Eventually I would have something like this:

print (MUD)
MUD, 12, 1, 5, 1, 1
MUD, 13, 2, 3, 2, 0
MUD, 12, 3, 5, -2, 0
MUD, 4, 4, 3, -3, 1

A sample of my data:

NKF1, 447526092, -3.08, 0.01, 175.83, -0.02133949, 0.03264881, -0.06251871, 0, -28.93325, 26.49632, -0.1290034, 0.07, -0.02, 0.14
NKF2, 447526092, -26, 0.00, 0.00, 0.00, 0.00, 0.00, 255, 55, 341, 0, 0, 0, 0
NKF3, 447526092, -0.01, 0.06, 0.12, -0.04, -0.08, -0.03, 0, 0, 0, -0.73, 0.00
NKF4, 447526092, 0.03, 0.01, 0.00, 0.00, 0.00, 0.0002261061, 0, 0, 0, 16, 9023, 0, 1
NKF5, 447526092, 0, 0, 0, 0, 1.14, 0.88, 0.00, 0.00, 0.50, 0.003602755, 0.01431285, 0.02802294
NKF6, 447526092, -2.66, -0.98, 187.53, -0.06789517, -0.2714562, -0.1189714, 0, -28.96132, 26.25431, -0.2784806, 0.00, 0.36, -0.49
NKF7, 447526092, 21, 0.00, 0.00, 0.00, 0.00, 0.00, 258, 55, 338, 0, 0, 0, 0
NKF8, 447526092, -0.04, -0.20, 0.07, -0.04, -0.23, -0.17, 0, 0, 0, 10.83, 0.00
NKF9, 447526092, 0.04, 0.03, 0.01, 0.12, 0.00, 0.000866859, 0, 0, 0, 16, 9023, 0, 1
AHR2, 447526241, -3.12, -0.42, 176.43, 418.84, 34.3167522, -118.4068499
POS, 447526306, 34.3167515, -118.406853, 419.03, 0.2784806
IMU, 447545009, -0.09418038, 0.1740572, -0.05483108, 0.6083156, 0.2225795, -9.380787, 0, 0, 52.99446, 1, 1
IMU2, 447545009, -0.09127176, 0.1908958, -0.06220703, 0.524766, 0.3107446, -8.754621, 0, 0, 56.125, 1, 1
SONR, 447545584, 0, 0, 0, 0
RFND, 447545593, 0.00, 0.00
IMU, 447565482, -0.08753563, 0.1228692, -0.04508965, 0.6137247, -0.01505011, -9.579732, 0, 0, 53.0831, 1, 1
IMU2, 447565482, -0.08944235, 0.139776, -0.05096832, 0.4677677, 0.03778861, -9.214079, 0, 0, 55.875, 1, 1
GPS, 447565911, 4, 246769200, 1920, 14, 0.70, 34.3167523, -118.4068497, 418.91, 0.05656854, 135, -0.16, 1
GPA, 447565911, 1.11, 0.73, 1.04, 0.29, 1, 447565
SONR, 447566084, 0, 0, 0, 0
RFND, 447566093, 0.00, 0.00
ATT, 447566114, 0.00, -2.88, 0.00, -0.62, 0.00, 187.41, 0.02, 0.01
PIDR, 447566125, 0, 0, 0, 0, 0, 0
PIDP, 447566135, 0, 0, 0, 0, 0, 0
PIDY, 447566145, 0, 0, 0, 0, 0, 0
PIDS, 447566155, 0, 0, 0, 0, 0, 0
NKF1, 447566164, -3.30, 0.35, 175.70, -0.02778457, 0.03493549, -0.04115778, 0, -28.9337, 26.49665, -0.1338468, 0.07, -0.02, 0.14
NKF2, 447566164, -26, 0.00, 0.00, 0.00, 0.00, 0.00, 255, 55, 341, 0, 0, 0, 0
NKF3, 447566164, -0.01, 0.06, 0.12, -0.04, -0.08, -0.11, 0, 0, 0, -0.73, 0.00
NKF4, 447566164, 0.03, 0.01, 0.00, 0.00, 0.00, 0.0002256641, 0, 0, 0, 16, 9023, 0, 1
NKF5, 447566164, 0, 0, 0, 0, 1.14, 0.88, 0.00, 0.00, 0.50, 0.003267812, 0.01763795, 0.02970827
NKF6, 447566164, -2.88, -0.62, 187.40, -0.07544779, -0.2697962, -0.09678251, 0, -28.96231, 26.2515, -0.2831134, 0.00, 0.36, -0.49
NKF7, 447566164, 21, 0.00, 0.00, 0.00, 0.00, 0.00, 258, 55, 338, 0, 0, 0, 0
NKF8, 447566164, -0.04, -0.20, 0.07, -0.04, -0.23, -0.25, 0, 0, 0, 10.83, 0.00
NKF9, 447566164, 0.04, 0.03, 0.01, 0.12, 0.00, 0.00086712, 0, 0, 0, 16, 9023, 0, 1
AHR2, 447566373, -3.34, -0.07, 176.32, 418.84, 34.3167522, -118.4068497
POS, 447566396, 34.3167515, -118.406853, 419.04, 0.2831134
IMU, 447587271, -0.08603665, 0.071096, -0.03380377, 0.5931511, -0.07432687, -9.615693, 0, 0, 53.0831, 1, 1
IMU2, 447587271, -0.08848803, 0.09229023, -0.04071644, 0.4688947, 0.01987415, -9.166938, 0, 0, 56.125, 1, 1
MAG, 447587700, -265, -77, 332, -115, 0, 1, 0, 0, 0, 1, 447587691
MAG2, 447587700, -273, -29, 372, 77, -135, 38, 0, 0, 0, 1, 447587693
ARSP, 447587748, 2.969838, 4.424126, 38.22, -4.424126, 110.8502, 1
BARO, 447587789, -0.09136668, 97036.14, 55.03, -0.8952343, 447587, 0
CURR, 447587949, 16.91083, 0.6012492, 60.22538

2 Answers:

Answer 0 (score: 0)

Using pandas makes sense if you really want to do calculations on the columns (which I couldn't tell from the question). In that case it is enough to pass the expected column names, so the parser isn't surprised by the varying number of columns:

import pandas as pd

# Creates a list ["note", "head", ...]
columns = "note head time speed offset error errorfix".split()

df = pd.read_csv(filename, names=columns)  # filename is the path to your file

MUD = df.query("note == 'MUD'")

MUD["speed"] / 4

Answer 1 (score: 0)

You can filter the rows while building the DataFrame with from_records. Here I use the csv module to produce the rows and discard the ones that aren't needed:

import pandas as pd
import csv

def data_reader(filename, rowname):
    with open(filename, newline='') as fp:
        # skipinitialspace handles the space after each comma; row[1:] drops the name column
        yield from (row[1:] for row in csv.reader(fp, skipinitialspace=True)
                    if row[0] == rowname)

df = pd.DataFrame.from_records(data_reader('testfile', 'MUD'))
print(df)

This is a bit risky: if the non-MUD lines don't follow standard CSV rules, the reader may raise an error. Here is a more involved version that restricts the csv parser to the MUD lines:

import pandas as pd
import csv

def mud_reader(filename, rowname):
    rowname = rowname + ", "
    with open(filename, newline='') as fp:
        # only lines starting with e.g. "MUD, " reach the csv parser,
        # and the name prefix is sliced off the raw line before parsing
        yield from csv.reader(
            (line[len(rowname):] for line in fp if line.startswith(rowname)),
            skipinitialspace=True)

df = pd.DataFrame.from_records(mud_reader('testfile', 'MUD'))
print(df)
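
Either way, from_records receives plain strings from the csv module, so the columns come back labelled 0, 1, 2, ... and hold text. A small sketch of turning them into numbers so the division from the question works (assuming the first value column is the one you want to scale):

import pandas as pd

df = pd.DataFrame.from_records(mud_reader('testfile', 'MUD'))

# csv.reader yields strings, so convert every column to numbers first
df = df.apply(pd.to_numeric, errors='coerce')

# column 0 is the first value after the MUD tag (speed in the question's example)
newMUD = df[0] / 100
print(newMUD)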