Here we go again.
Hi, I'm trying to detect errors in a CSV file.
The file should look like this
goodfile.csv
"COL_A","COL_B","COL_C","COL_D"
"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
"ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
"ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
"ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
"ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
"ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
"ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"
But my file actually looks like this
brokenfile.csv
"COL_A","COL_B",COL C,"COL_D"
"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
"ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
"ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
"ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
"ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
"ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
"ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"
When I import the two files with pandas
data = pd.read_csv('goodfile.csv')
data = pd.read_csv('brokenfile.csv')
I get the same result
data
COL_A COL_B COL_C COL_D
0 ROW1COLA ROW1COLB ROW1COLC ROW1COLD
1 ROW2COLA ROW2COLB ROW2COLC ROW2COLD
2 ROW3COLA ROW3COLB ROW3COLC ROW3COLD
3 ROW4COLA ROW4COLB ROW4COLC ROW4COLD
4 ROW5COLA ROW5COLB ROW5COLC ROW5COLD
5 ROW6COLA ROW6COLB ROW6COLC ROW6COLD
6 ROW7COLA ROW7COLB ROW7COLC ROW7COLD
Anyway, what I want is to detect the error in the second file, "brokenfile.csv": the COL_C header is currently missing its "" quotes.
Answer 0: (score: 1)
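I think you can detect the missing " in the column names of the DataFrame by reading the file with quoting=3, so the quote characters are kept, and then using str.contains, inverting the resulting boolean array with ~: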
import pandas as pd
import io
temp=u'''"COL_A","COL_B",COL C,"COL_D"
"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
"ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
"ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
"ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
"ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
"ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
"ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"'''
# after testing, replace io.StringIO(temp) with the filename
# quoting=3 is csv.QUOTE_NONE: quote characters are kept as part of the data
df = pd.read_csv(io.StringIO(temp), quoting=3)
print(df)
"COL_A" "COL_B" COL C "COL_D"
0 "ROW1COLA" "ROW1COLB" "ROW1COLC" "ROW1COLD"
1 "ROW2COLA" "ROW2COLB" "ROW2COLC" "ROW2COLD"
2 "ROW3COLA" "ROW3COLB" "ROW3COLC" "ROW3COLD"
3 "ROW4COLA" "ROW4COLB" "ROW4COLC" "ROW4COLD"
4 "ROW5COLA" "ROW5COLB" "ROW5COLC" "ROW5COLD"
5 "ROW6COLA" "ROW6COLB" "ROW6COLC" "ROW6COLD"
6 "ROW7COLA" "ROW7COLB" "ROW7COLC" "ROW7COLD"
print(df.columns)
Index([u'"COL_A"', u'"COL_B"', u'COL C', u'"COL_D"'], dtype='object')
print(df.columns.str.contains('"'))
[ True True False True]
print(~df.columns.str.contains('"'))
[False False True False]
print(df.columns[~df.columns.str.contains('"')])
Index([u'COL C'], dtype='object')
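The same check can be wrapped in a small helper. This is only a sketch: the name `unquoted_headers` is hypothetical, and it assumes every header in a well-formed file is wrapped in double quotes. It tests that each label both starts and ends with a quote, which is slightly stricter than the str.contains check above.

```python
import io

import pandas as pd


def unquoted_headers(source):
    """Return the header names that are not wrapped in double quotes.

    `unquoted_headers` is a hypothetical helper name; `source` is anything
    pd.read_csv accepts (a path or a file-like object).
    """
    # quoting=3 (csv.QUOTE_NONE) keeps the quote characters in the labels;
    # nrows=0 reads only the header row.
    columns = pd.read_csv(source, quoting=3, nrows=0).columns
    quoted = columns.str.startswith('"') & columns.str.endswith('"')
    return list(columns[~quoted])


sample = io.StringIO('"COL_A","COL_B",COL C,"COL_D"\n"a","b","c","d"\n')
print(unquoted_headers(sample))
```

An empty list would mean every header is quoted; anything else names the offending columns.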
Answer 1: (score: 0)
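Pandas tries to be smart about identifying data types when it reads data. That is exactly what happens in the case you describe: COL_C and "COL_C" are both parsed as plain strings.
In short, there is no error detection! At least, pandas does not raise an error in this case.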
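A quick way to see this, as a sketch with made-up two-column data: with the default settings the parser strips the surrounding quotes, so a quoted and an unquoted header produce identical column labels and no exception is raised.

```python
import io

import pandas as pd

# Illustrative data only: the same table with and without a quoted header.
quoted = '"COL_A","COL_B"\n"x","y"\n'
unquoted = 'COL_A,"COL_B"\n"x","y"\n'

df_quoted = pd.read_csv(io.StringIO(quoted))
df_unquoted = pd.read_csv(io.StringIO(unquoted))

# Default quoting removes the surrounding quotes, so both headers match.
same_headers = list(df_quoted.columns) == list(df_unquoted.columns)
print(same_headers)
```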
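What you can do, if you want to detect missing quotes in the header, is read the first line in a more traditional, Pythonic way and draw your own conclusions from there: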
>>> with open('filename') as f:
...     lines = f.readlines()
....
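Building on that idea, here is one possible way to inspect the header line by hand. It is a sketch, not a full CSV parser: `check_header_quotes` is a hypothetical name, and it assumes the fields are comma-separated and that no header contains an embedded comma.

```python
def check_header_quotes(header_line):
    """Return the header fields that are not wrapped in double quotes.

    Hypothetical helper; assumes a plain comma separator and no commas
    inside the header names themselves.
    """
    fields = header_line.rstrip('\r\n').split(',')
    return [
        field for field in fields
        if not (len(field) >= 2 and field.startswith('"') and field.endswith('"'))
    ]


# With a real file you would pass the first line, e.g.:
#     with open('brokenfile.csv') as f:
#         bad = check_header_quotes(f.readline())
print(check_header_quotes('"COL_A","COL_B",COL C,"COL_D"\n'))
```

Reading only the first line with readline() avoids loading the whole file when just the header needs checking.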