Reading CSV data with missing values into Python using pandas

Time: 2014-12-01 13:00:02

Tags: python csv pandas missing-data

I have a CSV file that looks like this:

"row ID","label","val"
"Row0","5",6
"Row1","",6
"Row2","",6
"Row3","5",7
"Row4","5",8
"Row5",,9
"Row6","nan",
"Row7","nan",
"Row8","nan",0
"Row9","nan",3
"Row10","nan",

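All quoted entries are strings. Unquoted entries are numeric. Empty fields are missing values (NaN), but quoted empty fields should still be treated as empty strings. I tried to read the file with pandas read_csv, but I cannot get it to behave the way I want: it treats both ,"", and ,, as NaN, which is wrong for the first one.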

d = pd.read_csv(csv_filename, sep=',', keep_default_na=False, na_values=[''], quoting = csv.QUOTE_NONNUMERIC)

Can anyone help? Is this possible at all?

3 answers:

Answer 0: (score: 1)

You could try numpy.genfromtxt and specify the missing_values parameter:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html

Answer 1: (score: 0)

Maybe something like this:

import pandas as pd
import csv
import numpy as np
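d = pd.read_csv('test.txt', sep=',', keep_default_na=False, na_values=[''], quoting=csv.QUOTE_NONNUMERIC)
# rows whose label is the literal string 'nan'
mask = d['label'] == 'nan'
# assign through .loc to avoid chained assignment on d.label[mask]
d.loc[mask, 'label'] = np.nan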
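
Note that the read_csv call above still cannot tell a quoted empty field ("") from a truly empty one. One possible workaround, offered only as a sketch, is to mark quoted empty fields with a sentinel before parsing and turn the sentinel back into an empty string afterwards. The helper name read_csv_keep_empty_strings and the SENTINEL value are hypothetical, and the regular expression assumes that a bare "" never appears right after a comma inside a longer quoted field.

```python
import io
import re

import pandas as pd

# Hypothetical marker; assumed never to occur in the real data.
SENTINEL = "__EMPTY_STRING__"

# A field that is exactly "" : at line start or after a comma,
# and followed by a comma or the end of the line.
QUOTED_EMPTY = re.compile(r'(^|,)""(?=,|\r?$)', re.MULTILINE)


def read_csv_keep_empty_strings(text, string_columns):
    # Step 1: replace every quoted empty field with a quoted sentinel.
    marked = QUOTED_EMPTY.sub(lambda m: f'{m.group(1)}"{SENTINEL}"', text)
    # Step 2: parse; only truly empty fields are read as NaN now.
    frame = pd.read_csv(
        io.StringIO(marked),
        keep_default_na=False,
        na_values=[""],
        dtype={name: str for name in string_columns},
    )
    # Step 3: turn the sentinel back into an empty string.
    frame[string_columns] = frame[string_columns].replace(SENTINEL, "")
    return frame


sample = '''"row ID","label","val"
"Row0","5",6
"Row1","",6
"Row5",,9
"Row6","nan",
'''

d = read_csv_keep_empty_strings(sample, ["row ID", "label"])
print(d)
```

With this sketch the label of Row1 stays an empty string, the label of Row5 becomes NaN, and the quoted "nan" of Row6 stays a string. For a file on disk, the text could be read with open(csv_filename).read() first.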

Answer 2: (score: 0)

I found a way to make it work, more or less. I just don't know why I need to specify dtype = type(None) to get it to work... Comments on this code are very welcome!

import re
import pandas as pd
import numpy as np

# clear quoting characters
def filterTheField(s):
    m = re.match(r'^"?(.*)?"$', s.strip())
    if m:
        return m.group(1)
    else:
        return np.nan

file = 'test.csv'

y = np.genfromtxt(file, delimiter = ',', filling_values = np.nan, names = True, dtype = type(None), converters = {'row_ID': filterTheField, 'label': filterTheField,'val': float})

d = pd.DataFrame(y)

print(d)
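
To see how the pattern in filterTheField separates quoted from unquoted fields, here is a small stand-alone sketch that uses the same regular expression on a few sample fields. It assumes the converter receives a str (depending on the NumPy version and the encoding argument, genfromtxt may pass bytes instead), and it uses math.nan in place of np.nan so that it runs without NumPy. The name unquote_or_nan is made up for this illustration.

```python
import math
import re

# Same pattern as in filterTheField above.
PATTERN = re.compile(r'^"?(.*)?"$')


def unquote_or_nan(field):
    """Return the text inside the quotes, or NaN if the pattern does not match."""
    m = PATTERN.match(field.strip())
    return m.group(1) if m else math.nan


samples = ['"Row0"', '""', '', '"nan"']
results = [unquote_or_nan(s) for s in samples]

for raw, parsed in zip(samples, results):
    print(repr(raw), '->', repr(parsed))
```

A quoted empty field comes back as an empty string, while a completely empty field does not match the pattern and becomes NaN, which is the distinction the question asks for.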