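I'm cleaning data and have a set of valid values for the columns that contain enumerated data. I want to split the dataset into rows that are A-OK and rows that contain invalid column data. The trick is that rows with invalid column data need a special column populated with a list of the column names that are invalid for that row.
For example, given the following table: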
     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
2  foo    two  2   4
3  bar  three  3   6
4  foo    two  4   8
5  bar    two  5  10
6  foo    one  6  12
7  foo  three  7  14
With the constraints that col A can be {'foo'} and col B can be {'one', 'two'}, the output DataFrames should look like this:
Valid rows:
     A    B  C   D
0  foo  one  0   0
2  foo  two  2   4
4  foo  two  4   8
6  foo  one  6  12
Invalid rows:
     A      B  C   D Exception
1  bar    one  1   2         A
3  bar  three  3   6      A, B
5  bar    two  5  10         A
As a pandas newbie, I took a crack at it as follows:
columnBounds = {'A' : {'foo'}, 'B':{'one', 'two'}}
df['exception'] = ''
for columnName, bounds in columnBounds.iteritems():
    idlist = df[~df.columnName.isin(bounds)].index.tolist()
    for ix in idlist:
        if df.loc[ix, 'exception'] == '':
            df.loc[ix, 'exception'] = str(ix)
        else:
            df.loc[ix, 'exception'] += ', {}'.format(str(ix))
baddf = df[df.exception.isin([''])]
gooddf = df[~df.exception.isin([''])]
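This code is buggy in many ways, but mainly the line:
idlist = df[~df.columnName.isin(bounds)].index.tolist()
fails, because 'columnName' cannot be used this way in the df[] context, which expects the literal column name. How do I fix this, and/or what is the "right" way to solve the original problem? There is also a problem with how the list is collected, since it isn't clear to me how to store and manipulate a list embedded in a pandas cell.
Thanks!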
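For reference, here is a minimal sketch of how the loop above might be repaired. It assumes Python 3 and a recent pandas; the snake_case names (column_bounds, valid_rows, invalid_rows) and the rebuilt sample DataFrame are illustrative only. The key change is df[column_name], which looks the column up by the variable's value, whereas df.columnName looks for a column literally named "columnName".

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": list(range(8)),
    "D": list(range(0, 16, 2)),
})

column_bounds = {"A": {"foo"}, "B": {"one", "two"}}

df["Exception"] = ""
for column_name, bounds in column_bounds.items():
    # Bracket indexing uses the value held in column_name.
    bad = ~df[column_name].isin(bounds)
    current = df.loc[bad, "Exception"]
    # Keep empty strings as they are; add a separator to non-empty ones.
    prefix = current.where(current == "", current + ", ")
    # Record the column NAME (the original loop recorded the row index).
    df.loc[bad, "Exception"] = prefix + column_name

invalid_rows = df[df["Exception"] != ""]
valid_rows = df[df["Exception"] == ""].drop(columns="Exception")
```

Note that row 7 (foo, three) also fails the rule for column B, so it ends up among the invalid rows.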
Answer 0 (score: 3)
isin accepts a dictionary, which simplifies the hard part considerably:
>>> good_dict = {"A": ["foo"], "B": ["one", "two"]}
>>> invalid = ~df[list(good_dict)].isin(good_dict)
>>> df["Exception"] = invalid.apply(lambda x: ','.join(invalid.columns[x]), axis=1)
>>> df
     A      B  C   D Exception
0  foo    one  0   0
1  bar    one  1   2         A
2  foo    two  2   4
3  bar  three  3   6       A,B
4  foo    two  4   8
5  bar    two  5  10         A
6  foo    one  6  12
7  foo  three  7  14         B
which can then be split easily:
>>> any_exception = invalid.any(axis=1)
>>> df[any_exception]
     A      B  C   D Exception
1  bar    one  1   2         A
3  bar  three  3   6       A,B
5  bar    two  5  10         A
7  foo  three  7  14         B
>>> df[~any_exception]
     A    B  C   D Exception
0  foo  one  0   0
2  foo  two  2   4
4  foo  two  4   8
6  foo  one  6  12
I like having an empty Exception column for the rows that pass, but we could avoid adding it if we needed to.
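One way to leave the passing rows without an Exception column might look like the sketch below. It reuses the same isin-with-a-dict idea; the names valid_rows and invalid_rows and the rebuilt sample DataFrame are illustrative assumptions, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": list(range(8)),
    "D": list(range(0, 16, 2)),
})
good_dict = {"A": ["foo"], "B": ["one", "two"]}

# True where a cell is NOT among the allowed values for its column.
invalid = ~df[list(good_dict)].isin(good_dict)
any_exception = invalid.any(axis=1)

# Passing rows keep only the original columns.
valid_rows = df[~any_exception]

# Only the failing rows receive the Exception column.
invalid_rows = df[any_exception].assign(
    Exception=invalid[any_exception].apply(
        lambda row: ",".join(invalid.columns[row.to_numpy()]), axis=1
    )
)
```

Because assign returns a new DataFrame, df itself is left unmodified here.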
Answer 1 (score: 2)
Here's how I'd do it. First the program, then the output, then the explanation.
from pandas import DataFrame
from itertools import compress

# define DataFrame
rows = [
    ["foo", "one", 0, 0],
    ["bar", "one", 1, 2],
    ["foo", "two", 2, 4],
    ["bar", "three", 3, 6],
    ["foo", "two", 4, 8],
    ["bar", "two", 5, 10],
    ["foo", "one", 6, 12],
    ["foo", "three", 7, 14],
]
df = DataFrame(data=rows, columns=list("ABCD"))
print("original DataFrame:")
print(df, "\n")

# define what values are permitted in each column
permitted = {
    'A': set(["foo"]),
    'B': set(["one", "two"]),
}

def check_validity(df, permitted):
    """
    Given a DataFrame and a dict of permitted values for
    each column, determine which cells are valid given
    those rules. Amend the DataFrame to note which rows have
    exceptions. Return a second DataFrame that indicates which
    cells were valid.
    """
    # first determine, for each column in the list of rules, what
    # cells are valid / invalid by that rule
    valid_cols = [colname for colname in df.columns if colname in permitted]
    valid = DataFrame(columns=valid_cols, index=df.index)
    for colname, permitted_values in permitted.items():
        valid[colname] = df[colname].isin(permitted_values)

    # add an Exception column that for each row, lists just the columns
    # that were found NOT to be valid
    # (.loc selects the row by label; the older .ix indexer no longer exists)
    df["Exception"] = [', '.join(compress(valid.columns, ~valid.loc[i])) for i in df.index]
    return valid

valid = check_validity(df, permitted)
print("exceptions noted:")
print(df, "\n")

valid_rows = valid["A"] & valid["B"]

# the good kids
print("valid data:")
print(df[valid_rows], "\n")

# the problem children
print("not valid:")
print(df[~valid_rows], "\n")
Yielding:
original DataFrame:
     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
2  foo    two  2   4
3  bar  three  3   6
4  foo    two  4   8
5  bar    two  5  10
6  foo    one  6  12
7  foo  three  7  14

exceptions noted:
     A      B  C   D Exception
0  foo    one  0   0
1  bar    one  1   2         A
2  foo    two  2   4
3  bar  three  3   6      A, B
4  foo    two  4   8
5  bar    two  5  10         A
6  foo    one  6  12
7  foo  three  7  14         B

valid data:
     A    B  C   D Exception
0  foo  one  0   0
2  foo  two  2   4
4  foo  two  4   8
6  foo  one  6  12

not valid:
     A      B  C   D Exception
1  bar    one  1   2         A
3  bar  three  3   6      A, B
5  bar    two  5  10         A
7  foo  three  7  14         B
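The check_validity function is the crux of the operation. It looks at each column in turn, using the isin method to test set membership. It constructs a second DataFrame, valid, to record which cells pass or fail the test. It then uses the very handy itertools.compress, driven by a pandas boolean selection (the negated row of valid for the current row label), to pick out the column names that are "the items not valid for this row" and joins them. Collect that list of invalid items for every row in the DataFrame, and we're home.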
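To see what itertools.compress contributes in isolation, here is a tiny sketch using plain lists. The is_valid flags describe one hypothetical row and are made up for illustration.

```python
from itertools import compress

column_names = ["A", "B"]
# Validity flags for one hypothetical row: A failed, B passed.
is_valid = [False, True]

# compress keeps the items whose selector is truthy, so to list the
# FAILED columns we pass the negated flags as selectors.
failed = list(compress(column_names, (not ok for ok in is_valid)))
print(", ".join(failed))  # prints: A
```

In the program above, the negated row of valid plays the role of the selectors and valid.columns supplies the names.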
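Answer 2 (score: 1)
~df.columnName.isin(bounds)
evaluates to Boolean values (True, False). You need to do that check first and then add your IDs. Depending on how you read in your data, you can either go through the rows, check for exceptions, and add them; or read the rows into separate DataFrames so that you end up with two DataFrames.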
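A small sketch of the "check first, then split" idea might look like this, assuming a recent pandas. The three-row DataFrame and the names mask, passed, and failed are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo"],
    "B": ["one", "one", "three"],
})

# isin returns a boolean Series with one True/False entry per row.
mask = df["A"].isin(["foo"])

# The mask (and its negation) can then index the DataFrame directly.
passed = df[mask]
failed = df[~mask]
```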
Answer 3 (score: 1)
Here is a working solution:
import pandas as pd

def set_exception(row, col):
    if row['Exception'] is None:
        row['Exception'] = [col]
    else:
        row['Exception'].append(col)

def f(row, allowed_col_vals):
    for col in row.keys():
        if col in allowed_col_vals:
            if row[col] not in allowed_col_vals[col]:
                set_exception(row, col)
    return row

allowed_col_vals = {
    'A': ['foo'],
    'B': ['one', 'two']
}

df = pd.read_csv('data.csv')
df['Exception'] = None

# apply f to each row of df
df = df.apply(f, axis=1, args=(allowed_col_vals,))

# df['Exception'] is a Series and map applies the function element-wise
valid_rows = df[df['Exception'].map(lambda x: not bool(x))]
invalid_rows = df[df['Exception'].map(bool)]
The output is:
# valid rows:
     A    B  C   D Exception
0  foo  one  0   0      None
2  foo  two  2   4      None
4  foo  two  4   8      None
6  foo  one  6  12      None
# invalid rows:
     A      B  C   D Exception
1  bar    one  1   2       [A]
3  bar  three  3   6    [A, B]
5  bar    two  5  10       [A]
7  foo  three  7  14       [B]
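Here the Exception cells hold Python lists (or None). If the comma-separated text from the question is preferred, the lists could be joined afterwards. A minimal sketch, assuming an object-dtype Series shaped like the column above (the exceptions Series here is made-up sample data):

```python
import pandas as pd

# Hypothetical stand-in for df['Exception']: lists for bad rows, None otherwise.
exceptions = pd.Series([None, ["A"], None, ["A", "B"]], dtype=object)

# Join each list into text; None (or an empty list) becomes an empty string.
as_text = exceptions.map(lambda cols: ", ".join(cols) if cols else "")
```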