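Pandas: split a DataFrame on column-specific conditions and create a column with the reason for the split

Date: 2015-03-11 17:07:51

Tags: python pandas

I am cleaning data and have a set of valid values for each column that contains enumerated data. I want to split the dataset into rows that are all OK and rows that contain invalid column data. The trick is that each row with invalid data needs a special column populated with the list of column names that are invalid for that row.

For example, given the following table: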

      A      B  C   D
 0  foo    one  0   0
 1  bar    one  1   2
 2  foo    two  2   4
 3  bar  three  3   6
 4  foo    two  4   8
 5  bar    two  5  10
 6  foo    one  6  12
 7  foo  three  7  14

With the constraints that col A must be in {'foo'} and col B must be in {'one','two'}, the output DataFrames should look like this:

Valid rows:

      A      B  C   D
 0  foo    one  0   0
 2  foo    two  2   4
 4  foo    two  4   8
 6  foo    one  6  12

Invalid rows:

      A      B  C   D  Exception
 1  bar    one  1   2   A
 3  bar  three  3   6   A, B
 5  bar    two  5  10   A
 7  foo  three  7  14   B

As a pandas newbie, I attempted it as follows:

columnBounds = {'A' : {'foo'}, 'B':{'one', 'two'}}
df['exception'] = ''
for columnName, bounds in columnBounds.iteritems():
    idlist = df[~df.columnName.isin(bounds)].index.tolist()
    for ix in idlist:
        if df.loc[ix, 'exception'] == '':
            df.loc[ix, 'exception'] = str(ix)
        else:
            df.loc[ix, 'exception'] += ', {}'.format(str(ix))

baddf = df[df.exception.isin([''])]
gooddf = df[~df.exception.isin([''])]

This code is wrong in a lot of ways, but mainly the line:

idlist = df[~df.columnName.isin(bounds)].index.tolist()

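fails, because 'columnName' cannot be used this way on df: attribute access expects the literal name of a column, not a variable that holds it. How do I get around this, and/or what is the "right" way to solve the original problem? The way I collect the list is also problematic, though I am not clear on how to store and manipulate lists embedded in pandas cells.

Thanks!

4 answers:

Answer 0 (score: 3)

isin accepts a dictionary, which greatly simplifies the hard part: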

>>> good_dict = {"A": ["foo"], "B": ["one", "two"]}
>>> invalid = ~df[list(good_dict)].isin(good_dict)
>>> df["Exception"] = invalid.apply(lambda x: ','.join(invalid.columns[x]), axis=1)
>>> df
     A      B  C   D Exception
0  foo    one  0   0          
1  bar    one  1   2         A
2  foo    two  2   4          
3  bar  three  3   6       A,B
4  foo    two  4   8          
5  bar    two  5  10         A
6  foo    one  6  12          
7  foo  three  7  14         B

which can then be split easily:

>>> any_exception = invalid.any(axis=1)
>>> df[any_exception]
     A      B  C   D Exception
1  bar    one  1   2         A
3  bar  three  3   6       A,B
5  bar    two  5  10         A
7  foo  three  7  14         B
>>> df[~any_exception]
     A    B  C   D Exception
0  foo  one  0   0          
2  foo  two  2   4          
4  foo  two  4   8          
6  foo  one  6  12          

I like having an empty Exception column on the rows that pass, but we could avoid adding it if we needed to.
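One way to avoid the empty column on passing rows is to split first and attach Exception only to the failing rows. This is a minimal sketch rather than part of the original answer. It rebuilds the sample df inline, and the names valid_rows and invalid_rows are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": range(8),
    "D": range(0, 16, 2),
})
good_dict = {"A": ["foo"], "B": ["one", "two"]}

# True where a cell is NOT among the permitted values for its column
invalid = ~df[list(good_dict)].isin(good_dict)
any_exception = invalid.any(axis=1)

# Passing rows keep the original columns only
valid_rows = df[~any_exception]

# Failing rows get the Exception column; .copy() keeps df itself unchanged
invalid_rows = df[any_exception].copy()
invalid_rows["Exception"] = invalid[any_exception].apply(
    lambda row: ",".join(invalid.columns[row.values]), axis=1
)
```

With this variant, df is left untouched and only invalid_rows carries the extra column.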

Answer 1 (score: 2)

Here is how I would do it. First the program, then the output, then the explanation.

from pandas import DataFrame
from itertools import compress

# define DataFrame

rows = [
    ["foo", "one", 0, 0],
    ["bar", "one", 1, 2],
    ["foo", "two", 2, 4],
    ["bar", "three", 3, 6],
    ["foo", "two", 4, 8],
    ["bar", "two", 5, 10],
    ["foo", "one", 6, 12],
    ["foo", "three", 7, 14],
]

df = DataFrame(data=rows, columns=list("ABCD"))

print("original DataFrame:")
print(df, "\n")

# define what values are permitted in each column
permitted = {
    'A': set(["foo"]),
    'B': set(["one", "two"]),
}

def check_validity(df, permitted):
    """
    Given a DataFrame and a dict of permitted values for
    each column, determine which cells are valid given
    those rules. Amend the DataFrame to note which rows have
    exceptions. Return a second DataFrame that indicates which
    cells were valid.
    """

    # first determine, for each column in the list of rules, what
    # cells are valid / invalid by that rule
    valid_cols = [ colname for colname in df.columns if colname in permitted ]
    valid = DataFrame(columns=valid_cols, index=df.index)
    for colname, permitted_values in permitted.items():
        valid[colname] = df[colname].isin(permitted_values)

    # add an Exception column that for each row, lists just the columns
    # that were found NOT to be valid (.loc replaces the removed .ix accessor)
    df["Exception"] = [ ', '.join(compress(valid.columns, ~valid.loc[i])) for i in df.index ]
    return valid


valid = check_validity(df, permitted)

print("exceptions noted:")
print(df, "\n")

valid_rows = valid["A"] & valid["B"]

# the good kids
print("valid data:")
print(df[valid_rows], "\n")

# the problem children
print("not valid:")
print(df[~valid_rows], "\n")

Yields:

original DataFrame:
     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
2  foo    two  2   4
3  bar  three  3   6
4  foo    two  4   8
5  bar    two  5  10
6  foo    one  6  12
7  foo  three  7  14

exceptions noted:
     A      B  C   D Exception
0  foo    one  0   0
1  bar    one  1   2         A
2  foo    two  2   4
3  bar  three  3   6      A, B
4  foo    two  4   8
5  bar    two  5  10         A
6  foo    one  6  12
7  foo  three  7  14         B

valid data:
     A    B  C   D Exception
0  foo  one  0   0
2  foo  two  2   4
4  foo  two  4   8
6  foo  one  6  12

not valid:
     A      B  C   D Exception
1  bar    one  1   2         A
3  bar  three  3   6      A, B
5  bar    two  5  10         A
7  foo  three  7  14         B

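The check_validity function is the key to the operation. It looks at each column with the isin method to test set membership. It constructs a second DataFrame, valid, to record which cells passed or failed the test. Then it uses the very handy itertools.compress, driven by the negated row of valid for each row label, to pick out the column names that are "the items not valid for this row", and joins them. Collect that list of invalid items for every row of the DataFrame, and we are home.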
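For readers who have not met itertools.compress, here is a small standalone sketch of the selection step for a single row. The two lists are made-up stand-ins for valid.columns and one row of valid, not data taken from the program above.

```python
from itertools import compress

column_names = ["A", "B"]
# One row of the validity table: A failed its check, B passed
row_is_valid = [False, True]

# compress keeps the items whose selector is truthy, so negate the
# validity flags to keep only the names of the failing columns
failing = list(compress(column_names, [not ok for ok in row_is_valid]))
exception_text = ", ".join(failing)
print(exception_text)  # A
```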

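Answer 2 (score: 1)

~df.columnName.isin(bounds) returns booleans (True, False). You need to do the check first and then add your IDs. Depending on how you read in the data, you could either go through the rows, check for exceptions and add them, or read the data into separate DataFrames so that you end up with two.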
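As a concrete illustration of "check first, then add": when the column name is held in a variable, bracket indexing (df[column_name]) is the usual form, whereas attribute access looks for a column literally named columnName. The following is a minimal sketch of the question's loop rewritten that way, using a shortened, made-up version of the sample data. It assumes the intent was to record the failing column name rather than the row index.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "three"],
})
column_bounds = {"A": {"foo"}, "B": {"one", "two"}}

df["exception"] = ""
for column_name, bounds in column_bounds.items():
    # Bracket indexing accepts a variable holding the column name
    bad = ~df[column_name].isin(list(bounds))
    # Prefix a separator only where an earlier column already failed
    separator = df.loc[bad, "exception"].map(lambda s: ", " if s else "")
    df.loc[bad, "exception"] = df.loc[bad, "exception"] + separator + column_name

# Rows with a non-empty exception string are the invalid ones
bad_df = df[df["exception"] != ""]
good_df = df[df["exception"] == ""]
```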

Answer 3 (score: 1)

Here is a working solution:

import pandas as pd

def set_exception(row, col):
    if row['Exception'] is None:
        row['Exception'] = [col]
    else:
        row['Exception'].append(col)


def f(row, allowed_col_vals):
    for col in row.keys():
        if col in allowed_col_vals:
            if row[col] not in allowed_col_vals[col]:
                set_exception(row, col)
    return row


allowed_col_vals = {
    'A': ['foo'],
    'B': ['one', 'two']
}

df = pd.read_csv('data.csv')
df['Exception'] = None
# apply f to each row of df
df = df.apply(f, axis=1, args=(allowed_col_vals,))
# df['Exception'] is a Series and map applies the function element-wise
valid_rows = df[df['Exception'].map(lambda x: not bool(x))]
invalid_rows = df[df['Exception'].map(bool)]

The output is:

# valid rows:
     A    B  C   D Exception
0  foo  one  0   0      None
2  foo  two  2   4      None
4  foo  two  4   8      None
6  foo  one  6  12      None

# invalid rows:    
     A      B  C   D Exception
1  bar    one  1   2       [A]
3  bar  three  3   6    [A, B]
5  bar    two  5  10       [A]
7  foo  three  7  14       [B]
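
The Exception column here holds Python lists (or None) rather than the comma-separated text shown in the question. If the text form is wanted, the lists can be joined afterwards. This is a small sketch on a stand-in Series whose values are copied from rows 0 to 3 of the output above. The names exceptions and as_text are illustrative.

```python
import pandas as pd

# Stand-in for df['Exception'] on rows 0..3
exceptions = pd.Series([None, ["A"], None, ["A", "B"]], index=[0, 1, 2, 3])

# Join each list into text; None (no exceptions) becomes an empty string
as_text = exceptions.map(lambda cols: ", ".join(cols) if cols else "")
```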