Question

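I'm cleaning data and have a set of valid values for each column that contains enumerated data. I want to split the dataset into the rows that are OK and the rows that contain invalid column data. The trick is that each row with invalid data needs to populate a special column listing the names of the columns that are invalid for that row.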

For example, given the following table:

      A      B  C   D
 0  foo    one  0   0
 1  bar    one  1   2
 2  foo    two  2   4
 3  bar  three  3   6
 4  foo    two  4   8
 5  bar    two  5  10
 6  foo    one  6  12
 7  foo  three  7  14

With the constraints that col A can be {'foo'} and col B can be {'one', 'two'}, the output dataframes should look like this:

Valid rows:

      A      B  C   D
 0  foo    one  0   0
 2  foo    two  2   4
 4  foo    two  4   8
 6  foo    one  6  12

Invalid rows:

      A      B  C   D  Exception
 1  bar    one  1   2   A
 3  bar  three  3   6   A, B
 5  bar    two  5  10   A
 7  foo  three  7  14   B

As a pandas newbie, I took a stab at this as follows:

columnBounds = {'A' : {'foo'}, 'B':{'one', 'two'}}
df['exception'] = ''
for columnName, bounds in columnBounds.iteritems():
    idlist = df[~df.columnName.isin(bounds)].index.tolist()
    for ix in idlist:
        if df.loc[ix, 'exception'] == '':
            df.loc[ix, 'exception'] = str(ix)
        else:
            df.loc[ix, 'exception'] += ', {}'.format(str(ix))

baddf = df[df.exception.isin([''])]
gooddf = df[~df.exception.isin([''])]

This code is wrong in a number of ways, but primarily the line:

idlist = df[~df.columnName.isin(bounds)].index.tolist()

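fails because 'columnName' is used in the context of df, which expects the literal name of a column rather than a variable holding it. How can I fix this, and/or what is the "right" way to solve the original problem? The way I collect the list is also flawed, although I'm not clear on how to store and manipulate lists embedded in pandas cells.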

Thanks!

Answer 1

isin accepts a dictionary, which simplifies the hard part considerably:

>>> good_dict = {"A": ["foo"], "B": ["one", "two"]}
>>> invalid = ~df[list(good_dict)].isin(good_dict)
>>> df["Exception"] = invalid.apply(lambda x: ','.join(invalid.columns[x]), axis=1)
>>> df
     A      B  C   D Exception
0  foo    one  0   0          
1  bar    one  1   2         A
2  foo    two  2   4          
3  bar  three  3   6       A,B
4  foo    two  4   8          
5  bar    two  5  10         A
6  foo    one  6  12          
7  foo  three  7  14         B

After that, the split is easy:

>>> any_exception = invalid.any(axis=1)
>>> df[any_exception]
     A      B  C   D Exception
1  bar    one  1   2         A
3  bar  three  3   6       A,B
5  bar    two  5  10         A
7  foo  three  7  14         B
>>> df[~any_exception]
     A    B  C   D Exception
0  foo  one  0   0          
2  foo  two  2   4          
4  foo  two  4   8          
6  foo  one  6  12

I like having an empty Exception column for the rows that pass, but we can avoid that if we need to.
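For anyone who would rather not carry an empty Exception column on the passing rows, here is a minimal sketch of one way to do it. It rebuilds the sample table from the question; the names `valid_rows` and `invalid_rows` are my own, and the Exception text is built only for the failing rows.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": range(8),
    "D": range(0, 16, 2),
})
good_dict = {"A": ["foo"], "B": ["one", "two"]}

# True where a cell's value is NOT among the permitted values for its column
invalid = ~df[list(good_dict)].isin(good_dict)
any_exception = invalid.any(axis=1)

# Passing rows keep the original columns only
valid_rows = df[~any_exception]

# Failing rows get the names of the offending columns
invalid_rows = df[any_exception].copy()
invalid_rows["Exception"] = [
    ", ".join(invalid.columns[mask])
    for mask in invalid[any_exception].to_numpy()
]

print(valid_rows)
print(invalid_rows)
```

The `.copy()` is there so that adding the new column does not trigger a chained-assignment warning on a slice of `df`.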

Answer 2

Here's how I'd do it. First the program, then the output, then an explanation.

from pandas import DataFrame
from itertools import compress

# define DataFrame

rows = [
    ["foo", "one", 0, 0],
    ["bar", "one", 1, 2],
    ["foo", "two", 2, 4],
    ["bar", "three", 3, 6],
    ["foo", "two", 4, 8],
    ["bar", "two", 5, 10],
    ["foo", "one", 6, 12],
    ["foo", "three", 7, 14],
]

df = DataFrame(data=rows, columns=list("ABCD"))

print("original DataFrame:")
print(df, "\n")

# define what values are permitted in each column
permitted = {
    'A': set(["foo"]),
    'B': set(["one", "two"]),
}

def check_validity(df, permitted):
    """
    Given a DataFrame and a dict of permitted values for
    each column, determine which cells are valid given
    those rules. Amend the DataFrame to note which rows have
    exceptions. Return a second DataFrame that indicates which
    cells were valid.
    """

    # first determine, for each column in the list of rules, what
    # cells are valid / invalid by that rule
    valid_cols = [colname for colname in df.columns if colname in permitted]
    valid = DataFrame(columns=valid_cols, index=df.index)
    for colname, permitted_values in permitted.items():
        valid[colname] = df[colname].isin(permitted_values)

    # add an Exception column that for each row, lists just the columns
    # that were found NOT to be valid (.loc replaces the removed .ix indexer)
    df["Exception"] = [', '.join(compress(valid.columns, ~valid.loc[i])) for i in df.index]
    return valid


valid = check_validity(df, permitted)

print("exceptions noted:")
print(df, "\n")

valid_rows = valid["A"] & valid["B"]

# the good kids
print("valid data:")
print(df[valid_rows], "\n")

# the problem children
print("not valid:")
print(df[~valid_rows], "\n")

Yields:

original DataFrame:
     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
2  foo    two  2   4
3  bar  three  3   6
4  foo    two  4   8
5  bar    two  5  10
6  foo    one  6  12
7  foo  three  7  14

exceptions noted:
     A      B  C   D Exception
0  foo    one  0   0
1  bar    one  1   2         A
2  foo    two  2   4
3  bar  three  3   6      A, B
4  foo    two  4   8
5  bar    two  5  10         A
6  foo    one  6  12
7  foo  three  7  14         B

valid data:
     A    B  C   D Exception
0  foo  one  0   0
2  foo  two  2   4
4  foo  two  4   8
6  foo  one  6  12

not valid:
     A      B  C   D Exception
1  bar    one  1   2         A
3  bar  three  3   6      A, B
5  bar    two  5  10         A
7  foo  three  7  14         B

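The check_validity function is the heart of the operation. It looks at each column using the isin method to test for set membership. It constructs a second DataFrame, valid, to record which cells passed or failed the test. Then it uses the very handy itertools.compress, with the inverted boolean row of valid as the selector, to pick out the column names that are "the items not valid for this row" and joins them. Collect that list of invalid items for every row of the DataFrame, and we're home.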
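In case itertools.compress is unfamiliar, here is a small standalone sketch of the selection step, with plain lists standing in for one row of valid. The helper name `failed_columns` and the sample flags are made up for illustration.

```python
from itertools import compress

def failed_columns(column_names, passed_flags):
    """Return the names whose flag is False, joined with ', '."""
    # compress keeps the items whose selector is truthy, so invert the flags
    selectors = [not passed for passed in passed_flags]
    return ", ".join(compress(column_names, selectors))

# Hypothetical rows: (A passed?, B passed?)
print(failed_columns(["A", "B"], [False, True]))   # A failed
print(failed_columns(["A", "B"], [False, False]))  # both failed
print(failed_columns(["A", "B"], [True, True]))    # nothing failed, empty string
```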

Answer 3

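~df.columnName.isin(bounds) returns a Series of booleans (True, False), one per row. You need to do the check first and then add your ID. Depending on how you read in your data, you could either go through the rows, check for exceptions, and add them, or read the rows into different dataframes to get your two dataframes.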
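As a rough sketch of what that could look like while staying close to the loop in the question: look the column up with df[column_name] so the variable's value is used, iterate the dict with .items(), and record the column name rather than the row index. The snake_case names are my own, and the sample data is rebuilt from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": range(8),
    "D": range(0, 16, 2),
})
column_bounds = {"A": {"foo"}, "B": {"one", "two"}}

df["exception"] = ""
for column_name, bounds in column_bounds.items():
    # Boolean mask: True for rows whose value is outside the permitted set
    bad = ~df[column_name].isin(bounds)
    for ix in df.index[bad]:
        if df.loc[ix, "exception"] == "":
            df.loc[ix, "exception"] = column_name
        else:
            df.loc[ix, "exception"] += ", " + column_name

# An empty string means the row passed every check
bad_df = df[df["exception"] != ""]
good_df = df[df["exception"] == ""].drop(columns="exception")
```

The row-by-row inner loop is kept only to mirror the original attempt; on large frames a vectorized approach would likely be faster.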

Answer 4

Here is a working solution:

import pandas as pd

def set_exception(row, col):
    if row['Exception'] is None:
        row['Exception'] = [col]
    else:
        row['Exception'].append(col)


def f(row, allowed_col_vals):
    for col in row.keys():
        if col in allowed_col_vals:
            if row[col] not in allowed_col_vals[col]:
                set_exception(row, col)
    return row


allowed_col_vals = {
    'A': ['foo'],
    'B': ['one', 'two']
}

df = pd.read_csv('data.csv')
df['Exception'] = None
# apply f to each row of df
df = df.apply(f, axis=1, args=(allowed_col_vals,))
# df['Exception'] is a Series and map applies the function element-wise
valid_rows = df[df['Exception'].map(lambda x: not bool(x))]
invalid_rows = df[df['Exception'].map(bool)]

The output is:

# valid rows:
     A    B  C   D Exception
0  foo  one  0   0      None
2  foo  two  2   4      None
4  foo  two  4   8      None
6  foo  one  6  12      None

# invalid rows:    
     A      B  C   D Exception
1  bar    one  1   2       [A]
3  bar  three  3   6    [A, B]
5  bar    two  5  10       [A]
7  foo  three  7  14       [B]
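Since the question also asks how lists stored in cells can be handled, here is a hedged sketch of an alternative way to build a list-valued Exception column without mutating rows inside apply. It uses a cut-down, made-up four-row table rather than data.csv, and rows that pass get an empty list instead of None.

```python
import pandas as pd

# Hypothetical cut-down sample, not the contents of data.csv
df = pd.DataFrame({
    "A": ["foo", "bar", "bar", "foo"],
    "B": ["one", "one", "three", "three"],
})
allowed_col_vals = {"A": ["foo"], "B": ["one", "two"]}

invalid = ~df[list(allowed_col_vals)].isin(allowed_col_vals)

# One Python list per row, stored in an object-dtype column
df["Exception"] = pd.Series(
    [list(invalid.columns[mask]) for mask in invalid.to_numpy()],
    index=df.index,
)

# An empty list is falsy, so bool() separates passing from failing rows
has_exception = df["Exception"].map(bool)
valid_rows = df[~has_exception]
invalid_rows = df[has_exception]

# .str.join turns each list into display text when a string is preferred
as_text = invalid_rows["Exception"].str.join(", ")
```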

Pandas: split a DataFrame on column-specific conditions and create a column with the reason for the split
