Question

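I have a large data frame containing (Norwegian) social security numbers. The date of birth can be derived from this number with a special algorithm. However, every now and then an invalid social security number sneaks into the database and breaks the computation.

What I would like to do is flag every row that has an invalid social security number, along with a log message showing the error.

Consider the following constructed example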

import pandas as pd
from datetime import date

sample_data = pd.DataFrame({'id' : [1, 2, 3], \
                            'sec_num' : [19790116, 19480631, 19861220]})


# The actual algorithm transforming the sec number is more complicated
# this is just for illustration purposes
def int2date(argdate: int):

    try:
        year = int(argdate / 10000)
        month = int((argdate % 10000) / 100)
        day = int(argdate % 100)
        return date(year, month, day)
    except ValueError:
        raise ValueError("Value:{0} not a legal date.".format(argdate))

I would like to create the following output

   id   sec_num date_of_birth  is_in_error                    error_msg
0   1  19790116    1979-01-16        False  
1   2  19480631          None         True 19480631 is not a legal date         
2   3  19861220    1986-12-20        False

I have tried

try:
    sample_data['date_of_birth'] = [int2date(sec_num) for \
                   sec_num in sample_data['sec_num']]
    sample_data['is_in_error'] = False
    sample_data['error_msg'] = ''
except ValueError as e:
    sample_data['is_in_error'] = True
    sample_data['error_msg'] = str(e)

But this produces the following

   id   sec_num  is_in_error                         error_msg
0   1  19790116         True  Value:19480631 not a legal date.
1   2  19480631         True  Value:19480631 not a legal date.
2   3  19861220         True  Value:19480631 not a legal date.

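I guess the problem is that I assign the date_of_birth column in one operation and the errors in another. I am not sure how to catch the errors and create the is_in_error and error_msg columns at the same time.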

Answer 1

This happens because of the way you populate the data frame.

sample_data['error_msg'] = str(e)

actually overwrites the entire column with str(e).
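
To make the difference concrete, here is a minimal sketch (the `demo` frame and the message text are made up for illustration) contrasting a scalar assignment to a whole column with a single-cell assignment through `.at`:

```python
import pandas as pd

# Hypothetical three-row frame, only for illustration.
demo = pd.DataFrame({'id': [1, 2, 3]})

# Assigning a scalar to a column broadcasts it to every row.
demo['error_msg'] = 'bad value'
broadcast = demo['error_msg'].tolist()

# Assigning through .at touches exactly one cell.
demo['error_msg'] = ''
demo.at[1, 'error_msg'] = 'bad value'
single_cell = demo['error_msg'].tolist()

print(broadcast)
print(single_cell)
```

The first list contains the message in all three positions, the second only at index 1.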

This is probably the most efficient way to do it:

def int2date(argdate: int):

    try:
        year = int(argdate / 10000)
        month = int((argdate % 10000) / 100)
        day = int(argdate % 100)
        return date(year, month, day)
    except ValueError as e:
        pass # you could write the row and the error to your logs here

df['date_of_birth'] = df.sec_num.apply(int2date)
df['is_in_error'] = df.date_of_birth.isnull()

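However, if you also want to write the errors to the data frame, you can use this approach, although it may be much slower (there may be faster solutions).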

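# int2date here must be the version that raises ValueError (as in the
# question), not the one above that swallows the error.
df['date_of_birth'] = None
df['error_msg'] = None
df['is_in_error'] = False
for i, row in df.iterrows():
    try:
        # .at sets a single cell; DataFrame.set_value was removed in pandas 1.0
        df.at[i, 'date_of_birth'] = int2date(row['sec_num'])
    except ValueError as e:
        df.at[i, 'is_in_error'] = True
        df.at[i, 'error_msg'] = str(e)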

It processes each row separately and writes the error only to the matching index instead of updating the whole column.
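
One possible way to avoid the per-cell writes is to collect a (date, message) pair per row and assign whole columns at the end. This is only a sketch, not a benchmarked solution: `parse_or_error` is a hypothetical helper name, and `int2date` is the illustrative parser from the question, not the real social security number algorithm.

```python
import pandas as pd
from datetime import date


def int2date(argdate: int) -> date:
    # Illustrative parser from the question; raises ValueError on bad input.
    n = int(argdate)
    try:
        return date(n // 10000, (n % 10000) // 100, n % 100)
    except ValueError:
        raise ValueError("Value:{0} not a legal date.".format(n))


def parse_or_error(sec_num):
    # Hypothetical helper: (date, '') on success, (None, message) on failure.
    try:
        return int2date(sec_num), ''
    except ValueError as e:
        return None, str(e)


df = pd.DataFrame({'id': [1, 2, 3],
                   'sec_num': [19790116, 19480631, 19861220]})

# One pass over the values, then three whole-column assignments.
results = [parse_or_error(n) for n in df['sec_num']]
df['date_of_birth'] = [d for d, _ in results]
df['error_msg'] = [msg for _, msg in results]
df['is_in_error'] = df['date_of_birth'].isnull()

print(df)
```

Because each row produces its own pair, an invalid value only affects its own row.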

Answer 2

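You are in the realm of processing large data sets. Raising exceptions out of a loop is often not the best idea there, because it usually aborts the loop. Like many others, you don't seem to want that.

A typical way to achieve this is to use a function that does not raise the exception but returns it.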

def int2date(argdate: int):
    try:
        year = int(argdate / 10000)
        month = int((argdate % 10000) / 100)
        day = int(argdate % 100)
        return date(year, month, day)
    except ValueError:
        return ValueError("Value:{0} not a legal date.".format(argdate))

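With this, you can simply map a list of values to the function and receive the exceptions (which lack a traceback, of course, but that shouldn't be a problem here) as values in the result list.

You can then walk through that list, replace the exceptions found with None values, and fill the other columns with the messages contained in the exceptions.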
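
The two steps described above could be sketched as follows. This is an illustration under assumptions rather than the answerer's own code: the column names come from the question, and `results` is a name introduced here.

```python
import pandas as pd
from datetime import date


def int2date(argdate: int):
    # Returns the exception object instead of raising it.
    n = int(argdate)
    try:
        return date(n // 10000, (n % 10000) // 100, n % 100)
    except ValueError:
        return ValueError("Value:{0} not a legal date.".format(n))


sample_data = pd.DataFrame({'id': [1, 2, 3],
                            'sec_num': [19790116, 19480631, 19861220]})

# Step 1: map the values; failures show up as exception objects in the list.
results = list(map(int2date, sample_data['sec_num']))

# Step 2: split the list into the three target columns.
sample_data['date_of_birth'] = [None if isinstance(r, Exception) else r
                                for r in results]
sample_data['is_in_error'] = [isinstance(r, Exception) for r in results]
sample_data['error_msg'] = [str(r) if isinstance(r, Exception) else ''
                            for r in results]

print(sample_data)
```

Only the second row ends up flagged, with its own message in error_msg.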

How to correctly flag corrupt rows in a data frame after an error is raised in Python
