How to gracefully handle empty pandas DataFrames

Time: 2020-05-19 15:01:34

Tags: python pandas

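I have pandas code that receives data as JSON from a REST API. Depending on the request, the data may be an empty array, or some values may be missing (null values are not sent in these JSON packets).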

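I am trying to write pandas code that handles this gracefully, without having to test specifically for the empty case. I am currently struggling to do so. For example, the code below works fine except for one line: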

import json
from unittest import TestCase

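import dateutil.parser  # import the submodule explicitly; "import dateutil" alone does not guarantee dateutil.parser is loaded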
import numpy as np
import pandas as pd
import pytz


class TestClient(TestCase):
    def test(self):

        # I get data from a JSON API, but it can be empty (or missing some values if those are null)        
        json_dict = json.loads("[]")
        dfo = pd.DataFrame.from_dict(json_dict)
        # in case the json is partially empty create required columns
        for col in ("from", "to", "value"):
            if col not in dfo.columns:
                dfo[col] = np.nan

        # remove some duplicates - works fine
        dfo.sort_values(by=['from', 'to'], inplace=True)
        dfo.drop_duplicates(subset='from', keep='last', inplace=True)

        # parse timestamps - works fine
        dfo['from'] = dfo['from'].apply(dateutil.parser.parse)
        dfo['to'] = dfo['to'].apply(dateutil.parser.parse)

        # localize timestamps - works fine
        local_tz = pytz.timezone("Europe/Zurich")
        dfo['from_local'] = dfo['from'].apply(lambda dt: dt.astimezone(local_tz))
        dfo['to_local'] = dfo['to'].apply(lambda dt: dt.astimezone(local_tz))

        # some more datetime maths - works fine
        dfo['duration'] = dfo['to'] - dfo['from']

        # extract the date - fails
        dfo['to_date'] = dfo['to_local'].dt.date # fails with AttributeError: Can only use .dt accessor with datetimelike values
        # But I could use the code below instead, which does the same thing, and works
        # dfo['to_date'] = dfo['to_local'].apply(lambda r: r.date())

        # calculate some mean - works fine 
        the_mean = dfo['value'].mean() # OK, returns NaN

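Can you recommend a reliable way to handle DataFrames that may be empty? Is there a best practice?

In the code above, can I declare the data types to avoid the AttributeError?

Is my expectation wrong that the same processing should also run on an empty DataFrame? (Do you really have to imagine and test every possible edge case?)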

1 Answer:

Answer 0: (score: 1)

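The problem is that the empty column in the newly created DataFrame has dtype float64, which is not datetimelike.
So the simplest approach is to explicitly convert every column on which you want to use the .dt accessor to a datetime type: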

dfo['to_local'] = pd.to_datetime(dfo['to_local'])
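
As a minimal sketch of the effect, assuming the column is created by assigning np.nan to an empty frame as in the question, the conversion can be checked in isolation. The variable dtype_before is only there for illustration:

```python
import numpy as np
import pandas as pd

# Empty frame whose column is created the same way as in the question
dfo = pd.DataFrame()
dfo["to_local"] = np.nan
dtype_before = dfo["to_local"].dtype  # a plain numeric dtype, not datetimelike

# Explicit conversion gives the column a datetime dtype even with zero rows
dfo["to_local"] = pd.to_datetime(dfo["to_local"])

# The .dt accessor is now available, so this no longer raises AttributeError
dfo["to_date"] = dfo["to_local"].dt.date
```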

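You only need to do this once, for example right after creating the DataFrame. If you later remove all rows and the DataFrame becomes empty, it keeps its column dtypes.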
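
The following is a sketch of how that could look when applied right after creation. It is not the question's original approach: it swaps the dateutil and astimezone calls for pd.to_datetime(..., utc=True) and .dt.tz_convert, and it assumes the API returns ISO 8601 strings with a UTC offset. The prepare function and the sample record are hypothetical.

```python
import pandas as pd

# Hypothetical sample payload; the field names follow the question's code
records = [
    {"from": "2020-05-19T10:00:00+00:00",
     "to": "2020-05-19T11:00:00+00:00",
     "value": 1.0},
]


def prepare(records):
    # reindex adds any missing columns (filled with NaN), even for an empty list
    dfo = pd.DataFrame(records).reindex(columns=["from", "to", "value"])
    for col in ("from", "to"):
        # utc=True yields a timezone-aware datetime column, also with zero rows
        dfo[col] = pd.to_datetime(dfo[col], utc=True)
    dfo["to_local"] = dfo["to"].dt.tz_convert("Europe/Zurich")
    dfo["to_date"] = dfo["to_local"].dt.date
    return dfo


full = prepare(records)
empty = prepare([])                    # same code path, no special casing
filtered = full[full["value"] > 100]   # removes every row, dtypes are kept
```

Because the datetime dtype is set once in prepare, the .dt accessor works on full, empty, and filtered alike.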
