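I have pandas code that receives data from a REST API as JSON. Depending on the request, the data may be an empty array, or some values may be missing (null values are not sent in these JSON payloads).
I am trying to write pandas code that handles this gracefully, without having to test explicitly for empty. I am currently struggling to do so. For example, the code below runs fine, but only when the data contains at least one row: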
import json
from unittest import TestCase

import dateutil
import numpy as np
import pandas as pd
import pytz


class TestClient(TestCase):
    def test(self):
        # I get data from a JSON API, but it can be empty (or missing some values if those are null)
        json_dict = json.loads("[]")
        dfo = pd.DataFrame.from_dict(json_dict)

        # in case the json is partially empty create required columns
        for col in ("from", "to", "value"):
            if col not in dfo.columns:
                dfo[col] = np.nan

        # remove some duplicates - works fine
        dfo.sort_values(by=['from', 'to'], inplace=True)
        dfo.drop_duplicates(subset='from', keep='last', inplace=True)

        # parse timestamps - works fine
        dfo['from'] = dfo['from'].apply(dateutil.parser.parse)
        dfo['to'] = dfo['to'].apply(dateutil.parser.parse)

        # localize timestamps - works fine
        local_tz = pytz.timezone("Europe/Zurich")
        dfo['from_local'] = dfo['from'].apply(lambda dt: dt.astimezone(local_tz))
        dfo['to_local'] = dfo['to'].apply(lambda dt: dt.astimezone(local_tz))

        # some more datetime maths - works fine
        dfo['duration'] = dfo['to'] - dfo['from']

        # extract the date - fails
        dfo['to_date'] = dfo['to_local'].dt.date  # fails with AttributeError: Can only use .dt accessor with datetimelike values
        # But I could use the code below instead, which does the same thing, and works
        # dfo['to_date'] = dfo['to_local'].apply(lambda r: r.date())

        # calculate some mean - works fine
        the_mean = dfo['value'].mean()  # OK, returns NaN
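Can you recommend a robust way to handle DataFrames that may be empty? Is there a best practice?
In the code above, can I declare the data types to avoid the AttributeError?
Is my expectation wrong that the same processing should also run on an empty DataFrame? (Otherwise you really have to imagine and test every possible edge case.)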
Answer 0 (score: 1)
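The problem is that the empty columns in the newly created DataFrame have dtype float64, which is not datetimelike.
The simplest fix is therefore to explicitly convert every column on which you want to use the dt accessor to a datetime type: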
dfo['to_local'] = pd.to_datetime(dfo['to_local'])
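As a minimal sketch of what that conversion changes, the snippet below rebuilds only the empty to_local column from the question (the rest of the pipeline is left out, and the variable name before is introduced here purely for illustration). It assumes a reasonably recent pandas.

```python
import numpy as np
import pandas as pd

# An empty frame whose missing column is filled in with NaN,
# as in the question's "create required columns" loop.
dfo = pd.DataFrame()
dfo["to_local"] = np.nan
before = dfo["to_local"].dtype  # a float dtype, so .dt would raise AttributeError

# Explicit conversion: the column becomes datetime64 even with zero rows.
dfo["to_local"] = pd.to_datetime(dfo["to_local"])

# The .dt accessor is now available and simply returns an empty result.
dfo["to_date"] = dfo["to_local"].dt.date

print(before, "->", dfo["to_local"].dtype)
print(len(dfo))
```

Note that the converted column here is timezone-naive; passing utc=True to pd.to_datetime is one way to get a timezone-aware column instead.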
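You only need to do this once, for example right after creating the DataFrame. If you later remove all rows from the DataFrame so that it becomes empty, it keeps its column dtypes.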
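The sketch below illustrates that last point, and also one possible way to declare the dtypes up front as the question asks. The sample rows, the filter threshold, and the names rows, empty, dates and declared are all made up for illustration; only the column names come from the question. Declaring dtypes through empty pd.Series objects is one option among several, not necessarily what the answer's author had in mind.

```python
import pandas as pd

# Hypothetical two-row payload using the question's column names.
rows = [
    {"from": "2020-01-01T00:00:00+00:00", "to": "2020-01-01T01:00:00+00:00", "value": 1.0},
    {"from": "2020-01-02T00:00:00+00:00", "to": "2020-01-02T01:00:00+00:00", "value": 2.0},
]
dfo = pd.DataFrame(rows)
dfo["to"] = pd.to_datetime(dfo["to"], utc=True)

# Filter out every row: the frame is empty, but "to" is still datetime64.
empty = dfo[dfo["value"] > 100]
dates = empty["to"].dt.tz_convert("Europe/Zurich").dt.date  # no AttributeError

# Alternative (an assumption, not from the answer): declare dtypes at creation
# so that an empty API response already has datetimelike columns.
declared = pd.DataFrame({
    "from": pd.Series(dtype="datetime64[ns, UTC]"),
    "to": pd.Series(dtype="datetime64[ns, UTC]"),
    "value": pd.Series(dtype="float64"),
})

print(empty.dtypes)
print(declared.dtypes)
```

With either approach the later steps (.dt accessors, arithmetic between timestamp columns, mean()) see the same dtypes whether or not any rows arrived.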