Question

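I have data stored in the following form:

[[[{}][{}]]]

that is, a list of lists of two lists of dictionaries,

where:

{}: a dictionary holding the data for a single frame of an observed event. (There are two observers/stations, hence two dictionaries.)

[{}][{}]: two lists of all the individual frames associated with a single event, one per observer/station.

[[{}][{}]]: a list of all events in a single night of observation.

[[[{}][{}]]]: a list of all nights.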

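Hopefully that is clear. What I want to do is create two pandas dataframes, one holding all the dictionaries from station_1 and the other holding all the dictionaries from station_2.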
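To make the nesting concrete, here is a minimal sketch with made-up values. It assumes that `data[night][station]` is a list of frame dictionaries; the keys `frame` and `brightness` are hypothetical placeholders for the real per-frame fields.

```python
# Toy stand-in for the real data: 2 nights x 2 stations x a few frames.
# The keys "frame" and "brightness" are hypothetical placeholders.
data = [
    [  # night 0
        [{"frame": 0, "brightness": 1.2}, {"frame": 1, "brightness": 1.5}],  # station 1
        [{"frame": 0, "brightness": 0.9}],                                   # station 2
    ],
    [  # night 1
        [{"frame": 0, "brightness": 2.0}],                                   # station 1
        [{"frame": 0, "brightness": 1.8}, {"frame": 1, "brightness": 1.7}],  # station 2
    ],
]

n_nights = len(data)         # number of nights
n_stations = len(data[0])    # stations per night
first_frame = data[0][0][0]  # first frame seen by station 1 on night 0
```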

My current approach is as follows (where data is the data structure described above):

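import pandas as pd

# Lists that collect the per-night dataframes before concatenation
all_station_1 = []
all_station_2 = []

for night in range(len(data)):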

    station_1 = pd.DataFrame(data[night][0])
    station_2 = pd.DataFrame(data[night][1])

    all_station_1.append(station_1)
    all_station_2.append(station_2)

all_station_1 = pd.concat(all_station_1)
all_station_2 = pd.concat(all_station_2)

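My understanding is that the for loop must be very inefficient. Since I will be scaling this script up from my sample dataset, the cost could easily become unmanageable.

So any advice on a smarter way to do this would be appreciated! pandas feels so user-friendly that I assume it has an efficient way of handling any kind of data structure, but I haven't been able to find it myself. Thanks!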

Answer 1

I don't think you can really avoid using loops here, unless you want to call jq via sh. See this answer.
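If only the two per-station dataframes are needed, one possible sketch is to flatten each station's frames across all nights and build each DataFrame once. This still loops in Python (inside the generator), but it avoids creating and concatenating one DataFrame per night. The toy data, its keys, and the helper name `frames_for_station` are hypothetical, and the sketch assumes `data[night][station]` is a list of dictionaries.

```python
from itertools import chain

import pandas as pd

# Hypothetical toy data: data[night][station] is a list of frame dicts.
data = [
    [[{"frame": 0, "brightness": 1.2}, {"frame": 1, "brightness": 1.5}],
     [{"frame": 0, "brightness": 0.9}]],
    [[{"frame": 0, "brightness": 2.0}],
     [{"frame": 0, "brightness": 1.8}, {"frame": 1, "brightness": 1.7}]],
]


def frames_for_station(data, station):
    """Collect every frame dict for one station across all nights."""
    records = chain.from_iterable(night[station] for night in data)
    return pd.DataFrame(list(records))


station_1 = frames_for_station(data, 0)
station_2 = frames_for_station(data, 1)
```

Note that this drops the information about which night a frame came from; if that matters, a multi-indexed frame is the better fit.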

Anyway, using your full example, I managed to parse it into a multi-indexed dataframe, which I think is what you want.

import datetime
import re
import json
import pandas as pd

data=None
with open('datasample.txt', 'r') as f:
    data=f.readlines()
# There's only one line
data=data[0]

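# Replace single quotes with double quotes: I did that in the .txt file itself, you could do it using re

# Fix the datetime problem: turn each datetime.datetime(...) literal into a quoted ISO string
# (eval is only acceptable here because the file is trusted)
cleaned_data = re.sub(r'(datetime\.datetime\(.*?\))', lambda x: '"'+ str(eval(x.group(0)).isoformat())+'"', data)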
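As an illustration of what that substitution does, here is a small sketch on a made-up one-line sample. It swaps `eval` for explicit integer parsing, which is an alternative and not what the snippet above does. The sample string and the helper name `to_iso` are hypothetical, and the sketch assumes the datetime literals contain only positional integer arguments (no `tzinfo` or keyword arguments).

```python
import json
import re
from datetime import datetime

# Hypothetical sample in the same style as the file contents.
sample = '[{"###": 1, "DateTime": datetime.datetime(2018, 3, 5, 21, 14, 7)}]'


def to_iso(match):
    # Parse the comma-separated integers inside the parentheses.
    parts = [int(p) for p in match.group(1).split(",")]
    return '"' + datetime(*parts).isoformat() + '"'


cleaned = re.sub(r"datetime\.datetime\((.*?)\)", to_iso, sample)
parsed = json.loads(cleaned)  # now valid JSON
```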

Now that the string from the file is valid JSON, we can load it:

json_data = json.loads(cleaned_data)

We can then process it into a dataframe:

# List to store the dfs before concat
all_ = []
for n, night in enumerate(json_data):
    for s, station in enumerate(night):
        events = pd.DataFrame(station)
        # Set index to the event number
        events = events.set_index('###')
        # Prepend night number and station number to index
        events.index = pd.MultiIndex.from_tuples([(n, s, x) for x in events.index])
        all_.append(events)

df_all = pd.concat(all_)
# Rename the index levels
df_all.index.names = ['Night','Station','Event']
# Convert to datetime
df_all.DateTime = pd.to_datetime(df_all.DateTime)
df_all
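To get back to the two per-station dataframes the question asks for, one option is to take a cross-section on the `Station` level. The sketch below uses a small stand-in frame (`df_demo` and its `value` column are hypothetical) with the same index level names as `df_all`.

```python
import pandas as pd

# Small stand-in for df_all with the same index level names.
index = pd.MultiIndex.from_tuples(
    [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)],
    names=["Night", "Station", "Event"],
)
df_demo = pd.DataFrame({"value": [10, 20, 30, 40]}, index=index)

# Select each station; xs drops the selected level by default,
# leaving Night and Event as the remaining index levels.
station_1 = df_demo.xs(0, level="Station")
station_2 = df_demo.xs(1, level="Station")
```

The same calls should work on `df_all` itself, since only the level name is used.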

(Truncated) result:

Dataframe from complex data structure
