Question

I have a Pandas DataFrame in the following format:

              rtt rexb
asn   country
12345 US      300 0.5
54321 US      150 0.2
12345 MX      160 0.15
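
For anyone wanting to reproduce this, a frame with the same shape can be built roughly as follows. This is only a minimal sketch: the dtypes (integer ASNs, string country codes) are assumptions, since the real data is not shown.

```python
import pandas as pd

# Rows copied from the table above; the column dtypes are assumed.
df = pd.DataFrame(
    {
        "asn": [12345, 54321, 12345],
        "country": ["US", "US", "MX"],
        "rtt": [300, 150, 160],
        "rexb": [0.5, 0.2, 0.15],
    }
).set_index(["asn", "country"])  # two-level index: (asn, country)

print(df)
```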

I'd like to dump two JSON files: one containing the list of all countries for a given ASN, and another containing all ASNs for a given country:

country-by-asn.json:
{
    "12345": ["US", "MX"],
    "54321": ["US"]
}

asn-by-country.json:
{
    "US": ["12345", "54321"],
    "MX": ["12345"]
}

I'm currently doing the following:

asns = df.index.levels[0]
countries = df.index.levels[1]

country_by_asn = {}
asn_by_country = {}

for asn in asns:
    by_asn = df.loc[[d == asn for d in df.index.get_level_values("asn")]]
    country_by_asn[asn] = list(by_asn.index.get_level_values("country"))

for country in countries:
    by_country = df.loc[[d == country for d in df.index.get_level_values("country")]]
    asn_by_country[country] = list(by_country.index.get_level_values("asn"))

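This works, but it feels a bit clunky. Is there a more efficient way (in terms of processing power, not necessarily code complexity) to get the same output?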

Experimentally confirmed to be "clunky": running this on 68,000 rows of data took 435 seconds.

Answer 1

Use reset_index with groupby, convert the values to lists, and finally call to_json (experimentally, this ran in 2.2 seconds on 68,000 rows of data):

df1 = df.reset_index()

a = df1.groupby('asn')['country'].apply(list).to_json()
b = df1.groupby('country')['asn'].apply(list).to_json()
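
As a self-contained sketch of the groupby approach on the three sample rows from the question: the astype(str) cast is an addition that is not part of the snippet above, included on the assumption that the ASNs should appear as quoted strings inside the lists, as in the desired output.

```python
import json
import pandas as pd

# Sample frame rebuilt from the question's table (dtypes assumed).
df = pd.DataFrame(
    {"rtt": [300, 150, 160], "rexb": [0.5, 0.2, 0.15]},
    index=pd.MultiIndex.from_tuples(
        [(12345, "US"), (54321, "US"), (12345, "MX")],
        names=["asn", "country"],
    ),
)

df1 = df.reset_index()
# Assumption: cast ASNs to str so list values are quoted in the JSON.
df1["asn"] = df1["asn"].astype(str)

a = df1.groupby("asn")["country"].apply(list).to_json()
b = df1.groupby("country")["asn"].apply(list).to_json()

# Parse the JSON strings back to check their shape.
country_by_asn = json.loads(a)
asn_by_country = json.loads(b)
print(country_by_asn)
print(asn_by_country)
```

Without the cast, the lists in b hold JSON numbers rather than strings, while the keys of a are strings either way because JSON object keys are always strings.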

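Or a pure Python solution: first create a list of tuples, then build the dictionaries, and finally dump them to JSON (experimentally, this ran in 0.06 seconds on 68,000 rows of data):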

import json

l = df.index.tolist()

a, b = {}, {}
for x, y in l:
    a.setdefault(x, []).append(y)
    b.setdefault(y, []).append(x)

a = json.dumps(a)
b = json.dumps(b)

A similar solution (experimentally, this ran in 0.06 seconds on 68,000 rows of data):

import json
from collections import defaultdict

l = df.index.tolist()

a, b = defaultdict(list), defaultdict(list)

for n, v in l:
    a[n].append(v)
    b[v].append(n)

a = json.dumps(a)
b = json.dumps(b)
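
To get a rough feel for the timing without the original data, the loop can be wrapped in a function and run over synthetic pairs. This is only a sketch: build_maps is a hypothetical helper name, and the random ASNs and country codes are made-up stand-ins for df.index.tolist(). Unlike a real index, the synthetic pairs may repeat, and timings will differ by machine.

```python
import random
import timeit
from collections import defaultdict


def build_maps(pairs):
    """Return (first -> [second], second -> [first]) for an iterable of 2-tuples."""
    a, b = defaultdict(list), defaultdict(list)
    for n, v in pairs:
        a[n].append(v)
        b[v].append(n)
    return dict(a), dict(b)


# Synthetic stand-in for df.index.tolist(); all values here are made up.
random.seed(0)
countries = ["US", "MX", "CA", "BR", "DE", "FR", "JP", "IN"]
pairs = [(random.randint(1, 65535), random.choice(countries)) for _ in range(68_000)]

# Average over 10 runs of a single pass over the pairs.
elapsed = timeit.timeit(lambda: build_maps(pairs), number=10) / 10
print(f"{elapsed:.4f} s per run on {len(pairs)} pairs")
```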

@stevendesu's "newbie" solution (experimentally, this ran in 0.06 seconds on 68,000 rows of data):

l = df.index.tolist()

a, b = {}, {}

for n, v in l:
    if n not in a:
        a[n] = []
    if v not in b:
        b[v] = []
    a[n].append(v)
    b[v].append(n)

a = json.dumps(a)
b = json.dumps(b)

print (a)
{"12345": ["US", "MX"], "54321": ["US"]}

print (b)
{"MX": [12345], "US": [12345, 54321]}
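
The snippets above produce JSON strings. Since the question asks for two files, one way to write them out is sketched below. dump_maps and its out_dir parameter are hypothetical names, the file names are taken from the question, and the str(asn) conversion is an assumption made to match the quoted ASNs in the desired output.

```python
import json
from pathlib import Path


def dump_maps(pairs, out_dir="."):
    """Write country-by-asn.json and asn-by-country.json from (asn, country) pairs."""
    country_by_asn, asn_by_country = {}, {}
    for asn, country in pairs:
        # Assumption: ASNs are stored as strings, as in the question's desired output.
        country_by_asn.setdefault(str(asn), []).append(country)
        asn_by_country.setdefault(country, []).append(str(asn))

    out = Path(out_dir)
    with open(out / "country-by-asn.json", "w") as f:
        json.dump(country_by_asn, f, indent=4)
    with open(out / "asn-by-country.json", "w") as f:
        json.dump(asn_by_country, f, indent=4)
    return country_by_asn, asn_by_country


# Usage with the DataFrame from the question (not executed here):
#     dump_maps(df.index.tolist())
```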

Pandas: a more efficient index dump?
