Speeding up information extraction from a pandas column

Time: 2019-03-26 09:24:58

Tags: python pandas dictionary

I have a DataFrame with about 200,000 rows and a column that looks like this (one example value):

'{"id":342,"name":"Web","slug":"technology/web","position":15,"parent_id":16,"color":6526716,"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/technology/web"}}}'

I want to extract the name and the slug. I did the following:

df["cat"], df["slug"] = np.nan, np.nan

for i in range(0, len(df.category)):
    df["cat"][i] = df.category.iloc[i].split('"name":"')[1].split('"')[0]
    df["slug"][i] = df.category.iloc[i].split('"name":"')[1].split('"')[4]

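This works fine, but it takes about 4 hours. Is there a way to make it faster?

2 answers:

Answer 0 (score: 1)

Instead of manipulating the DataFrame directly, try working with plain data types and building a DataFrame once at the end. An alternative to jezrael's solution: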

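import json
import pandas as pd

cat, slug = [], []

for row in df.category:
    d = json.loads(row)      # parse the JSON string into a dict
    cat.append(d['name'])    # the category name is stored under the "name" key
    slug.append(d['slug'])

# build the result frame once, after the loop
df = pd.DataFrame({'cat': cat, 'slug': slug})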
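
A variation on the loop above, as a minimal sketch: it keeps the original frame and adds the two columns to it instead of replacing `df`. The sample rows are shortened and the second one is made up for illustration; only the `category` column name and the `name`/`slug` keys come from the question.

```python
import json

import pandas as pd

# Two sample rows shaped like the one in the question (the second is made up).
rows = [
    '{"id":342,"name":"Web","slug":"technology/web","position":15}',
    '{"id":1,"name":"Art","slug":"art","position":1}',
]
df = pd.DataFrame({"category": rows})

# Parse each JSON string once, then build both columns from the parsed dicts.
parsed = [json.loads(s) for s in df["category"]]
df["cat"] = [d["name"] for d in parsed]
df["slug"] = [d["slug"] for d in parsed]

print(df[["cat", "slug"]])
```

Assigning whole lists as columns avoids the per-row writes into the DataFrame that the original loop performs.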

Answer 1 (score: 1)

You can do this very efficiently with extract and a regular expression:

df['cat'] = df['category'].str.extract('"name":"([^"]+)"')
df['slug'] = df['category'].str.extract('"slug":"([^"]+)"')

df
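
For a self-contained check of the two `extract` calls, here is a small sketch on made-up sample rows. Passing `expand=False` is an addition that is not part of the answer; with a single capture group it makes `str.extract` return a Series rather than a one-column DataFrame. The second row is hypothetical and shows what happens when a key is missing.

```python
import pandas as pd

rows = [
    '{"id":342,"name":"Web","slug":"technology/web","position":15}',
    '{"id":7,"position":3}',  # hypothetical row with no name or slug
]
df = pd.DataFrame({"category": rows})

# Each pattern captures the text between the quotes that follow the key.
df["cat"] = df["category"].str.extract(r'"name":"([^"]+)"', expand=False)
df["slug"] = df["category"].str.extract(r'"slug":"([^"]+)"', expand=False)

# Rows where the pattern does not match get NaN in the new column.
print(df[["cat", "slug"]])
```

Note that `extract` returns the first match in each string, so this approach assumes the `"name"` and `"slug"` keys do not also appear earlier inside a nested object.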

The question is about speed, so here is a performance comparison (tested on a 100,000-row sample; see the note below):

%%timeit

df['cat'] = df['category'].str.extract('"name":"([^"]+)"')
df['slug'] = df['category'].str.extract('"slug":"([^"]+)"')

309 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit

cat, slug = [], []
for row in df.category:
    d = json.loads(row)
    cat.append(d['name'])
    slug.append(d['slug'])

df1 = pd.DataFrame({'cat': cat, 'slug': slug})

574 ms ± 6.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit

df1 = pd.DataFrame([ast.literal_eval(x) for x in df['category']],
                   index=df.index)[['name','slug']]

5.1 s ± 29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note: the sample was generated as follows:

x = '{"id":342,"name":"Web","slug":"technology/web","position":15,"parent_id":16,"color":6526716,"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/technology/web"}}}'
df = pd.DataFrame({'category': [x]*100000})
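
The `%%timeit` cells above only run inside IPython or Jupyter. As a rough sketch, a similar comparison can be scripted with the standard `timeit` module. The function names `with_extract` and `with_json` are made up for this example, the sample is cut to 10,000 rows to keep the run short, and the timings will differ from the figures quoted above depending on machine and library versions.

```python
import json
import timeit

import pandas as pd

x = '{"id":342,"name":"Web","slug":"technology/web","position":15,"parent_id":16,"color":6526716,"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/technology/web"}}}'
df = pd.DataFrame({"category": [x] * 10_000})  # smaller than the 100,000 rows above


def with_extract(frame):
    # Regex approach: one vectorized extract call per field.
    cat = frame["category"].str.extract(r'"name":"([^"]+)"', expand=False)
    slug = frame["category"].str.extract(r'"slug":"([^"]+)"', expand=False)
    return pd.DataFrame({"cat": cat, "slug": slug})


def with_json(frame):
    # JSON approach: parse every string, then pick the two keys.
    parsed = [json.loads(s) for s in frame["category"]]
    return pd.DataFrame(
        {"cat": [d["name"] for d in parsed], "slug": [d["slug"] for d in parsed]},
        index=frame.index,
    )


for fn in (with_extract, with_json):
    # Average over three runs; report milliseconds per run.
    seconds = timeit.timeit(lambda: fn(df), number=3) / 3
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms per run")
```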