Converting a pandas Series of variable-length comma-separated values to a DataFrame

Date: 2016-01-29 15:07:13

Tags: python pandas

I have a pandas Series 'A' that contains comma-separated values, like this:

index    A

1        null
2        5,6
3        3
4        null
5        5,18,22
...      ...

I need a DataFrame like this:

index    A_5    A_6    A_18    A_20

1        0      0      0       ...
2        1      1      0       ...
3        0      0      0       ...
4        0      0      0       ...
5        1      0      1       ...
...      ...    ...    ...     ...

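Values that do not occur at least MIN_OBS times should be ignored and should not get their own column. There are so many distinct values that the df would become too large without this threshold.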

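I came up with the following solution. It works, but it is far too slow (because it iterates over the rows, I assume). Can anyone suggest a faster approach?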

temp_dict = defaultdict(int)
for k, v in A.iteritems():
    temp_list = v.split(',')
    for item in temp_list:
        temp_dict[item] += 1

cols_to_make = []
for k, v in temp_dict.iteritems():
    if v > MIN_OBS:
        cols_to_make.append('A_' + k)

result_df = pd.DataFrame(0, index = A.index, columns = cols_to_make)
for k, v in A.iteritems():
    temp_list = v.split(',')
    for item in temp_list:
        if ('A_' + item) in cols_to_make:
            result_df['A_' + item][k] = 1

3 Answers:

Answer 0 (score: 3)

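You can use get_dummies to create indicator variables, then convert the column labels to numbers with to_numeric, and finally filter the columns against the variable TRESH using ix: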

print df
             A
index         
1         null
2          5,6
3            3
4         null
5      5,18,22

df = df.A.str.get_dummies(sep=",")
print df
       18  22  3  5  6  null
index                       
1       0   0  0  0  0     1
2       0   0  0  1  1     0
3       0   0  1  0  0     0
4       0   0  0  0  0     1
5       1   1  0  1  0     0

df.columns = pd.to_numeric(df.columns, errors='coerce')
df = df.sort_index(axis=1)

TRESH = 5
cols = [col for col in df.columns if col > TRESH]
print cols
[6.0, 18.0, 22.0]
df = df.ix[:, cols]
print df
       6   18  22
index            
1       0   0   0
2       1   0   0
3       0   0   0
4       0   0   0
5       0   1   1

df.columns = ["A_" + str(int(col)) for col in df.columns]
print df
       A_6  A_18  A_22
index                 
1        0     0     0
2        1     0     0
3        0     0     0
4        0     0     0
5        0     1     1
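The question asks for a threshold on how often a value occurs, not on the value itself. As a rough sketch of how the same get_dummies idea could apply a frequency threshold, the snippet below sums each indicator column and keeps those seen in at least MIN_OBS rows. It assumes Python 3 and a reasonably recent pandas; the sample data and the MIN_OBS value are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data mirroring the question; MIN_OBS is an assumed threshold.
A = pd.Series(['null', '5,6', '3', 'null', '5,18,22'],
              index=[1, 2, 3, 4, 5], name='A')
MIN_OBS = 2

# One 0/1 indicator column per distinct token (including the literal 'null').
dummies = A.str.get_dummies(sep=',')

# Drop the placeholder column, then count in how many rows each token appears.
dummies = dummies.drop(columns='null', errors='ignore')
counts = dummies.sum(axis=0)

# Keep only tokens seen in at least MIN_OBS rows and add the 'A_' prefix.
result = dummies.loc[:, counts >= MIN_OBS].add_prefix('A_')
print(result)
```

Note that this still builds the full indicator matrix before filtering, so it may not suit data with a very large number of distinct values.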

EDIT:

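I tried modifying unutbu's excellent original answer: I changed how the stacked Series is created, removed the entry whose index is null from the count Series, and added the prefix parameter to get_dummies: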

import numpy as np
import pandas as pd

s = pd.Series(['null', '5,6', '3', 'null', '5,18,22', '3,4'])
print s

#result = s.str.split(',').apply(pd.Series).stack()
#replaced with:
result = pd.DataFrame([ x.split(',') for x in s ]).stack()
count = pd.value_counts(result)

min_obs = 2

#also drop the entry of count whose index label is 'null'
count = count[(count >= min_obs) & ~(count.index.isin(['null'])) ]

result = result.loc[result.isin(count.index)]
#pass the prefix parameter to get_dummies
result = pd.get_dummies(result, prefix="A")

result.index = result.index.droplevel(1)
result = result.reindex(s.index)

print(result)
   A_3  A_5
0  NaN  NaN
1    0    1
2    1    0
3  NaN  NaN
4    0    1
5    1    0
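In the output above, rows with no retained value come out as NaN. Also, as far as I can tell, a row that contains two retained values (say '3,5') would leave duplicate index labels after droplevel, and reindex raises an error on those. The following is only a sketch of one way to handle both cases, assuming Python 3 and pandas 0.25 or later (for Series.explode); the sample Series is invented and deliberately ends with '3,5'.

```python
import pandas as pd

# Hypothetical sample; the last row contains two values that both pass the threshold.
s = pd.Series(['null', '5,6', '3', 'null', '5,18,22', '3,5'])
min_obs = 2

# One token per row; the original index label is repeated for multi-value rows.
tokens = s.str.split(',').explode()

# Tokens that occur at least min_obs times, excluding the 'null' placeholder.
counts = tokens.value_counts()
keep = counts[(counts >= min_obs) & (counts.index != 'null')].index

kept = tokens[tokens.isin(keep)]
result = (pd.get_dummies(kept, prefix='A', dtype=int)
            .groupby(level=0).max()            # merge repeated index labels into one row
            .reindex(s.index, fill_value=0))   # rows without retained tokens become 0
print(result)
```

Taking the maximum per index label collapses the exploded rows back to one row per original entry, and fill_value=0 keeps the columns as integers instead of introducing NaN.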

Timings:

In [143]: %timeit pd.DataFrame([ x.split(',') for x in s ]).stack()
1000 loops, best of 3: 866 µs per loop

In [144]: %timeit s.str.split(',').apply(pd.Series).stack()
100 loops, best of 3: 2.46 ms per loop

Answer 1 (score: 2)

Since memory is a concern, we have to be careful not to build large intermediate data structures, if possible.

Let's start with the code posted by the OP:

def orig(A, MIN_OBS):
    temp_dict = collections.defaultdict(int)
    for k, v in A.iteritems():
        temp_list = v.split(',')
        for item in temp_list:
            temp_dict[item] += 1
    cols_to_make = []
    for k, v in temp_dict.iteritems():
        if v > MIN_OBS:
            cols_to_make.append('A_' + k)

    result_df = pd.DataFrame(0, index=A.index, columns=cols_to_make)
    for k, v in A.iteritems():
        temp_list = v.split(',')
        for item in temp_list:
            if ('A_' + item) in cols_to_make:
                result_df['A_' + item][k] = 1
    return result_df

and extract the first loop into its own function:

def count(A, MIN_OBS):
    temp_dict = collections.Counter()
    for k, v in A.iteritems():
        temp_list = v.split(',')
        for item in temp_list:
            temp_dict[item] += 1
    temp_dict = {k:v for k, v in temp_dict.items() if v > MIN_OBS}
    return temp_dict

From experiments in an interactive session, we can see that this is not the bottleneck; even for a "large" DataFrame, count(A, MIN_OBS) finishes fairly quickly.

The slowness of orig comes from the double for-loop at the end of orig, which modifies the DataFrame one cell at a time (e.g. result_df['A_' + item][k] = 1).

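We can replace the double for-loop with a single for-loop over the columns of the DataFrame, using the vectorized string method A.str.contains to search for each value in the strings. Since we never split the original strings into Python lists of strings (or Pandas DataFrames containing fragments of strings), we save some memory. Since orig and alt use similar data structures, their memory footprints are about the same.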

def alt(A, MIN_OBS):
    temp_dict = count(A, MIN_OBS)
    df = pd.DataFrame(0, index=A.index, columns=temp_dict)
    for col in df:
        # require a comma or string boundary on both sides so that 5 does not match 56
        df[col] = A.str.contains(r'(?:^|,){v}(?:,|$)'.format(v=col)).astype(int)
    df.columns = ['A_{}'.format(col) for col in df]
    return df
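To see how token boundaries affect this kind of search, here is a small sketch on invented data. The helper name contains_token is hypothetical; its pattern requires a comma or a string boundary on both sides of the value, so that searching for 5 does not also hit 56 or 15. It assumes Python 3.

```python
import re
import pandas as pd

# Hypothetical data chosen so that '5' appears both as a token and as a substring.
A = pd.Series(['5,6', '56,7', '15', '1,5', 'null'])

def contains_token(series, token):
    """Return 1 where `token` occurs as a whole comma-separated item, else 0."""
    # (?:^|,) and (?:,|$) anchor the token at a comma or at either end of the string.
    pattern = r'(?:^|,){}(?:,|$)'.format(re.escape(str(token)))
    return series.str.contains(pattern).astype(int)

print(contains_token(A, 5).tolist())  # [1, 0, 0, 1, 0]
```

The non-capturing groups are used so that str.contains does not warn about match groups in the pattern.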

Here is an example, on a 200K-row DataFrame with 40K different possible values:

import numpy as np
import pandas as pd
import collections

np.random.seed(2016)
ncols = 5
nrows = 200000
nvals = 40000
MIN_OBS = 200

# nrows = 20
# nvals = 4
# MIN_OBS = 2

idx = np.random.randint(ncols, size=nrows).cumsum()
data = np.random.choice(np.arange(nvals), size=idx[-1])
data = np.array_split(data, idx[:-1])
data = map(','.join, [map(str, arr) for arr in data])
A = pd.Series(data)
A.loc[A == ''] = 'null'

def orig(A, MIN_OBS):
    temp_dict = collections.defaultdict(int)
    for k, v in A.iteritems():
        temp_list = v.split(',')
        for item in temp_list:
            temp_dict[item] += 1
    cols_to_make = []
    for k, v in temp_dict.iteritems():
        if v > MIN_OBS:
            cols_to_make.append('A_' + k)

    result_df = pd.DataFrame(0, index=A.index, columns=cols_to_make)
    for k, v in A.iteritems():
        temp_list = v.split(',')
        for item in temp_list:
            if ('A_' + item) in cols_to_make:
                result_df['A_' + item][k] = 1
    return result_df

def count(A, MIN_OBS):
    temp_dict = collections.Counter()
    for k, v in A.iteritems():
        temp_list = v.split(',')
        for item in temp_list:
            temp_dict[item] += 1
    temp_dict = {k:v for k, v in temp_dict.items() if v > MIN_OBS}
    return temp_dict

def alt(A, MIN_OBS):
    temp_dict = count(A, MIN_OBS)
    df = pd.DataFrame(0, index=A.index, columns=temp_dict)
    for col in df:
        # require a comma or string boundary on both sides so that 5 does not match 56
        df[col] = A.str.contains(r'(?:^|,){v}(?:,|$)'.format(v=col)).astype(int)
    df.columns = ['A_{}'.format(col) for col in df]
    return df

Here is a benchmark:

In [48]: %timeit expected = orig(A, MIN_OBS)
1 loops, best of 3: 3.03 s per loop

In [49]: %timeit expected = alt(A, MIN_OBS)
1 loops, best of 3: 483 ms per loop

Note that most of the time alt takes to complete is spent in count:

In [60]: %timeit count(A, MIN_OBS)
1 loops, best of 3: 304 ms per loop
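Since counting dominates, one thing that might be worth trying is feeding all tokens to collections.Counter in a single pass instead of incrementing one key at a time. This is only a sketch with a hypothetical function name (count_tokens) and invented data, written for Python 3; whether it is actually faster on your data would need to be measured.

```python
import collections
import itertools
import pandas as pd

def count_tokens(A, min_obs):
    """Count comma-separated tokens in A; keep those seen more than min_obs times."""
    # chain.from_iterable flattens the per-row token lists lazily,
    # so no combined intermediate list is built.
    counter = collections.Counter(
        itertools.chain.from_iterable(v.split(',') for v in A))
    # Same strict '>' comparison as the count function above.
    return {k: n for k, n in counter.items() if n > min_obs}

# Hypothetical usage on a tiny Series.
A = pd.Series(['5,6', '5', '5,7', '6'])
print(count_tokens(A, 1))
```

The returned dict has the same shape as the one produced by count, so it could be passed to alt in the same way.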

Answer 2 (score: 0)

Would something like this work, or could it be modified to fit your needs?

df = pd.DataFrame({'A': ['null', '5,6', '3', 'null', '5,18,22']}, columns=['A'])

         A
0     null
1      5,6
2        3
3     null
4  5,18,22

Then use get_dummies():

pd.get_dummies(df['A'].str.split(',').apply(pd.Series), prefix=df.columns[0])

Result:

       A_3  A_5  A_null  A_18  A_6  A_22
index                                   
1        0    0       1     0    0     0
2        0    1       0     0    1     0
3        1    0       0     0    0     0
4        0    0       1     0    0     0
5        0    1       0     1    0     1