Add multiple boolean columns to a DataFrame based on parsed text - python

Time: 2017-01-08 14:34:32

Tags: python performance pandas dataframe

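I am trying to build more features in a DataFrame by parsing a selected column on a "splitter", adding each resulting substring as a column header, and then marking each row as "True" (or not) in each new column depending on whether that substring is found in the row's split text.

My problem is that the code takes too long to run, and I would appreciate any input on more efficient options.

The DataFrame I am working with has about 12,700 rows and about 3,500 columns.

Here is the code: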

def expand_df_col(df, col_name, splitter):

     series = set(df[col_name].dropna())

     new_columns = set()

     for values in series:
         new_columns = new_columns.union(set(values.split(splitter)))

     df = pd.concat([df,pd.DataFrame(columns=new_columns)], axis=1)

     for row in range(len(df)):
         for text in str(df.loc[row, col_name]).split(splitter):
             if text != "Not applicable":
                 df.loc[row, text] = True

     return df

For example:

                      Test 1              Test 2  
0             Will this work  Is this even legit  
1         Maybe it will work                nope  
2  It probably will not work                nope

should become:

                      Test 1              Test 2   not    It    it  will  \
0             Will this work  Is this even legit   NaN   NaN   NaN   NaN   
1         Maybe it will work                nope   NaN   NaN  True  True   
2  It probably will not work                nope  True  True   NaN  True   

    Maybe  Will  this  work probably  
0   NaN  True  True  True      NaN  
1  True   NaN   NaN  True      NaN  
2   NaN   NaN   NaN  True     True 
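For reference, the transformation described above (split one column on a splitter, then flag each row per token) can be sketched without an explicit row loop. This is a minimal sketch and not the asker's code: it assumes pandas' `Series.str.get_dummies`, the helper name `expand_with_get_dummies` is hypothetical, and it produces `False` rather than `NaN` where a token is absent.

```python
import pandas as pd

# Sample frame copied from the example above.
df = pd.DataFrame({
    "Test 1": ["Will this work", "Maybe it will work", "It probably will not work"],
    "Test 2": ["Is this even legit", "nope", "nope"],
})

def expand_with_get_dummies(df, col_name, splitter):
    """Hypothetical helper: one boolean column per token found in df[col_name]."""
    # str.get_dummies splits each string on `splitter` and returns 0/1 indicators,
    # one column per distinct token; missing values give an all-zero row.
    flags = df[col_name].str.get_dummies(sep=splitter).astype(bool)
    # Assumption: no token equals an existing column name (join raises on overlap).
    return df.join(flags)

result = expand_with_get_dummies(df, "Test 1", " ")
print(result)
```

Because the indicators are built per row of the original frame, the row order and the other columns are left untouched.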

The reply from @Ted Petrou almost gets me there, but not quite:

def expand_df_col_test(df, col_name, splitter):
    df_split = pd.concat((df[col_name], df[col_name].str.split(splitter, expand=True)), axis=1)

    df_melt = pd.melt(df_split, id_vars=col_name, var_name='count')

    df_temp = pd.pivot_table(df_melt, index=col_name, columns='value', values='count', aggfunc=lambda x: True, fill_value=False)

    df_temp = df_temp.reindex(df.index)

    return df_temp

It returns the test df as:

value                         It  Maybe   Will     it    not probably   this  \
Test 1                                                                         
Will this work             False  False   True  False  False    False   True   
Maybe it will work         False   True  False   True  False    False  False   
It probably will not work   True  False  False  False   True     True  False   

value                       will  work  
Test 1                                  
Will this work             False  True  
Maybe it will work          True  True  
It probably will not work   True  True

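As a follow-up, I made an edit. The function works on the simple example, but on my real data it returns only the original column I wanted to parse and expand (when the code after pd.pivot_table() is included), and it returns an empty DataFrame if only the pd.pivot_table() part is run.

I can't for the life of me figure it out (I spent a whole day tinkering and reading up on the various functions involved).

Again, I have ~12K rows and 1-3K columns; I am not sure whether that affects the output.

Current function: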

def expand_df_col_test(df, col_name, splitter, reindex_col):

    import numpy as np

    replacements = list(pd.Series(df.columns).astype(str) + "_" + col_name)

    df_split = pd.concat((df, df[col_name].astype(str).replace(list(df.columns), replacements, regex=True).str.split(splitter, expand=True)), axis=1)

    df_melt = pd.melt(df_split, id_vars=list(df.columns), var_name='count')

    df_pivot = pd.pivot_table(df_melt, 
                 index=list(df.columns), 
                 columns=df_melt['value'], 
                 values=df_melt['count'], 
                 aggfunc=lambda x: True, 
                 fill_value= np.nan).reset_index(reindex_col).reindex(df[col_name]).reset_index()

    df_pivot.columns.name = ''

    return df_pivot

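I thought I had found the solution, but it does not reindex correctly.

The function now works on a subset, but I keep getting ValueError: cannot reindex from a duplicate axis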

def expand_df_col_test(df, col_name, splitter, reindex_col):

    import numpy as np

    sub_df = pd.concat([df[col_name],df[reindex_col]], axis=1)

    replacements = list(pd.Series(df.columns).astype(str) + "_" + col_name)

    df_split = pd.concat((sub_df, sub_df[col_name].astype(str).replace(list(df.columns), replacements, regex=True).str.split(splitter, expand=True)), axis=1)

    df_split = pd.concat((sub_df, sub_df[col_name].astype(str).str.split(splitter, expand=True)), axis=1)

    df_melt = pd.melt(df_split, id_vars=list(sub_df.columns), var_name='count')

    df_pivot = pd.pivot_table(df_melt, 
                     index=list(sub_df.columns), 
                     columns='value', 
                     values='count', 
                     aggfunc=lambda x: True, 
                     fill_value= np.nan)

    print("pivot")
    print(df_pivot)
    print("NEXT RESET INDEX WITH REINDEX COL")
    print(df_pivot.reset_index(reindex_col))
    print("NEXT REINDEX")
    print(df_pivot.reset_index(reindex_col).reindex(df[col_name]))
    print("NEXT RESET INDEX()")
    print(df_pivot.reset_index(reindex_col).reindex(df[col_name]).reset_index())


    df_pivot = df_pivot.reset_index(reindex_col).reindex(df[col_name]).reset_index()

    df_pivot.columns.name = ''

    df_final = pd.concat([df,df_pivot.drop([col_name, reindex_col], axis=1)], axis = 1)

    return df_final
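One likely trigger for the ValueError mentioned above is that the pivoted frame's index contains repeated labels (for example, the same text appearing in more than one row), which pandas refuses to reindex. Below is a minimal sketch of that failure on made-up data, together with a hypothetical guard. The names `pivoted`, `deduplicated` and `aligned` are illustrative and do not come from the original post.

```python
import pandas as pd

# Two rows share the label "nope", so the index is not unique.
pivoted = pd.DataFrame({"flag": [True, True]}, index=["nope", "nope"])

try:
    pivoted.reindex(["nope", "something else"])
    error_message = None
except ValueError as exc:
    # The exact wording differs between pandas versions.
    error_message = str(exc)

print(error_message)

# Hypothetical guard: check uniqueness first, then keep one row per label.
has_duplicates = not pivoted.index.is_unique
deduplicated = pivoted[~pivoted.index.duplicated(keep="first")]
aligned = deduplicated.reindex(["nope", "something else"])
print(aligned)
```

Dropping repeats is only safe when the duplicated rows carry identical flags; otherwise it is usually better to key the reshaped data on something unique, such as the original row index.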

3 Answers:

Answer 0 (score: 1)

Updated answer #2

df_list = [df]
for col_name in df.columns:
    splitter = ' '
    df_split = pd.concat((df[col_name], df[col_name].str.split(splitter, expand=True)), axis=1)
    df_melt = pd.melt(df_split, id_vars=[col_name], var_name='count')
    df_list.append(pd.pivot_table(df_melt, 
                         index=[col_name], 
                         columns='value', 
                         values='count', 
                         aggfunc=lambda x: True, 
                         fill_value=np.nan).reindex(df[col_name]).reset_index(drop=True))
df_final = pd.concat(df_list, axis=1)

                      Test 1              Test 2    It Maybe  Will    it  \
0             Will this work  Is this even legit   NaN   NaN  True   NaN   
1         Maybe it will work                nope   NaN  True   NaN  True   
2  It probably will not work                nope  True   NaN   NaN   NaN   

    not probably  this  will  work    Is  even legit  nope  this  
0   NaN      NaN  True   NaN  True  True  True  True   NaN  True  
1   NaN      NaN   NaN  True  True   NaN   NaN   NaN  True   NaN  
2  True     True   NaN  True  True   NaN   NaN   NaN  True   NaN 

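Updated answer

The only difference between this answer and the previous one is that you want to keep the other column, Test 2. The following will accomplish this: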

splitter = ' '
df_split = pd.concat((df, df['Test 1'].str.split(splitter, expand=True)), axis=1)
df_melt = pd.melt(df_split, id_vars=['Test 1', 'Test 2'], var_name='count')
df_pivot = pd.pivot_table(df_melt, 
                     index=['Test 1', 'Test 2'], 
                     columns='value', 
                     values='count', 
                     aggfunc=lambda x: True, 
                     fill_value=np.nan)\
             .reset_index('Test 2')\
             .reindex(df['Test 1'])\
             .reset_index()

df_pivot.columns.name = ''

                      Test 1              Test 2    It Maybe  Will    it  \
0             Will this work  Is this even legit   NaN   NaN  True   NaN   
1         Maybe it will work                nope   NaN  True   NaN  True   
2  It probably will not work                nope  True   NaN   NaN   NaN   

    not probably  this  will  work  
0   NaN      NaN  True   NaN  True  
1   NaN      NaN   NaN  True  True  
2  True     True   NaN  True  True 

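Old answer

You need to provide a sample DataFrame with sample results to get better, faster answers. This is a shot in the dark. I will first create a sample DataFrame with some fake data and then attempt a solution.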

# create fake data
df = pd.DataFrame({'col1':['here is some text', 'some more text', 'finally some different text']})

Output of df

                          col1
0            here is some text
1               some more text
2  finally some different text

Split each value in col1 on the splitter (here just a single space)

col_name = 'col1'
splitter = ' '
df_split = pd.concat((df[col_name], df[col_name].str.split(splitter, expand=True)), axis=1)

Output of df_split
                          col1        0     1          2     3
0            here is some text     here    is       some  text
1               some more text     some  more       text  None
2  finally some different text  finally  some  different  text

Put all of the splits into a single column

df_melt = pd.melt(df_split, id_vars='col1', var_name='count')

Output of df_melt
                           col1 count      value
0             here is some text     0       here
1                some more text     0       some
2   finally some different text     0    finally
3             here is some text     1         is
4                some more text     1       more
5   finally some different text     1       some
6             here is some text     2       some
7                some more text     2       text
8   finally some different text     2  different
9             here is some text     3       text
10               some more text     3       None
11  finally some different text     3       text

Finally, pivot the DataFrame above so that the split words become the columns

pd.pivot_table(df_melt, index='col1', columns='value', values='count', aggfunc=lambda x: True, fill_value=False)

Output

value                       different finally   here     is   more  some  text
col1                                                                          
finally some different text      True    True  False  False  False  True  True
here is some text               False   False   True   True  False  True  True
some more text                  False   False  False  False   True  True  True

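Answer 1 (score: 0)

Finally got it working: I ran the same approach on just the column of interest and concatenated the result back onto the original DataFrame.

It is 10,000 times faster, thank you very much!

Here is the final working solution: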

def expand_df_col_test(df, col_name, splitter):

    import numpy as np

    sub_df = pd.concat([df[col_name],pd.Series(list(df.index))], axis=1).rename(columns={col_name : col_name, 0:'index'})

    replacements = list(pd.Series(df.columns).astype(str) + "_" + col_name)

    df_split = pd.concat((sub_df, sub_df[col_name].astype(str).replace(list(df.columns), replacements, regex=True).str.split(splitter, expand=True)), axis=1)

    df_melt = pd.melt(df_split, id_vars=list(sub_df.columns), var_name='count')

    df_pivot = pd.pivot_table(df_melt, 
                 index=list(sub_df.columns), 
                 columns='value', 
                 values='count', 
                 aggfunc=lambda x: True, 
                 fill_value= np.nan).reset_index('index').sort_values('index').reset_index().drop([col_name, 'index'], axis=1)

    df_pivot.columns.name = ''

    df_final = pd.concat([df, df_pivot], axis = 1)

    return df_final
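The key idea in this solution is to key the reshaped data on the original row index rather than on the text, and to rename tokens that collide with existing column names. The same idea can be sketched with `explode` instead of melt and pivot. This is a hedged alternative, not the code above: the helper name `expand_by_row_index` and the sample data are hypothetical, and absent tokens come out as `False` rather than `NaN`.

```python
import pandas as pd

def expand_by_row_index(df, col_name, splitter):
    """Hypothetical helper: boolean token columns keyed on the original row index."""
    # One row per token; explode repeats the original index label for each token.
    tokens = df[col_name].str.split(splitter).explode().dropna()
    # Indicator per token, then collapse back to one row per original index label.
    flags = pd.get_dummies(tokens, dtype=int).groupby(level=0).sum().gt(0)
    # Rows with missing text have no tokens; give them all-False flags.
    flags = flags.reindex(df.index, fill_value=False)
    # Suffix any token that collides with an existing column name.
    flags.columns = [
        f"{tok}_{col_name}" if tok in df.columns else tok for tok in flags.columns
    ]
    return pd.concat([df, flags], axis=1)

# Made-up data: repeated tokens, a missing value, and a token ("nope")
# that is also the name of an existing column.
df = pd.DataFrame({
    "text": ["nope nope", None, "maybe nope"],
    "nope": [1, 2, 3],
})
result = expand_by_row_index(df, "text", " ")
print(result)
```

Since nothing is reindexed by text, repeated text values in the parsed column do not matter here.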

Answer 2 (score: 0)

I would use CountVectorizer here

In [103]: df
Out[103]:
                       Test1               Test2
0             Will this work  Is this even legit
1         Maybe it will work                nope
2  It probably will not work                nope

In [104]: from sklearn.feature_extraction.text import CountVectorizer
     ...: vectorizer = CountVectorizer(min_df=1, lowercase=False)
     ...: X = vectorizer.fit_transform(df.Test1.fillna(''))
     ...:

In [105]: r = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())

In [106]: r
Out[106]:
   It  Maybe  Will  it  not  probably  this  will  work
0   0      0     1   0    0         0     1     0     1
1   0      1     0   1    0         0     0     1     1
2   1      0     0   0    1         1     0     1     1

In [107]: df.join(r)
Out[107]:
                       Test1               Test2  It  Maybe  Will  it  not  probably  this  will  work
0             Will this work  Is this even legit   0      0     1   0    0         0     1     0     1
1         Maybe it will work                nope   0      1     0   1    0         0     0     1     1
2  It probably will not work                nope   1      0     0   0    1         1     0     1     1

Or the standard way, using the default lowercase=True (all words are converted to lowercase first):

In [111]: X = vectorizer.fit_transform(df.Test1.fillna(''))

In [112]: r = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())

In [113]: r
Out[113]:
   it  maybe  not  probably  this  will  work
0   0      0    0         0     1     1     1
1   1      1    0         0     0     1     1
2   1      0    1         1     0     1     1

In [114]: df.join(r)
Out[114]:
                       Test1               Test2  it  maybe  not  probably  this  will  work
0             Will this work  Is this even legit   0      0    0         0     1     1     1
1         Maybe it will work                nope   1      1    0         0     0     1     1
2  It probably will not work                nope   1      0    1         1     0     1     1