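I am trying to build more features in a dataframe by parsing a selected column on a "splitter", adding each resulting substring as a column header, and then marking each row as True (or leaving it empty) in each new column depending on whether that substring occurs in the row's original text.
My problem is that the code takes far too long to run, and I would appreciate input on any more efficient options.
The dataframe I am working with has about 12,700 rows and about 3,500 columns.
Here is the code: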
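import pandas as pd

def expand_df_col(df, col_name, splitter):
    # Collect every distinct substring found in the column
    series = set(df[col_name].dropna())
    new_columns = set()
    for values in series:
        new_columns = new_columns.union(set(values.split(splitter)))
    # Append one empty column per substring (a sorted list, since newer pandas versions do not accept a set here)
    df = pd.concat([df, pd.DataFrame(columns=sorted(new_columns))], axis=1)
    # Row-by-row pass over the frame: this nested loop is the slow part
    for row in range(len(df)):
        for text in str(df.loc[row, col_name]).split(splitter):
            if text != "Not applicable":
                df.loc[row, text] = True
    return df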
For example:
                      Test 1              Test 2
0             Will this work  Is this even legit
1         Maybe it will work                nope
2  It probably will not work                nope
should become:
                      Test 1              Test 2   not    It    it  will  \
0             Will this work  Is this even legit   NaN   NaN   NaN   NaN
1         Maybe it will work                nope   NaN   NaN  True  True
2  It probably will not work                nope  True  True   NaN  True

   Maybe  Will  this  work  probably
0    NaN  True  True  True       NaN
1   True   NaN   NaN  True       NaN
2    NaN   NaN   NaN  True      True
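To make the example reproducible, the input above can be built as follows. This is only a minimal sketch: the column names come from the example, while `tokens` is a name introduced here purely to list the substrings that are expected to become new columns.

```python
import pandas as pd

# Example input from the question (column names as shown there).
df = pd.DataFrame({
    "Test 1": ["Will this work", "Maybe it will work", "It probably will not work"],
    "Test 2": ["Is this even legit", "nope", "nope"],
})

# Every distinct substring of "Test 1" when split on a single space.
# `tokens` is a hypothetical helper name, not part of the original code.
tokens = sorted({word for text in df["Test 1"].dropna() for word in text.split(" ")})
print(tokens)
```

Note that the split is case-sensitive, so "It" and "it" (and "Will" and "will") are separate columns.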
The reply from @Ted Petrou got me almost there, but not quite:
def expand_df_col_test(df, col_name, splitter):
    df_split = pd.concat((df[col_name], df[col_name].str.split(splitter, expand=True)), axis=1)
    df_melt = pd.melt(df_split, id_vars=col_name, var_name='count')
    df_temp = pd.pivot_table(df_melt, index=col_name, columns='value', values='count', aggfunc=lambda x: True, fill_value=False)
    df_temp = df_temp.reindex(df.index)
    return df_temp
which returns the test df as:
value                         It  Maybe   Will     it    not  probably   this  \
Test 1
Will this work             False  False   True  False  False     False   True
Maybe it will work         False   True  False   True  False     False  False
It probably will not work   True  False  False  False   True      True  False

value                       will   work
Test 1
Will this work             False   True
Maybe it will work          True   True
It probably will not work   True   True
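As a follow-up, I made an edit. The function works for the simple example, but on my real data it returns only the original column I wanted to parse and expand (when the code after pd.pivot_table() is included), and it returns an empty dataframe when only the pd.pivot_table() part is run.
I cannot for the life of me figure this out (I spent the whole day tinkering and reading up on the various functions involved).
Again, I have ~12K rows and 1-3K columns; I am not sure whether that affects the output.
Current function: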
def expand_df_col_test(df, col_name, splitter, reindex_col):
    import numpy as np
    replacements = list(pd.Series(df.columns).astype(str) + "_" + col_name)
    df_split = pd.concat((df, df[col_name].astype(str).replace(list(df.columns), replacements, regex=True).str.split(splitter, expand=True)), axis=1)
    df_melt = pd.melt(df_split, id_vars=list(df.columns), var_name='count')
    df_pivot = pd.pivot_table(df_melt,
                              index=list(df.columns),
                              columns=df_melt['value'],
                              values=df_melt['count'],
                              aggfunc=lambda x: True,
                              fill_value=np.nan).reset_index(reindex_col).reindex(df[col_name]).reset_index()
    df_pivot.columns.name = ''
    return df_pivot
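I thought I had found a solution, but it was not reindexing correctly.
The function below now works on a subset, but I keep getting a ValueError: cannot reindex from a duplicate axis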
def expand_df_col_test(df, col_name, splitter, reindex_col):
    import numpy as np
    sub_df = pd.concat([df[col_name], df[reindex_col]], axis=1)
    replacements = list(pd.Series(df.columns).astype(str) + "_" + col_name)
    df_split = pd.concat((sub_df, sub_df[col_name].astype(str).replace(list(df.columns), replacements, regex=True).str.split(splitter, expand=True)), axis=1)
    df_melt = pd.melt(df_split, id_vars=list(sub_df.columns), var_name='count')
    df_pivot = pd.pivot_table(df_melt,
                              index=list(sub_df.columns),
                              columns='value',
                              values='count',
                              aggfunc=lambda x: True,
                              fill_value=np.nan)
    print("pivot")
    print(df_pivot)
    print("NEXT RESET INDEX WITH REINDEX COL")
    print(df_pivot.reset_index(reindex_col))
    print("NEXT REINDEX")
    print(df_pivot.reset_index(reindex_col).reindex(df[col_name]))
    print("NEXT RESET INDEX()")
    print(df_pivot.reset_index(reindex_col).reindex(df[col_name]).reset_index())
    df_pivot = df_pivot.reset_index(reindex_col).reindex(df[col_name]).reset_index()
    df_pivot.columns.name = ''
    df_final = pd.concat([df, df_pivot.drop([col_name, reindex_col], axis=1)], axis=1)
    return df_final
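The ValueError mentioned above can be reproduced in isolation. The following is a minimal sketch under an assumption the post does not confirm: that the text column contains repeated values, so the pivot's index is no longer unique once reset_index(reindex_col) has been applied. The frame and its labels are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the pivot after reset_index(reindex_col):
# the index holds the text values, and one text appears twice.
pivot = pd.DataFrame(
    {"work": [True, True, True]},
    index=pd.Index(["same text", "same text", "other text"], name="Test 1"),
)

print(pivot.index.is_unique)  # the index has a repeated label

try:
    # reindex needs unique source labels to know which row to pick
    pivot.reindex(["other text", "same text"])
    raised = False
except ValueError as exc:
    raised = True
    print("reindex failed:", exc)
```

If that is the cause, keying the pivot on something unique per row (such as the row position) instead of the text should avoid the error.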
Answer 0 (score: 1)
import numpy as np
import pandas as pd

# Expand every column of df in turn and concatenate the results
df_list = [df]
for col_name in df.columns:
    splitter = ' '
    df_split = pd.concat((df[col_name], df[col_name].str.split(splitter, expand=True)), axis=1)
    df_melt = pd.melt(df_split, id_vars=[col_name], var_name='count')
    df_list.append(pd.pivot_table(df_melt,
                                  index=[col_name],
                                  columns='value',
                                  values='count',
                                  aggfunc=lambda x: True,
                                  fill_value=np.nan).reindex(df[col_name]).reset_index(drop=True))
df_final = pd.concat(df_list, axis=1)
                      Test 1              Test 2    It  Maybe  Will    it  \
0             Will this work  Is this even legit   NaN    NaN  True   NaN
1         Maybe it will work                nope   NaN   True   NaN  True
2  It probably will not work                nope  True    NaN   NaN   NaN

    not  probably  this  will  work    Is  even  legit  nope  this
0   NaN       NaN  True   NaN  True  True  True   True   NaN  True
1   NaN       NaN   NaN  True  True   NaN   NaN    NaN  True   NaN
2  True      True   NaN  True  True   NaN   NaN    NaN  True   NaN
The only difference between this answer and the previous one is that you want to keep the other column, Test 2. The following accomplishes that:
splitter = ' '
df_split = pd.concat((df, df['Test 1'].str.split(splitter, expand=True)), axis=1)
df_melt = pd.melt(df_split, id_vars=['Test 1', 'Test 2'], var_name='count')
df_pivot = pd.pivot_table(df_melt,
                          index=['Test 1', 'Test 2'],
                          columns='value',
                          values='count',
                          aggfunc=lambda x: True,
                          fill_value=np.nan)\
             .reset_index('Test 2')\
             .reindex(df['Test 1'])\
             .reset_index()
df_pivot.columns.name = ''
                      Test 1              Test 2    It  Maybe  Will    it  \
0             Will this work  Is this even legit   NaN    NaN  True   NaN
1         Maybe it will work                nope   NaN   True   NaN  True
2  It probably will not work                nope  True    NaN   NaN   NaN

    not  probably  this  will  work
0   NaN       NaN  True   NaN  True
1   NaN       NaN   NaN  True  True
2  True      True   NaN  True  True
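You need to provide a sample DataFrame with sample results to get better and faster answers. This is a shot in the dark. I will first create a sample DataFrame with some fake data and then attempt a solution.
# create fake data
df = pd.DataFrame({'col1':['here is some text', 'some more text', 'finally some different text']})
Output of df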
                          col1
0            here is some text
1               some more text
2  finally some different text
Split each value in col1 on the splitter (here just a single space)
col_name = 'col1'
splitter = ' '
df_split = pd.concat((df[col_name], df[col_name].str.split(splitter, expand=True)), axis=1)
df_split
                          col1        0     1          2     3
0            here is some text     here    is       some  text
1               some more text     some  more       text  None
2  finally some different text  finally  some  different  text
Put all of the splits into a single column
df_melt = pd.melt(df_split, id_vars='col1', var_name='count')
df_melt
                           col1  count      value
0             here is some text      0       here
1                some more text      0       some
2   finally some different text      0    finally
3             here is some text      1         is
4                some more text      1       more
5   finally some different text      1       some
6             here is some text      2       some
7                some more text      2       text
8   finally some different text      2  different
9             here is some text      3       text
10               some more text      3       None
11  finally some different text      3       text
Finally, pivot the DataFrame above so that the split words become the columns
pd.pivot_table(df_melt, index='col1', columns='value', values='count', aggfunc=lambda x: True, fill_value=False)
Output
value                        different  finally   here     is   more   some   text
col1
finally some different text       True     True  False  False  False   True   True
here is some text                False    False   True   True  False   True   True
some more text                   False    False  False  False   True   True   True
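The three steps above (split, melt, pivot) can be combined into one function. The sketch below is a variant built on assumptions, not the answer's exact code: the name `expand_col_sketch` and the helper column `row` are introduced here, and the pivot is keyed on the row position rather than on the text, so that repeated texts do not collide. It does not handle substrings that clash with existing column names.

```python
import numpy as np
import pandas as pd


def expand_col_sketch(df, col_name, splitter=" "):
    """Return df plus one True/NaN indicator column per substring of df[col_name]."""
    # Step 1: split the text into positional columns 0, 1, 2, ...
    split = df[col_name].str.split(splitter, expand=True)
    # Hypothetical helper key: the row position, which is unique by construction.
    split["row"] = np.arange(len(df))
    # Step 2: one (row, substring) pair per line; drop the padding left by short rows.
    melted = split.melt(id_vars="row", var_name="count").dropna(subset=["value"])
    # Step 3: pivot so that each substring becomes a column.
    flags = pd.pivot_table(melted, index="row", columns="value", values="count",
                           aggfunc=lambda x: True, fill_value=np.nan)
    # Restore rows that had no substrings at all, then the original row labels.
    flags = flags.reindex(range(len(df)))
    flags.index = df.index
    flags.columns.name = None
    return pd.concat([df, flags], axis=1)


df = pd.DataFrame({
    "Test 1": ["Will this work", "Maybe it will work", "It probably will not work"],
    "Test 2": ["Is this even legit", "nope", "nope"],
})
out = expand_col_sketch(df, "Test 1", " ")
print(out)
```

Keying on the row position rather than the text is a design choice made for this sketch: it means no reindex against the text column is needed afterwards.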
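Answer 1 (score: 0)
I finally got it to work: I just applied the same approach to the column of interest only and concatenated the result.
It is about 10,000 times faster, thank you very much! Here is the final working solution: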
def expand_df_col_test(df, col_name, splitter):
    import numpy as np
    # Keep only the column of interest plus the original row index
    sub_df = pd.concat([df[col_name], pd.Series(list(df.index))], axis=1).rename(columns={col_name: col_name, 0: 'index'})
    # Suffix substrings that would clash with existing column names
    replacements = list(pd.Series(df.columns).astype(str) + "_" + col_name)
    df_split = pd.concat((sub_df, sub_df[col_name].astype(str).replace(list(df.columns), replacements, regex=True).str.split(splitter, expand=True)), axis=1)
    df_melt = pd.melt(df_split, id_vars=list(sub_df.columns), var_name='count')
    # sort_values replaces DataFrame.sort, which was removed from pandas
    df_pivot = pd.pivot_table(df_melt,
                              index=list(sub_df.columns),
                              columns='value',
                              values='count',
                              aggfunc=lambda x: True,
                              fill_value=np.nan).reset_index('index').sort_values('index').reset_index().drop([col_name, 'index'], axis=1)
    df_pivot.columns.name = ''
    df_final = pd.concat([df, df_pivot], axis=1)
    return df_final
Answer 2 (score: 0)
I would use CountVectorizer here:
In [103]: df
Out[103]:
                       Test1               Test2
0             Will this work  Is this even legit
1         Maybe it will work                nope
2  It probably will not work                nope
In [104]: from sklearn.feature_extraction.text import CountVectorizer
     ...: vectorizer = CountVectorizer(min_df=1, lowercase=False)
     ...: X = vectorizer.fit_transform(df.Test1.fillna(''))
     ...:
In [105]: r = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())  # get_feature_names() on scikit-learn older than 1.0
In [106]: r
Out[106]:
   It  Maybe  Will  it  not  probably  this  will  work
0   0      0     1   0    0         0     1     0     1
1   0      1     0   1    0         0     0     1     1
2   1      0     0   0    1         1     0     1     1
In [107]: df.join(r)
Out[107]:
                       Test1               Test2  It  Maybe  Will  it  not  probably  this  will  work
0             Will this work  Is this even legit   0      0     1   0    0         0     1     0     1
1         Maybe it will work                nope   0      1     0   1    0         0     0     1     1
2  It probably will not work                nope   1      0     0   0    1         1     0     1     1
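Or the standard way, using the default lowercase=True (all words are converted to lowercase first), after re-creating the vectorizer without lowercase=False:
In [111]: X = vectorizer.fit_transform(df.Test1.fillna(''))
In [112]: r = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())  # get_feature_names() on scikit-learn older than 1.0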
In [113]: r
Out[113]:
   it  maybe  not  probably  this  will  work
0   0      0    0         0     1     1     1
1   1      1    0         0     0     1     1
2   1      0    1         1     0     1     1
In [114]: df.join(r)
Out[114]:
                       Test1               Test2  it  maybe  not  probably  this  will  work
0             Will this work  Is this even legit   0      0    0         0     1     1     1
1         Maybe it will work                nope   1      1    0         0     0     1     1
2  It probably will not work                nope   1      0    1         1     0     1     1
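For comparison, here is a pandas-only sketch that produces the same kind of indicator columns without scikit-learn. It swaps in a different technique, `Series.str.get_dummies`, which splits on an explicit separator. That may matter because CountVectorizer's default token pattern keeps only tokens of two or more word characters and treats punctuation as a separator, so it does not necessarily match an arbitrary splitter. The variable names are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Test1": ["Will this work", "Maybe it will work", "It probably will not work"],
    "Test2": ["Is this even legit", "nope", "nope"],
})

splitter = " "
# One indicator column per distinct substring; the split is case-sensitive.
dummies = df["Test1"].str.get_dummies(sep=splitter)
# Convert the indicators to booleans and attach them to the original frame.
# df.join raises if a substring equals an existing column name; that case is
# not handled in this sketch.
out = df.join(dummies.astype(bool))
print(out)
```

Whether this is fast enough at ~12K rows is not something the thread measures, so it would need to be timed on the real data.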