I have a DataFrame with a string column, as shown below:
id text label
1 this is long string with many words 1
2 this is a middle string 0
3 short string 1
I want to convert this DataFrame into another DataFrame based on string length, i.e. (df['text'].str.len > 3):
id text label
1 this is long 1
1 string with many 1
1 words 1
2 this is a 0
2 middle string 0
3 short string 1
Here is my code:
pd.concat(df['text'].str.len() > 200)
But this is wrong.
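A note on the attempt above: `str.len()` counts characters rather than words, and `pd.concat` expects a sequence of pandas objects, not a boolean Series. Below is a minimal sketch of counting words per row instead, assuming words are separated by whitespace and using the sample data from the question (the variable names are only illustrative):

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'id': [1, 2, 3],
    'text': ['this is long string with many words',
             'this is a middle string',
             'short string'],
    'label': [1, 0, 1],
})

# Number of characters in each string
char_counts = df['text'].str.len()

# Number of whitespace-separated words in each string
word_counts = df['text'].str.split().str.len()

# Boolean mask: rows with more than 3 words, i.e. rows that need splitting
needs_split = word_counts > 3
```

The mask only identifies which rows are too long; the actual splitting still needs one of the approaches in the answers.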
Answer 0 (score: 0)
IIUC
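# Split each text into a list of words
v=df.text.str.split(' ')
# One row per word; each label is repeated once per word of its row
s=pd.DataFrame({'text':v.sum(),'label':df.label.repeat(v.str.len())})
# Position of each word within its original row
s['New']=s.groupby(s.index).cumcount()
# Join every 3 consecutive words that belong to the same original row
s.groupby([s.New//3,s.index.get_level_values(level=0)]).agg({'text':lambda x : ' '.join(x),'label':'first'}).sort_index(level=1)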
Out[1785]:
text label
New
0 0 this is long 1
1 0 string with many 1
2 0 words 1
0 1 this is a 0
1 1 middle string 0
0 2 short string 1
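The same idea (one row per word, then regroup every three words) can also be sketched with `DataFrame.explode`, which avoids concatenating lists with `v.sum()`. This is an alternative rather than the answer's own code; it assumes pandas 0.25 or later and that `id` is unique per original row, and the `chunk` column is a helper name introduced here:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'id': [1, 2, 3],
    'text': ['this is long string with many words',
             'this is a middle string',
             'short string'],
    'label': [1, 0, 1],
})
n = 3  # words per output row

# One row per word, with a fresh unique index
words = (
    df.assign(text=df['text'].str.split())
      .explode('text')
      .reset_index(drop=True)
)

# Position of each word within its original row, bucketed into groups of n
words['chunk'] = words.groupby('id').cumcount() // n

# Re-join the words of each (id, chunk) bucket into one string
result = (
    words.groupby(['id', 'chunk'], as_index=False)
         .agg(text=('text', ' '.join), label=('label', 'first'))
         .drop(columns='chunk')
)
```

Grouping by `id` relies on the assumption that each original row has its own `id`; if that does not hold, the original row index would have to be kept as the grouping key instead.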
Answer 1 (score: 0)
You can do:
In [1257]: n = 3
In [1279]: df.set_index(['label', 'id'])['text'].str.split().apply(
               lambda x: pd.Series([' '.join(x[i:i+n]) for i in range(0, len(x), n)])
           ).stack().reset_index().drop('level_2', axis=1)
Out[1279]:
label id 0
0 1 1 this is long
1 1 1 string with many
2 1 1 words
3 0 2 this is a
4 0 2 middle string
5 1 3 short string
Details
label text id
0 1 this is long string with many words 1
1 0 this is a middle string 2
2 1 short string 3
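For comparison, here is a sketch of the same chunking that keeps the original column names and order. It is not the code from this answer: it assumes pandas 0.25 or later for `DataFrame.explode`, and `chunk_words` is a hypothetical helper defined only for this example:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'id': [1, 2, 3],
    'text': ['this is long string with many words',
             'this is a middle string',
             'short string'],
    'label': [1, 0, 1],
})
n = 3  # words per output row


def chunk_words(text, size):
    """Split text on whitespace and re-join the words in groups of `size`."""
    words = text.split()
    return [' '.join(words[i:i + size]) for i in range(0, len(words), size)]


# Replace each text with its list of chunks, then expand one chunk per row
result = (
    df.assign(text=df['text'].apply(lambda t: chunk_words(t, n)))
      .explode('text')
      .reset_index(drop=True)
)
```

Because `explode` repeats the other columns for every chunk, `id` and `label` carry over without any index juggling.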
Answer 2 (score: 0)
Here is a solution that uses a couple of loops to split the text into groups of 3 words:
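array = []
for ii, row in df.iterrows():
    # Split once and reuse the word list
    words = row['text'].split()
    # Compare the number of words (not the list itself) with 3
    if len(words) > 3:
        jj = 0
        while jj < len(words):
            array.append(
                pd.Series(
                    {'id': row['id'], 'label': row['label'],
                     # Join each group of up to 3 words back into a string
                     'text': ' '.join(words[jj:jj+3])}
                )
            )
            jj += 3
    else:
        array.append(row)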
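The loop above only collects rows in a list. A minimal sketch of one way to finish the job, written as a self-contained variant that gathers plain dicts and builds the DataFrame at the end, could look like this (it assumes the column names from the question; `rows` and `result` are names chosen here):

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'id': [1, 2, 3],
    'text': ['this is long string with many words',
             'this is a middle string',
             'short string'],
    'label': [1, 0, 1],
})
n = 3  # words per output row

rows = []
for row in df.itertuples(index=False):
    words = row.text.split()
    # Step through the words n at a time; short texts yield a single chunk
    for start in range(0, len(words), n):
        rows.append({
            'id': row.id,
            'text': ' '.join(words[start:start + n]),
            'label': row.label,
        })

# Build the final DataFrame once, after the loop
result = pd.DataFrame(rows, columns=['id', 'text', 'label'])
```

Stepping with `range(0, len(words), n)` handles short and long texts the same way, so the separate `else` branch is no longer needed.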