My question is related to an earlier question of mine: Split text in cells and create additional rows for the tokens.
Suppose I have the following in a pandas DataFrame:
id text
1 I am the first document and I am very happy.
2 Here is the second document and it likes playing tennis.
3 This is the third document and it looks very good today.
I want to split the text of each id into tokens made up of a random number of words (varying between two values, e.g. 1 and 5), so that I end up with something like the following:
id text
1 I am the
1 first document
1 and I am very
1 happy
2 Here is
2 the second document and it
2 likes playing
2 tennis
3 This is the third
3 document and it
3 looks very
3 good today
Keep in mind that my DataFrame may also have other columns besides these two; they should simply be copied to the new DataFrame in the same way as id above.
What is the most efficient way to do this?
Answer 0 (score: 2)
Use itertools.islice
Define a function that extracts chunks of random size:
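from itertools import islice
import random

import pandas as pd

lo, hi = 3, 5  # minimum and maximum words per chunk; change this to whatever

def extract_chunks(it):
    """Consume an iterator of words and return a list of randomly sized chunks."""
    chunks = []
    while True:
        # Take between lo and hi words; the final chunk may be shorter.
        chunk = list(islice(it, random.choice(range(lo, hi+1))))
        if not chunk:
            break
        chunks.append(' '.join(chunk))
    return chunks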
Call the function in a list comprehension to keep overhead as low as possible, then stack to get the output:
pd.DataFrame(
    [extract_chunks(iter(text.split())) for text in df['text']],
    index=df['id']
).stack()
id
1   0                    I am the
    1        first document and I
    2              am very happy.
2   0                 Here is the
    1         second document and
    2    it likes playing tennis.
3   0           This is the third
    1       document and it looks
    2            very good today.
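If a flat two-column frame like the one in the question is preferred over the stacked Series, one way to get there is to drop the inner index level and reset the index. This is only a minimal sketch: the `ids` and `chunked` values are hypothetical stand-ins for `df['id']` and the output of `extract_chunks`, and the explicit `dropna()` is an added precaution, because whether `stack()` discards the padding of shorter rows may depend on the pandas version.

```python
import pandas as pd

# Hypothetical stand-ins for df['id'] and the per-row output of extract_chunks.
ids = [1, 2]
chunked = [
    ['I am the', 'first document and I', 'am very happy.'],
    ['Here is the second document', 'and it likes playing tennis.'],
]

# Rows with fewer chunks are padded with missing values; dropna() removes them.
stacked = pd.DataFrame(chunked, index=pd.Index(ids, name='id')).stack().dropna()

# Drop the chunk counter (inner index level) and turn 'id' back into a column.
flat = (stacked.reset_index(level=1, drop=True)
               .rename('text')
               .reset_index())
print(flat)
```

The result has one row per chunk, with the `id` repeated for every chunk that came from the same document.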
You can extend the extract_chunks function to perform proper tokenisation. Right now I use a simple split on whitespace, which you can modify.
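As a rough illustration of such an extension, the variant below replaces the whitespace split with a regular expression that treats punctuation marks as separate tokens. The function name `extract_token_chunks`, the pattern `TOKEN_RE`, and the `rng` parameter are assumptions made for this sketch and are not part of the answer above.

```python
import random
import re
from itertools import islice

# Hypothetical token pattern: a run of word characters, or one punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def extract_token_chunks(text, lo=3, hi=5, rng=random):
    """Split text into chunks of lo..hi regex tokens; the last chunk may be shorter."""
    it = iter(TOKEN_RE.findall(text))
    chunks = []
    while True:
        chunk = list(islice(it, rng.randint(lo, hi)))
        if not chunk:
            break
        chunks.append(' '.join(chunk))
    return chunks

text = "It likes tennis, and it looks good today."
# A seeded generator makes the random chunk sizes reproducible.
chunks = extract_token_chunks(text, rng=random.Random(0))
print(chunks)
```

Passing a seeded `random.Random` instance is optional; by default the module-level generator is used, as in the original function.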
Note that if you have other columns you don't want to touch, you can do something like a melt operation here:
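u = pd.DataFrame([
    extract_chunks(iter(text.split())) for text in df['text']])

# Assumes df has the default RangeIndex so that the rows of u line up with df.
(pd.concat([df.drop(columns='text'), u], axis=1)
   .melt(df.columns.difference(['text']).tolist())
   .dropna(subset=['value']))  # drop padding left by rows with fewer chunks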
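An alternative to melting, assuming pandas 0.25 or later where `DataFrame.explode` is available, is to store the list of chunks in the text column and explode it, which repeats every other column automatically. This is a sketch rather than the answer's method: the `lang` column is a hypothetical extra column, and this version of `extract_chunks` takes `lo` and `hi` as parameters for self-containment.

```python
import random
from itertools import islice

import pandas as pd

def extract_chunks(words, lo=3, hi=5):
    """Return randomly sized chunks (lo..hi words) from an iterable of words."""
    it = iter(words)
    chunks = []
    while True:
        chunk = list(islice(it, random.randint(lo, hi)))
        if not chunk:
            break
        chunks.append(' '.join(chunk))
    return chunks

df = pd.DataFrame({
    'id': [1, 2],
    'lang': ['en', 'en'],  # hypothetical extra column that should be copied as-is
    'text': ['I am the first document and I am very happy.',
             'Here is the second document and it likes playing tennis.'],
})

# One list of chunks per row, aligned on df's own index.
chunk_lists = pd.Series(
    [extract_chunks(t.split()) for t in df['text']], index=df.index)

# explode() emits one row per chunk and repeats the remaining columns.
out = df.assign(text=chunk_lists).explode('text').reset_index(drop=True)
print(out)
```

Because the chunk lists are aligned on `df.index`, this does not rely on the frame having a default index, and no padding rows need to be dropped afterwards.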