Splitting a sentence into substrings with a varying number of words using pandas

Time: 2019-06-06 17:11:58

Tags: python string pandas tokenize

My question is related to my earlier question: Split text in cells and create additional rows for the tokens

Let's assume that I have the following in a pandas DataFrame:

id  text
1   I am the first document and I am very happy.
2   Here is the second document and it likes playing tennis.
3   This is the third document and it looks very good today.

and I want to split the text of each id into tokens with a random number of words (varying between two values, e.g. 1 and 5), so I finally want to end up with something like the following:

id  text
1   I am the
1   first document
1   and I am very
1   happy
2   Here is
2   the second document and it
2   likes playing
2   tennis
3   This is the third
3   document and
3   looks very
3   very good today

Keep in mind that my DataFrame may also have other columns, apart from these two, which should simply be copied over to the new DataFrame in the same way as above.
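
For reference, a minimal snippet that reproduces the example frame above (assuming only these two columns):

import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],
    'text': ["I am the first document and I am very happy.",
             "Here is the second document and it likes playing tennis.",
             "This is the third document and it looks very good today."]
})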

What is the most efficient way to do this?

1 answer:

Answer 0 (score: 2):

Define a function using itertools.islice to extract chunks in a randomised fashion:

from itertools import islice
import random
import pandas as pd  # used by the snippets further down

lo, hi = 3, 5  # change this to whatever bounds you want on the chunk size
def extract_chunks(it):
    chunks = []
    while True:
        # take between lo and hi words off the iterator at random
        chunk = list(islice(it, random.choice(range(lo, hi+1))))
        if not chunk:  # iterator exhausted
            break
        chunks.append(' '.join(chunk))

    return chunks
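
For illustration, calling it on one tokenised sentence from the example data returns a list of space-joined chunks; the exact split changes between runs because the chunk sizes are random:

chunks = extract_chunks(iter("I am the first document and I am very happy.".split()))
print(chunks)
# e.g. ['I am the first', 'document and I am', 'very happy.']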

Call the function inside a list comprehension to keep the overhead as low as possible, then stack to get your output:

pd.DataFrame([
    extract_chunks(iter(text.split())) for text in df['text']], index=df['id']
).stack()

id   
1   0                    I am the
    1        first document and I
    2              am very happy.
2   0                 Here is the
    1         second document and
    2    it likes playing tennis.
3   0           This is the third
    1       document and it looks
    2            very good today.
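
If you want this back as a flat DataFrame with id and text columns, like the desired output in the question, a minimal sketch:

out = (pd.DataFrame([
           extract_chunks(iter(text.split())) for text in df['text']], index=df['id'])
         .stack()
         .reset_index(level=1, drop=True)  # drop the inner chunk-position level
         .rename('text')
         .reset_index())                   # 'id' becomes a regular column again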

You can extend the extract_chunks function to perform the tokenisation. Right now I do a simple split on whitespace, which you can modify.
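
For example, a sketch of one possible modification that treats punctuation as separate tokens instead of splitting on whitespace (the regular expression here is just an illustration, not part of the original answer):

import re

def tokenize(text):
    # keep punctuation as separate tokens: 'happy.' -> ['happy', '.']
    return re.findall(r"\w+|[^\w\s]", text)

pd.DataFrame([
    extract_chunks(iter(tokenize(text))) for text in df['text']], index=df['id']
).stack()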


Note that if you don't want to touch the other columns, you can do something melt-like here:

u = pd.DataFrame([
    extract_chunks(iter(text.split())) for text in df['text']])

(pd.concat([df.drop(columns='text'), u], axis=1)
   .melt(df.columns.difference(['text'])))
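
The melted result has generic variable/value columns and NaN rows wherever an id produced fewer chunks than the widest row; a small follow-up sketch (assuming you want the same id/text shape as above) to tidy it up:

result = (pd.concat([df.drop(columns='text'), u], axis=1)
            .melt(df.columns.difference(['text']))
            .dropna(subset=['value'])            # ids with fewer chunks leave NaN rows
            .drop(columns='variable')            # the chunk-position column is not needed
            .rename(columns={'value': 'text'})
            .sort_values('id')
            .reset_index(drop=True))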