Question

I have a dataset like this:

text          dialog      class_1     class_2     
hi              1            0           1
how are you?    1            0           1
I'm find        0            1           0
And You?        0            1           0

I want to transform the dataset like this:

word         dialog      class_1     class_2     
hi              1            0           1
how             1            0           1
are             1            0           1
you             1            0           1
?               1            0           1
I'm             0            1           0
find            0            1           0
And             0            1           0
You             0            1           0
?               0            1           0

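Basically, I have a column containing sentences (text). I want to split it into a column with one word per row, keeping the dialog and class columns with the same values for every word of a sentence.

My dataset was created with the pandas library.

My code: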

ct=0
sentences2=[]
for j in dataset['text']:
        sentences1=str.split(dataset.iloc[ct][0])
        sentences2.append(sentences1)
        ct=ct+1

i=0
ii=0
new_dataset=[]
for q in dataset.iloc[i]:
    for qq in sentences2[ii]:
        new_dataset.append(pd.concat([dataset.iloc[i]]*len(sentences2[ii]),ignore_index=False))
        if(i<=len(dataset)):
           i=i+1
        if(ii<=len(sentences2)):
           ii=ii+1

The loop stops when i = 5 and ii = 5, and I don't know why.

Answer 1

Assuming your DataFrame is called df, you can use stack and reset_index, then join the remaining columns back.

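import pandas as pd

# One row per sentence, one column per whitespace-separated word.
words = pd.DataFrame(df["text"].str.split().tolist(), index=df.index)
# Stack into one word per row (dropping the padding of shorter sentences),
# discard the word-position index level, and name the resulting column.
words = words.stack().dropna().reset_index(level=1, drop=True).rename("word")
# Join the other columns back on the original row index.
df2 = words.to_frame().join(df.drop(columns="text"))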

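You may need a better way to split the text: split() with no arguments splits on whitespace only, so punctuation stays attached to the neighbouring word (for example, "you?" remains a single token). The split function also accepts a regular expression as its pattern.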
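
As a rough sketch of the regular-expression idea, one option is to extract the tokens with Series.str.findall instead of splitting on a delimiter, and then use DataFrame.explode (available in pandas 0.25 and later) to get one row per token. The sample data is copied from the question. The token pattern is only an assumption about what should count as a word, and would likely need adjusting for other text.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "text": ["hi", "how are you?", "I'm find", "And You?"],
    "dialog": [1, 1, 0, 0],
    "class_1": [0, 0, 1, 1],
    "class_2": [1, 1, 0, 0],
})

# Assumed token pattern: a word, optionally with internal apostrophes so
# that "I'm" stays whole, or a single punctuation character.
TOKEN_PATTERN = r"\w+(?:'\w+)*|[^\w\s]"

df2 = (
    df.assign(word=df["text"].str.findall(TOKEN_PATTERN))  # list of tokens per row
      .explode("word")                                     # one row per token
      .drop(columns="text")
      .reset_index(drop=True)
)

# Put the new column first, as in the desired output.
df2 = df2[["word", "dialog", "class_1", "class_2"]]
print(df2)
```

With this pattern the question mark becomes its own row, as in the desired output, and every token inherits the dialog and class values of its sentence.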
