How to build a new dataframe with one row per word, with counts derived from another column

Time: 2019-04-18 14:20:45

Tags: python pandas

Let me explain. My df looks like this:

id    text                              c1
1     Hello world how are you people    1
2     Hello people I am fine  people    1
3     Good Morning people               -1
4     Good Evening                      -1

c1 contains only two values, 1 or -1.

Now I want a dataframe like this (output):

Word      Totalcount     Points      PercentageOfPointAndTotalCount

hello        2             2              100
world        1             1              100
how          1             1              100
are          1             1              100
you          1             1              100
people       3             1              33.33
I            1             1              100
am           1             1              100
fine         1             1              100
Good         2             -2            -100
Morning      1             -1            -100
Evening      1             -1            -100

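Here, Totalcount is the total number of times each word appears in the text column.

Points is the sum of c1 for each word. Example: the word people appears in two rows where c1 is 1 and in one row where c1 is -1, so its Points value is 1 (2 - 1 = 1).

PercentageOfPointAndTotalCount = Points / Totalcount * 100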
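
The definitions above can be sketched roughly as follows, assuming pandas 0.25 or later (for `explode` and named aggregation). The helper name `word_stats` and its `once_per_row` flag are hypothetical, introduced only for illustration. The `people` example (Totalcount 3, Points 1) counts a word once per row, whereas counting every occurrence gives 4 and 2, so the sketch supports both readings. It also keeps the original letter case, so the word appears as `Hello` rather than `hello`.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": [
        "Hello world how are you people",
        "Hello people I am fine  people",
        "Good Morning people",
        "Good Evening",
    ],
    "c1": [1, 1, -1, -1],
})


def word_stats(frame, once_per_row=False):
    """Hypothetical helper: per-word count, sum of c1, and their ratio in percent."""
    # One row per word; the other columns are repeated for each word
    words = frame.assign(Word=frame["text"].str.split()).explode("Word")
    if once_per_row:
        # Count a word at most once for each source row
        words = words.drop_duplicates(subset=["id", "Word"])
    out = (
        words.groupby("Word", sort=False)["c1"]
        .agg(Totalcount="count", Points="sum")
        .reset_index()
    )
    out["PercentageOfPointAndTotalCount"] = (
        out["Points"] / out["Totalcount"] * 100
    ).round(2)
    return out


per_occurrence = word_stats(df)                 # people: 4 occurrences, Points 2
per_row = word_stats(df, once_per_row=True)     # people: 3 rows, Points 1
print(per_row)
```

With `once_per_row=True` the `people` row matches the expected output in the question; without it, the result matches the counts shown in the answers.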

print(df)

      id comment_text  target
0  59848  Hello world    -1.0
1  59849  Hello world    -1.0

2 answers:

Answer 0 (score: 3)

I apply str.split first and then unnesting; after that we only need groupby + agg:

df['text'] = df['text'].str.split()  # turn each sentence into a list of words first
unnesting(df,['text']).groupby('text').c1.agg(['count','sum'])
Out[873]: 
         count  sum
text               
Evening      1   -1
Good         2   -2
Hello        2    2
I            1    1
Morning      1   -1
am           1    1
are          1    1
fine         1    1
how          1    1
people       4    2
world        1    1
you          1    1

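import numpy as np
import pandas as pd

def unnesting(df, explode):
    # Repeat each row's index label once per element of its list
    idx = df.index.repeat(df[explode[0]].str.len())
    # Flatten every list-valued column into one long column
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    # Re-attach the remaining columns by index label
    return df1.join(df.drop(columns=explode), how='left')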
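
To make the mechanics of the helper above easier to follow, here is a minimal sketch of its three steps on a toy frame. The frame `toy` and the variable names are made up for illustration and are not part of the answer.

```python
import numpy as np
import pandas as pd

# Toy frame whose "text" column already holds lists (as after str.split)
toy = pd.DataFrame({"text": [["a", "b"], ["c"]], "c1": [1, -1]})

# Step 1: repeat each index label once per list element
lengths = toy["text"].str.len()
idx = toy.index.repeat(lengths)

# Step 2: flatten the lists into one long array
flat = np.concatenate(toy["text"].values)

# Step 3: build the long column and join the remaining columns back by index
long_df = pd.DataFrame({"text": flat}, index=idx).join(toy.drop(columns="text"))
print(long_df)
```

On pandas 0.25 or later, `DataFrame.explode` performs the same row expansion for a single list-valued column, so `toy.explode("text")` should give an equivalent result here.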

Answer 1 (score: 1)

Here is a self-contained version:

new_df = (df.set_index('c1').text.str.split().apply(pd.Series)
      .stack().reset_index().drop('level_1', axis=1))

new_df.groupby(0).c1.agg(['sum','count'])

Output:

+---------+-----+-------+
|         | sum | count |
+---------+-----+-------+
|    0    |     |       |
+---------+-----+-------+
| Evening |  -1 |     1 |
| Good    |  -2 |     2 |
| Hello   |   2 |     2 |
| I       |   1 |     1 |
| Morning |  -1 |     1 |
| am      |   1 |     1 |
| are     |   1 |     1 |
| fine    |   1 |     1 |
| how     |   1 |     1 |
| people  |   2 |     4 |
| world   |   1 |     1 |
| you     |   1 |     1 |
+---------+-----+-------+
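
As a rough illustration of what the split / apply / stack chain in this answer does at each stage, here is a sketch on a toy frame. The frame `toy` and the intermediate names are hypothetical. The explicit `dropna()` is an addition not present in the answer; it is there on the assumption that newer pandas versions may keep the padding NaN values when stacking.

```python
import pandas as pd

toy = pd.DataFrame({"text": ["a b", "c"], "c1": [1, -1]})

# One column per word position; shorter rows are padded with NaN
wide = toy.set_index("c1").text.str.split().apply(pd.Series)

# Stack the word positions into rows and discard the NaN padding
stacked = wide.stack().dropna()

# Back to ordinary columns: "c1" and the unnamed word column 0
tidy = stacked.reset_index().drop("level_1", axis=1)

# Same aggregation as in the answer
result = tidy.groupby(0).c1.agg(["sum", "count"])
print(result)
```

The word column ends up named `0` because the stacked Series has no name, which is why the answer groups by `0`.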