Let me explain. My df looks like this:
id  text                            c1
1   Hello world how are you people   1
2   Hello people I am fine people    1
3   Good Morning people             -1
4   Good Evening                    -1
c1 contains only two values, 1 or -1.
Now I want a dataframe like this (output):
Word Totalcount Points PercentageOfPointAndTotalCount
hello 2 2 100
world 1 1 100
how 1 1 100
are 1 1 100
you 1 1 100
people 3 1 33.33
I 1 1 100
am 1 1 100
fine 1 1 100
Good 2 -2 -100
Morning 1 -1 -100
Evening 1 -1 -100
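Here, Totalcount is the total number of times each word appears in the text column.
Points is the sum of c1 for each word. Example: the word people appears in two rows where c1 is 1 and in one row where c1 is -1, so its Points value is 1 (2 - 1 = 1).
PercentageOfPointAndTotalCount = Points / Totalcount * 100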
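As a quick check of the formula, here is a minimal Python sketch using the people row from the desired output. The helper name percentage_of_points is hypothetical and only there for illustration; rounding to two decimals is an assumption based on the 33.33 shown in the table.

```python
def percentage_of_points(points, total_count):
    """Return points / total_count * 100, rounded to two decimals."""
    return round(points / total_count * 100, 2)

# Values taken from the "people" row of the desired output.
people_points = 2 - 1  # two rows with c1 = 1, one row with c1 = -1
people_pct = percentage_of_points(people_points, 3)
print(people_pct)  # 33.33
```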
print(df)
id comment_text target
0 59848 Hello world -1.0
1 59849 Hello world -1.0
Answer 0 (score: 3)
I use unnesting after str.split; after that we only need groupby + agg:
df['text'] = df['text'].str.split()
unnesting(df, ['text']).groupby('text').c1.agg(['count', 'sum'])
Out[873]:
count sum
text
Evening 1 -1
Good 2 -2
Hello 2 2
I 1 1
Morning 1 -1
am 1 1
are 1 1
fine 1 1
how 1 1
people 4 2
world 1 1
you 1 1
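import numpy as np
import pandas as pd

def unnesting(df, explode):
    # Repeat each index label once per element of the list in that row.
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat(
        [pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    # Join the untouched columns back on the original index.
    return df1.join(df.drop(columns=explode), how='left')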
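On pandas 0.25 or later, the built-in DataFrame.explode can stand in for the unnesting helper. This is a swapped-in alternative rather than what the answer itself uses; the sketch below rebuilds the sample frame from the question and should give the same count/sum table:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": [
        "Hello world how are you people",
        "Hello people I am fine people",
        "Good Morning people",
        "Good Evening",
    ],
    "c1": [1, 1, -1, -1],
})

# Split each sentence into a list of words, then give every word its own row.
words = df.assign(text=df["text"].str.split()).explode("text")

# Count occurrences and sum c1 per word.
result = words.groupby("text")["c1"].agg(["count", "sum"])
print(result)
```

Every occurrence is counted here, so people gets a count of 4 and a sum of 2, because it appears twice in row 2.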
Answer 1 (score: 1)
Here is a self-contained version:
new_df = (df.set_index('c1').text.str.split().apply(pd.Series)
.stack().reset_index().drop('level_1', axis=1))
new_df.groupby(0).c1.agg(['sum','count'])
Output:
+---------+-----+-------+
| | sum | count |
+---------+-----+-------+
| 0 | | |
+---------+-----+-------+
| Evening | -1 | 1 |
| Good | -2 | 2 |
| Hello | 2 | 2 |
| I | 1 | 1 |
| Morning | -1 | 1 |
| am | 1 | 1 |
| are | 1 | 1 |
| fine | 1 | 1 |
| how | 1 | 1 |
| people | 2 | 4 |
| world | 1 | 1 |
| you | 1 | 1 |
+---------+-----+-------+
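Both answers report people with a count of 4 and a sum of 2, while the desired output in the question shows 3 and 1. One possible reading, and this is only an assumption, is that a word should count once per row. A minimal sketch under that assumption, which also renames the columns and adds the percentage (it uses DataFrame.explode, so it needs pandas 0.25 or later):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": [
        "Hello world how are you people",
        "Hello people I am fine people",
        "Good Morning people",
        "Good Evening",
    ],
    "c1": [1, 1, -1, -1],
})

# One row per word, keeping each word only once per original row
# (assumed reading of the question, not something the answers do).
words = (
    df.assign(Word=df["text"].str.split())
      .explode("Word")
      .drop_duplicates(subset=["id", "Word"])
)

# Named aggregation gives the column names used in the question.
out = (
    words.groupby("Word", sort=False)["c1"]
         .agg(Totalcount="count", Points="sum")
         .reset_index()
)
out["PercentageOfPointAndTotalCount"] = (
    out["Points"] / out["Totalcount"] * 100
).round(2)
print(out)
```

Without the drop_duplicates step the people row would come out as 4, 2 and 50 instead of 3, 1 and 33.33.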