I have a DataFrame, csv_table, that looks like this:
| time | ID | range | text |
|:-----:|:----------------:|:-----:|:--------------------------------------------------:|
| 90000 | B0A0F80A06A3AB6C | 0 | In what year did baseball become an offical sport? |
| 90000 | 95A33E619934A39B | 0 | wirehair pointing griffon |
| 90000 | E613C21C535BC636 | 30 | ncic |
| 90000 | 687340036669C45D | 0 | kitchen appliances |
| 90000 | E43DD6D82BFBD0B8 | 0 | where can I find a chines rosewood |
| 90000 | CA52ECD1524E737D | 0 | jennifer love hewitt naked |
| 90000 | 2B4FAF545C0E6EF0 | 40 | pageant trim |
| 90000 | 6456584F5B316AAE | 100 | tiger electronics |
(The actual file holds about 300,000 entries.)
What I want to do is find the average number of entries per ID.
In SQL, I would do something like this:
WITH
Counts AS (
SELECT
COUNT(text) AS TheCnt,
ID
FROM
csv_table
GROUP BY
ID
),
Tots AS (
SELECT
AVG(TheCnt) AS TheAvg
FROM
Counts
)
SELECT * FROM Tots
I tried writing some Python code to get the same result:
import pandas as pd
tsv_file = "filepath"
csv_table=pd.read_csv(tsv_file, sep='\t', header=None)
csv_table.columns = ['time', 'ID', 'range', 'text']
val = csv_table.groupby('ID').count()
print(val)
But this is what I get:
time range text
ID
0000177584E874EC 1 1 1
00006291C83E2C2A 2 2 2
00006FD94F3A9CB4 1 1 1
000087A6525FEED2 4 4 4
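How can I get the result I want? I am clearly counting the number of text entries per user, but how do I then take the average of those counts?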
Answer 0: (score: 2)
I assume you just want a single final number, right? If so, that would be:
val['text'].mean()
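For reference, here is a minimal self-contained sketch of the whole pipeline. The sample rows and the short IDs ("A", "B", "C") are made up for illustration and stand in for the real file; the variable names `counts` and `average_entries` are also just placeholders.

```python
import pandas as pd

# Hypothetical sample data standing in for the real tab-separated file.
csv_table = pd.DataFrame({
    "time": [90000] * 6,
    "ID": ["A", "A", "B", "C", "C", "C"],
    "range": [0, 0, 30, 0, 40, 100],
    "text": ["q1", "q2", "q3", "q4", "q5", "q6"],
})

# Count non-null text entries per ID (mirrors COUNT(text) ... GROUP BY ID).
counts = csv_table.groupby("ID")["text"].count()

# Average the per-ID counts (mirrors AVG(TheCnt)).
average_entries = counts.mean()
print(average_entries)
```

With this sample the per-ID counts are 2, 1, and 3, so the average is 2.0. Like `COUNT(text)` in SQL, `count()` skips missing values; if rows with a missing `text` should also be counted, `csv_table.groupby("ID").size().mean()` would probably be the closer fit.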