Question from a Python beginner! I have a TSV file that looks like this:
WHI5 YOR083W CDC28 YBR160W physical interactions 19823668
WHI5 YOR083W CDC28 YBR160W physical interactions 21658602
WHI5 YOR083W CDC28 YBR160W physical interactions 24186061
WHI5 YOR083W RPD3 YNL330C physical interactions 19823668
WHI5 YOR083W SWI4 YER111C physical interactions 15210110
WHI5 YOR083W SWI4 YER111C physical interactions 15210111
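I would like to count all the lines that contain the same word in row[3] (the fourth field) and output only the first such line, with the number of occurrences in a new column: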
WHI5 YOR083W CDC28 YBR160W physical interactions 19823668 3
WHI5 YOR083W RPD3 YNL330C physical interactions 19823668 1
WHI5 YOR083W SWI4 YER111C physical interactions 15210110 2
So far I have tried combinations of 'csv' and 'Counter', or 'pandas' and 'Counter', without success...
Answer 0 (score: 3)
Using pandas:
>>> import pandas as pd
>>> from io import StringIO  # on Python 3 a str literal needs StringIO, not BytesIO
>>> df = pd.read_table(StringIO("""\
... col1 col2 col3 col4 col5 col6
... WHI5 YOR083W CDC28 YBR160W "physical interactions" 19823668
... WHI5 YOR083W CDC28 YBR160W "physical interactions" 21658602
... WHI5 YOR083W CDC28 YBR160W "physical interactions" 24186061
... WHI5 YOR083W RPD3 YNL330C "physical interactions" 19823668
... WHI5 YOR083W SWI4 YER111C "physical interactions" 15210110
... WHI5 YOR083W SWI4 YER111C "physical interactions" 15210111"""),
... sep=r"\s+")  # same effect as delim_whitespace=True, which newer pandas deprecates
The pandas DataFrame will look like this:
>>> df
col1 col2 col3 col4 col5 col6
0 WHI5 YOR083W CDC28 YBR160W physical interactions 19823668
1 WHI5 YOR083W CDC28 YBR160W physical interactions 21658602
2 WHI5 YOR083W CDC28 YBR160W physical interactions 24186061
3 WHI5 YOR083W RPD3 YNL330C physical interactions 19823668
4 WHI5 YOR083W SWI4 YER111C physical interactions 15210110
5 WHI5 YOR083W SWI4 YER111C physical interactions 15210111
[6 rows x 6 columns]
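The sample above is whitespace-delimited with a header row and quoted fields, whereas the file in the question is tab-separated with no header. A minimal sketch of loading that kind of data, assuming the column names col1..col6 used here and using an in-memory string in place of the real file (the name df_tsv is only illustrative):

```python
from io import StringIO

import pandas as pd

# Stand-in for the real file. In practice, pass the path of the TSV file
# to pd.read_csv instead of StringIO(raw).
raw = (
    "WHI5\tYOR083W\tCDC28\tYBR160W\tphysical interactions\t19823668\n"
    "WHI5\tYOR083W\tCDC28\tYBR160W\tphysical interactions\t21658602\n"
    "WHI5\tYOR083W\tCDC28\tYBR160W\tphysical interactions\t24186061\n"
    "WHI5\tYOR083W\tRPD3\tYNL330C\tphysical interactions\t19823668\n"
    "WHI5\tYOR083W\tSWI4\tYER111C\tphysical interactions\t15210110\n"
    "WHI5\tYOR083W\tSWI4\tYER111C\tphysical interactions\t15210111\n"
)

# Assumed column names; the file in the question has no header row.
cols = ["col1", "col2", "col3", "col4", "col5", "col6"]

# sep="\t" splits on tabs only, so "physical interactions" stays one field
# without needing quotes.
df_tsv = pd.read_csv(StringIO(raw), sep="\t", header=None, names=cols)
print(df_tsv)
```

With tab as the separator, the space inside "physical interactions" needs no special handling.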
To get the counts, group by col3 and take the length of each group:
>>> df['cnt'] = df.groupby('col3')['col3'].transform(len)
>>> df
col1 col2 col3 col4 col5 col6 cnt
0 WHI5 YOR083W CDC28 YBR160W physical interactions 19823668 3
1 WHI5 YOR083W CDC28 YBR160W physical interactions 21658602 3
2 WHI5 YOR083W CDC28 YBR160W physical interactions 24186061 3
3 WHI5 YOR083W RPD3 YNL330C physical interactions 19823668 1
4 WHI5 YOR083W SWI4 YER111C physical interactions 15210110 2
5 WHI5 YOR083W SWI4 YER111C physical interactions 15210111 2
[6 rows x 7 columns]
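As a possible alternative to passing len, transform also accepts the name of a built-in aggregation such as "size". A small sketch on a reduced frame that keeps only two of the columns (the reduced frame is an assumption made to keep the example short):

```python
import pandas as pd

# Reduced version of the data: only the grouping column and the ID column.
df = pd.DataFrame({
    "col3": ["CDC28", "CDC28", "CDC28", "RPD3", "SWI4", "SWI4"],
    "col6": [19823668, 21658602, 24186061, 19823668, 15210110, 15210111],
})

# "size" counts the rows in each group; transform broadcasts that count
# back onto every row of the group.
df["cnt"] = df.groupby("col3")["col3"].transform("size")
print(df)
```

The resulting cnt column should match the one produced with len.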
Select the first row of each group:
>>> df.groupby('col3').apply(lambda obj: obj.head(n=1))
col1 col2 col3 col4 col5 col6 cnt
col3
CDC28 0 WHI5 YOR083W CDC28 YBR160W physical interactions 19823668 3
RPD3 3 WHI5 YOR083W RPD3 YNL330C physical interactions 19823668 1
SWI4 4 WHI5 YOR083W SWI4 YER111C physical interactions 15210110 2
[3 rows x 7 columns]
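For comparison, the same result can probably be reached with only the standard library, along the lines of the 'csv' and 'Counter' combination mentioned in the question. This is a sketch under assumptions: the data is tab-separated with no header, the grouping key is row[3], and an in-memory string stands in for the real file.

```python
import csv
from collections import Counter
from io import StringIO

# Stand-in for the real file. In practice, open the TSV file with
# newline="" and pass the file object to csv.reader.
raw = (
    "WHI5\tYOR083W\tCDC28\tYBR160W\tphysical interactions\t19823668\n"
    "WHI5\tYOR083W\tCDC28\tYBR160W\tphysical interactions\t21658602\n"
    "WHI5\tYOR083W\tCDC28\tYBR160W\tphysical interactions\t24186061\n"
    "WHI5\tYOR083W\tRPD3\tYNL330C\tphysical interactions\t19823668\n"
    "WHI5\tYOR083W\tSWI4\tYER111C\tphysical interactions\t15210110\n"
    "WHI5\tYOR083W\tSWI4\tYER111C\tphysical interactions\t15210111\n"
)

# Read everything first, because the counts must be known before writing.
rows = list(csv.reader(StringIO(raw), delimiter="\t"))

# Count how many rows share each value of row[3].
counts = Counter(row[3] for row in rows)

# Keep only the first row seen for each row[3] value and append its count.
seen = set()
result = []
for row in rows:
    key = row[3]
    if key not in seen:
        seen.add(key)
        result.append(row + [str(counts[key])])

for row in result:
    print("\t".join(row))
```

Reading the whole file into a list keeps the code short; for very large files, two passes over the file would avoid holding every row in memory.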