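I have an input file (a CSV file) whose data has duplicate entries in the `group` column and may have duplicate entries in the `size` column.
A snippet with only one group of data is given below. The real data file, however, contains several groups, so this is just a shortened, simplified example (sample.csv):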
group,size,from,to
group32a4,0500,6sq2gp,m4qfce
group32a4,0800,oxlwtg,ru1u5r
group32a4,1200,rpziz0,oxlwtg
group32a4,1400,ru1u5r,fvvskj
group32a4,0500,m4qfce,60m2eq
group32a4,0050,fvvskj,6sq2gp
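Since the data comes from external software, I cannot change anything about the data format or data layout. I therefore need to import the data for further processing and perform the following tasks:
1. For each group, keep the entry with the maximum value in the `size` column.
2. Arrange the `from` and `to` columns into a route.
I decided to use pandas for the data processing because the real data files are rather complex and I would like to have its performant features. However, any other (more suitable) tool or approach using other Python modules would be perfectly fine.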
To accomplish the first task, I did the following:
import pandas as pd
# open file and read data
with open('sample.csv') as f:
    data = pd.read_csv(f)
# sort descending by columns `group` and `size`
# sorting descending because `df.drop_duplicates()` keeps first element by default
df_sorted = data.sort_values(['group', 'size'], ascending=False)
# drop duplicates in order to keep first entry only
one_entry = df_sorted.drop_duplicates('group')
# print handled data
print(one_entry)
This gives the desired output:
group size from to
3 group32a4 1400 ru1u5r fvvskj
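As a possible alternative for the first task, the row with the largest `size` per group can also be selected with `idxmax()` instead of sorting and dropping duplicates. This is only a sketch: the CSV text is inlined through `io.StringIO` so the example runs on its own, and it assumes `size` is parsed as an integer (pandas reads `0500` as `500`).

```python
import io
import pandas as pd

# Inlined copy of sample.csv so the sketch is self-contained
CSV = """group,size,from,to
group32a4,0500,6sq2gp,m4qfce
group32a4,0800,oxlwtg,ru1u5r
group32a4,1200,rpziz0,oxlwtg
group32a4,1400,ru1u5r,fvvskj
group32a4,0500,m4qfce,60m2eq
group32a4,0050,fvvskj,6sq2gp
"""

data = pd.read_csv(io.StringIO(CSV))

# idxmax() returns, per group, the index label of the row with the largest size
idx = data.groupby('group')['size'].idxmax()

# select those rows from the original frame
one_entry = data.loc[idx]
print(one_entry)
```

If several rows of a group share the maximum, `idxmax()` picks the first one it encounters.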
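So I still need to accomplish the second task. Since none of the data processing above is done in place, I can access every stage of the data throughout the whole processing.
Unfortunately, I have no clue how to implement this; I only have a conceptual idea. First, I need to arrange the route for each group subset. For the example given above, this would result in: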
rpziz0 --> oxlwtg --> ru1u5r --> fvvskj --> 6sq2gp --> m4qfce --> 60m2eq
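The ordering step described here could be sketched in plain Python by following a `from` to `to` mapping. The helper name `build_route` is hypothetical, and the sketch assumes each group forms a single linear route with no branches or cycles.

```python
def build_route(hops):
    """Order (from, to) pairs into one chain.

    Assumption: the hops form a single linear route (no branches, no cycles).
    """
    nxt = dict(hops)                      # maps each source to its destination
    targets = set(nxt.values())
    # the start is the only source that never appears as a destination
    starts = [src for src in nxt if src not in targets]
    if len(starts) != 1:
        raise ValueError('expected exactly one starting node')
    route = [starts[0]]
    # follow the mapping at most once per hop, so a malformed input cannot loop forever
    for _ in range(len(nxt)):
        if route[-1] not in nxt:
            break
        route.append(nxt[route[-1]])
    return route


hops = [
    ('6sq2gp', 'm4qfce'),
    ('oxlwtg', 'ru1u5r'),
    ('rpziz0', 'oxlwtg'),
    ('ru1u5r', 'fvvskj'),
    ('m4qfce', '60m2eq'),
    ('fvvskj', '6sq2gp'),
]
route = build_route(hops)
print(' --> '.join(route))
```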
After that, I need to extract the source and the destination and summarize the route like this:
rpziz0 --> 60m2eq
This should result in the overall output:
group size from to
3 group32a4 1400 rpziz0 60m2eq
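If only the two endpoints are needed, the full route does not have to be built: the source is the one `from` value that never occurs in `to`, and the destination is the one `to` value that never occurs in `from`. The following is a rough sketch under that assumption (one linear route per group); the function name `summarize` is made up for illustration. Note that the result gets a fresh index, unlike the output shown above.

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['group32a4'] * 6,
    'size': [500, 800, 1200, 1400, 500, 50],
    'from': ['6sq2gp', 'oxlwtg', 'rpziz0', 'ru1u5r', 'm4qfce', 'fvvskj'],
    'to': ['m4qfce', 'ru1u5r', 'oxlwtg', 'fvvskj', '60m2eq', '6sq2gp'],
})


def summarize(frame):
    """Return one row per group: max size plus the route's source and destination."""
    rows = []
    for name, g in frame.groupby('group'):
        starts = set(g['from']) - set(g['to'])   # never used as a destination
        ends = set(g['to']) - set(g['from'])     # never used as a source
        if len(starts) != 1 or len(ends) != 1:
            raise ValueError('group %s is not a single linear route' % name)
        rows.append({
            'group': name,
            'size': g['size'].max(),
            'from': starts.pop(),
            'to': ends.pop(),
        })
    return pd.DataFrame(rows, columns=['group', 'size', 'from', 'to'])


result = summarize(df)
print(result)
```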
So my question is:
How can I identify the route for each subset defined by the `group` label (preferably using pandas methods)?
Note: I am using Python 3.4.3 and pandas 0.17.1.
Answer 0: (score: 0)
You can use `stack` together with `drop_duplicates` and `pivot`. I added a second group for better testing:
print(df)
group size from to
0 group32a4 500 6sq2gp m4qfce
1 group32a4 800 oxlwtg ru1u5r
2 group32a4 1200 rpziz0 oxlwtg
3 group32a4 1400 ru1u5r fvvskj
4 group32a4 500 m4qfce 60m2eq
5 group32a4 50 fvvskj 6sq2gp
6 group13a4 500 6sq2gp m4qfce
7 group13a4 800 oxlwtg ru1u5r
8 group13a4 1200 rpziz0 oxlwtg
9 group13a4 1400 ru1u5r fvvskj
10 group13a4 500 m4qfce 60m2eq
11 group13a4 50 fvvskj 6sq2gp
#set index and stack data - columns 'from' and 'to' to one column 'route'
df = df.set_index(['group', 'size']).stack().reset_index(name='route')
print(df)
group size level_2 route
0 group32a4 500 from 6sq2gp
1 group32a4 500 to m4qfce
2 group32a4 800 from oxlwtg
3 group32a4 800 to ru1u5r
4 group32a4 1200 from rpziz0
5 group32a4 1200 to oxlwtg
6 group32a4 1400 from ru1u5r
7 group32a4 1400 to fvvskj
8 group32a4 500 from m4qfce
9 group32a4 500 to 60m2eq
10 group32a4 50 from fvvskj
11 group32a4 50 to 6sq2gp
12 group13a4 500 from 6sq2gp
13 group13a4 500 to m4qfce
14 group13a4 800 from oxlwtg
15 group13a4 800 to ru1u5r
16 group13a4 1200 from rpziz0
17 group13a4 1200 to oxlwtg
18 group13a4 1400 from ru1u5r
19 group13a4 1400 to fvvskj
20 group13a4 500 from m4qfce
21 group13a4 500 to 60m2eq
22 group13a4 50 from fvvskj
23 group13a4 50 to 6sq2gp
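To see where the `level_2` column comes from, here is a tiny illustration with made-up data (two hops, `a` to `b` and `b` to `c`). After `set_index`, the remaining columns `from` and `to` are stacked into an unnamed third index level, which `reset_index` then names `level_2`.

```python
import pandas as pd

# Made-up miniature data, only to illustrate the reshaping step
small = pd.DataFrame({
    'group': ['g1', 'g1'],
    'size': [500, 800],
    'from': ['a', 'b'],
    'to': ['b', 'c'],
})

# 'group' and 'size' become the index; 'from' and 'to' are stacked into one column
stacked = small.set_index(['group', 'size']).stack().reset_index(name='route')
print(stacked)
```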
def f(x):
    #set column size to max
    x['size'] = x['size'].max()
    #keep only routes that occur exactly once - the start and the end point
    return x.drop_duplicates('route', keep=False)
#apply custom function f
df = df.groupby('group').apply(f).reset_index(drop=True)
print(df)
group size level_2 route
0 group13a4 1400 from rpziz0
1 group13a4 1400 to 60m2eq
2 group32a4 1400 from rpziz0
3 group32a4 1400 to 60m2eq
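The reason this works is that every intermediate stop appears twice in the stacked `route` column (once as `to`, once as `from`), while the start and the end appear only once. With `keep=False`, `drop_duplicates` removes every value that is duplicated, so only the endpoints survive. A small sketch with made-up hops (`b` to `c`, `a` to `b`, `c` to `d`), assuming a single linear route:

```python
import pandas as pd

# Stacked form of the made-up hops b->c, a->b, c->d
stacked = pd.DataFrame({
    'level_2': ['from', 'to', 'from', 'to', 'from', 'to'],
    'route': ['b', 'c', 'a', 'b', 'c', 'd'],
})

# keep=False drops ALL rows whose route value occurs more than once
endpoints = stacked.drop_duplicates('route', keep=False)
print(endpoints)
```

Only the rows for `a` (a `from`) and `d` (a `to`) remain, which are the source and the destination of the route.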
#reshape data, remove column tmp
df = df.pivot(index='group', columns='level_2').reset_index()
df.columns = ['group','size','tmp','from', 'to']
df = df.drop('tmp', axis=1)
print(df)
group size from to
0 group13a4 1400 rpziz0 60m2eq
1 group32a4 1400 rpziz0 60m2eq
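A possible variation on the reshaping step, shown only as a sketch: passing `values='route'` to `pivot` avoids the temporary `tmp` column and the manual renaming. The frame `ends` below is a hand-typed copy of the intermediate result shown earlier, and the variable names are chosen for illustration.

```python
import pandas as pd

# Hand-typed copy of the intermediate result (one 'from' and one 'to' row per group)
ends = pd.DataFrame({
    'group': ['group13a4', 'group13a4', 'group32a4', 'group32a4'],
    'size': [1400, 1400, 1400, 1400],
    'level_2': ['from', 'to', 'from', 'to'],
    'route': ['rpziz0', '60m2eq', 'rpziz0', '60m2eq'],
})

# pivot only the route values: one row per group, columns 'from' and 'to'
wide = ends.pivot(index='group', columns='level_2', values='route')

# add the size column back; the Series is aligned on the 'group' index
wide.insert(0, 'size', ends.groupby('group')['size'].first())

wide = wide.reset_index()
wide.columns.name = None   # drop the leftover 'level_2' label on the column axis
print(wide)
```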
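EDIT:
Similarly, I think a faster solution is to use `groupby` with `apply` and a custom function `f` that fills the DataFrame using `iat` and selects the result with `iloc`: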
def f(x):
    #get max of column size
    m = x['size'].max()
    #remove all duplicates - only one 'from' value and one 'to' value remain
    x = x.drop_duplicates('route', keep=False)
    x['group'] = x.iat[0, 0]
    x['size'] = m
    x['from'] = x.iat[0, 3]
    x['to'] = x.iat[1, 3]
    #print(x)
    #return first row and columns group, size, from, to
    #print(x.iloc[0,[0,1,4,5]])
    return x.iloc[0,[0,1,4,5]]
#apply custom function f
df = df.groupby('group').apply(f).reset_index(drop=True)
print(df)
group size from to
0 group13a4 1400 rpziz0 60m2eq
1 group32a4 1400 rpziz0 60m2eq