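I have a function that is passed a pandas DataFrame. For each row of that DataFrame I want to create N additional rows, each identical to the original row except for 2 column values.
What is the right way to do this, especially in a RAM-efficient way?
So far I have been trying to run DataFrame.apply and, for each row, call a function that returns a list of pd.Series objects, which I then append to the original DataFrame. However, this has not worked.
Here is an example of what I tried, reproduced with some dummy code: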
import pandas as pd

students = [('Jack', 34, 'Sydney', 'Australia'),
            ('Jill', 30, 'New York', 'USA')]
# Create a DataFrame object (one index label per row)
df = pd.DataFrame(students, columns=['Name', 'Age', 'City', 'Country'], index=['a', 'b'])
# Function I will use to explode a single row into N new rows (N = 3 here)
def replicate(x):
    new_rows = []
    for i in range(3):
        y = x.copy(deep=True)
        y.Age = i
        new_rows.append(y)
    return new_rows
# Iterate over each row and append the results.
# This is the attempt that does not give the desired output;
# note also that DataFrame.append was removed in pandas 2.0.
df.apply(lambda x: df.append(replicate(x)), axis=1)
For the example above, I would like the output to look like this:
Jack, 34, Sydney, Australia
Jack, 0, Sydney, Australia
Jack, 1, Sydney, Australia
Jack, 2, Sydney, Australia
Jill, 30, New York, USA
Jill, 0, New York, USA
Jill, 1, New York, USA
Jill, 2, New York, USA
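In the end, I want a DataFrame with N times as many rows, where the derived rows are computed from each original row. I would like to do this in a space-efficient way, which is not happening at the moment. Any help is appreciated!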
Answer 0 (score: 0)
You can put the DataFrame in a list, multiply the list to get as many copies as you need, and stack them:
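# Stack 6 copies of df: the original plus 5 duplicates of every row.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
df = pd.concat([df] * 6, ignore_index=True)
df.sort_values(by='Name').head(15)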
# Result
     Name  Age       City  Country
28   John   16   New York       US
4    John   16   New York       US
22   John   16   New York       US
34   John   16   New York       US
16   John   16   New York       US
10   John   16   New York       US
17   Mike   17  las vegas       US
29   Mike   17  las vegas       US
23   Mike   17  las vegas       US
11   Mike   17  las vegas       US
35   Mike   17  las vegas       US
5    Mike   17  las vegas       US
3   Neelu   32  Bangalore    India
33  Neelu   32  Bangalore    India
15  Neelu   32  Bangalore    India
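The duplication above leaves Age unchanged and does not keep each original row next to its copies. As a minimal sketch of how the same stacking idea could produce the output requested in the question, the snippet below builds one modified copy of the frame per derived value and then groups the rows with a stable sort. The value N = 3 and the use of the question's sample data are assumptions taken from the question's example, not part of this answer.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame(
    [("Jack", 34, "Sydney", "Australia"), ("Jill", 30, "New York", "USA")],
    columns=["Name", "Age", "City", "Country"],
    index=["a", "b"],
)

N = 3  # assumed number of derived rows per original row

# The original frame, followed by one copy per derived Age value.
pieces = [df] + [df.assign(Age=i) for i in range(N)]

# A stable sort (mergesort) on the original index puts each original row
# directly before its copies and keeps the copies in the order 0, 1, ..., N-1.
out = pd.concat(pieces).sort_index(kind="mergesort").reset_index(drop=True)
print(out)
```

This holds N + 1 copies of the frame in memory at once, so it is convenient for small N rather than tuned for memory.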
Answer 1 (score: 0)
IIUC, you want np.repeat, using the Age column to specify the number of repeats, and then modifying the Age column after the fact.
import pandas as pd
df1 = pd.DataFrame(df.values.repeat(df.Age+1, axis=0),
columns=['Name', 'Age', 'City', 'Country'])
df1['Age'] = (df1.groupby([*df1]).cumcount()-1).where(df1.duplicated(), df1['Age'])
df1
    Name  Age      City    Country
0   Jack   34    Sydney  Australia
1   Jack    0    Sydney  Australia
2   Jack    1    Sydney  Australia
3   Jack    2    Sydney  Australia
4   Jack    3    Sydney  Australia
...
34  Jack   33    Sydney  Australia
35  Jill   30  New York        USA
...
63  Jill   27  New York        USA
64  Jill   28  New York        USA
65  Jill   29  New York        USA
[66 rows x 4 columns]
df
   Name  Age      City    Country
a  Jack   34    Sydney  Australia
b  Jill   30  New York        USA
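The snippet above uses Age itself as the repeat count, while the question's example asks for a fixed number of derived rows. A minimal sketch of the same repeat-then-overwrite idea with a fixed count might look like the following; N = 3 and the helper name slot are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame(
    [("Jack", 34, "Sydney", "Australia"), ("Jill", 30, "New York", "USA")],
    columns=["Name", "Age", "City", "Country"],
    index=["a", "b"],
)

N = 3  # assumed number of derived rows per original row

# Repeat every row N + 1 times by position: one slot for the original,
# N slots for the derived rows.
positions = np.repeat(np.arange(len(df)), N + 1)
out = df.iloc[positions].reset_index(drop=True)

# Position of each row inside its block of N + 1 repeats: 0, 1, ..., N, 0, 1, ...
slot = np.tile(np.arange(N + 1), len(df))

# Slot 0 keeps the original Age; slots 1..N receive the derived values 0..N-1.
out["Age"] = np.where(slot == 0, out["Age"].to_numpy(), slot - 1)
print(out)
```

Selecting by position with iloc avoids relying on a unique index, and the overwrite is a single vectorized assignment rather than a per-row function call.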
Answer 2 (score: 0)
IIUC
d={x : y.set_index('Age').reindex(range(y['Age'].iloc[0]+1),method='bfill') for x , y in df.groupby(level=0)}
newdf=pd.concat(d).reset_index(level=1)
newdf
Out[220]:
Age Name City Country
a 0 Jack Sydney Australia
a 1 Jack Sydney Australia
a 2 Jack Sydney Australia
a 3 Jack Sydney Australia
a 4 Jack Sydney Australia
a 5 Jack Sydney Australia
a 6 Jack Sydney Australia
a 7 Jack Sydney Australia
a 8 Jack Sydney Australia
a 9 Jack Sydney Australia
a 10 Jack Sydney Australia
a 11 Jack Sydney Australia
a 12 Jack Sydney Australia
a 13 Jack Sydney Australia
a 14 Jack Sydney Australia
a 15 Jack Sydney Australia
a 16 Jack Sydney Australia
a 17 Jack Sydney Australia
a 18 Jack Sydney Australia
a 19 Jack Sydney Australia
a 20 Jack Sydney Australia
a 21 Jack Sydney Australia
a 22 Jack Sydney Australia
a 23 Jack Sydney Australia
a 24 Jack Sydney Australia
a 25 Jack Sydney Australia
a 26 Jack Sydney Australia
a 27 Jack Sydney Australia
a 28 Jack Sydney Australia
a 29 Jack Sydney Australia
.. ... ... ... ...
b 1 Jill New York USA
b 2 Jill New York USA
b 3 Jill New York USA
b 4 Jill New York USA
b 5 Jill New York USA
b 6 Jill New York USA
b 7 Jill New York USA
b 8 Jill New York USA
b 9 Jill New York USA
b 10 Jill New York USA
b 11 Jill New York USA
b 12 Jill New York USA
b 13 Jill New York USA
b 14 Jill New York USA
b 15 Jill New York USA
b 16 Jill New York USA
b 17 Jill New York USA
b 18 Jill New York USA
b 19 Jill New York USA
b 20 Jill New York USA
b 21 Jill New York USA
b 22 Jill New York USA
b 23 Jill New York USA
b 24 Jill New York USA
b 25 Jill New York USA
b 26 Jill New York USA
b 27 Jill New York USA
b 28 Jill New York USA
b 29 Jill New York USA
b 30 Jill New York USA
[66 rows x 4 columns]
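On the memory concern raised in the question: after expansion, the text columns contain only a few distinct values repeated many times. One possible way to reduce the footprint, offered here as a sketch rather than a measured recommendation, is to convert those columns to the category dtype before expanding. The value N = 1000 and the variable names are assumptions chosen so the difference is visible; actual savings depend on the data.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame(
    [("Jack", 34, "Sydney", "Australia"), ("Jill", 30, "New York", "USA")],
    columns=["Name", "Age", "City", "Country"],
)

N = 1000  # hypothetical repeat count, large enough to show the effect
positions = np.repeat(np.arange(len(df)), N + 1)

# Plain expansion: text columns keep their original dtype.
plain = df.iloc[positions].reset_index(drop=True)

# Convert the text columns to categoricals first, then expand.
# A categorical stores small integer codes plus one copy of each distinct value.
text_cols = ["Name", "City", "Country"]
compact = (
    df.astype({col: "category" for col in text_cols})
    .iloc[positions]
    .reset_index(drop=True)
)

bytes_plain = plain.memory_usage(deep=True).sum()
bytes_compact = compact.memory_usage(deep=True).sum()
print(bytes_plain, bytes_compact)
```

The two frames hold the same values; only the storage of the text columns differs.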