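I am trying to modify a medium-sized dataset (about 5k entries) with pandas. Unfortunately, the first column serves as a group identifier and has empty cells, so I have not found a way to process the data correctly. The data looks like this: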
Column1    Column2             Column3
USER1      details on user1    more details on user 1
N/A        details on user1    more details on user 1
N/A        details on user1    more details on user 1
N/A        details on user1    more details on user 1
N/A        details on user1    more details on user 1
USER2      details on user2    more details on user 2
N/A        details on user2    more details on user 2
N/A        details on user2    more details on user 2
N/A        details on user2    more details on user 2
N/A        details on user2    more details on user 2
Unfortunately, df.groupby() in pandas does not work here, because the rows with an empty Column1 are not assigned to the user they belong to.
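To illustrate the groupby problem described above, here is a minimal sketch. It assumes the empty or "N/A" cells arrive in pandas as NaN (which is how pandas readers treat them by default) and uses a small made-up frame rather than the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up sample with the same shape as the question's data.
df = pd.DataFrame({
    "column1": ["USER1", np.nan, np.nan, "USER2", np.nan],
    "column2": ["a", "b", "c", "d", "e"],
})

# By default groupby drops rows whose key is NaN, so only the
# header row of each user is counted.
sizes_raw = df.groupby("column1").size()

# Grouping on a forward-filled copy of the key assigns every row
# to the user named above it, without modifying df itself.
sizes_filled = df.groupby(df["column1"].ffill()).size()

print(sizes_raw)
print(sizes_filled)
```

Grouping on a filled copy of the key leaves the original column untouched, so the sheet stays as readable as before.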
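One approach would be to simply replace the "N/A" cells in the initial dataset with the corresponding user identifier. However, that makes the dataset less readable (I pull the xlsx from Google Spreadsheets, modify it with pandas, and then republish it to Google Spreadsheets so that it can be worked with there).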
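My (ideal) workflow is: 1. Take the dataset with the structure shown above. 2. Merge a second dataset with the dataset from step 1, using the user identifiers in Column1 as the key.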
Answer 0 (score: 0)
Once the data is loaded into a pandas DataFrame, you can copy Column1 into a helper column:
df['column1_v2'] = df.column1
You can then pad (forward-fill) column1_v2 so that it can be used in the merge:
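# fillna(method='pad') is deprecated in recent pandas; ffill() does the same thing
df['column1_v2'] = df['column1_v2'].ffill()
# assumes df2 also has a column named 'column1_v2' holding the user identifiers
df = pd.merge(df, df2, how='left', on='column1_v2')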
Finally, before writing the result back to xlsx and Google Sheets, simply drop the column you created for the merge:
df = df.drop('column1_v2', axis=1)
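The steps of this answer can be put together as one runnable sketch. The sample frames and the `team` column are hypothetical, and the sketch assumes the second dataset stores its user identifiers in a column called `column1`, which is renamed on the fly to match the helper column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the spreadsheet and the second dataset.
df = pd.DataFrame({
    "column1": ["USER1", np.nan, np.nan, "USER2", np.nan],
    "column2": ["d1", "d1", "d1", "d2", "d2"],
})
df2 = pd.DataFrame({"column1": ["USER1", "USER2"], "team": ["red", "blue"]})

# Helper column: a forward-filled copy of the sparse identifier.
df["column1_v2"] = df["column1"].ffill()

# Left merge on the helper column so every row picks up its user's data.
merged = pd.merge(
    df,
    df2.rename(columns={"column1": "column1_v2"}),
    how="left",
    on="column1_v2",
)

# Drop the helper; column1 keeps its original blanks for the export.
merged = merged.drop(columns="column1_v2")
print(merged)
```

Because only the helper column is filled, the exported sheet keeps the same sparse first column it had on the way in.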
Answer 1 (score: 0)
This is how I would do it. For 5k entries, performance should be fine.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'column1' : ['User1', np.nan, np.nan, np.nan,
                                 'User2', np.nan, np.nan, np.nan],
                    'column2' : ['details user 1(1)','details user 1(2)',
                                 'details user 1(3)','details user 1(4)',
                                 'details user 2(1)','details user 2(2)',
                                 'details user 2(3)','details user 2(4)'],
                    'column3' : ['more details user 1(1)','more details user 1(2)',
                                 'more details user 1(3)','more details user 1(4)',
                                 'more details user 2(1)','more details user 2(3)',
                                 'more details user 2(3)','more details user 2(4)']})
print(df1)
# column1 column2 column3
#0 User1 details user 1(1) more details user 1(1)
#1 NaN details user 1(2) more details user 1(2)
#2 NaN details user 1(3) more details user 1(3)
#3 NaN details user 1(4) more details user 1(4)
#4 User2 details user 2(1) more details user 2(1)
#5 NaN details user 2(2) more details user 2(3)
#6 NaN details user 2(3) more details user 2(3)
#7 NaN details user 2(4) more details user 2(4)
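def rename_column1(df1):
    # Fill blank cells in column1 with the most recent user seen above them.
    list1 = []
    temp = None
    for value in df1['column1']:
        # pd.isna is safer than "is np.nan", which misses NaN values
        # that are not the np.nan object itself (e.g. after read_excel)
        if not pd.isna(value):
            temp = value
        list1.append(temp)
    df1['column1'] = list1
    return df1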
rename_column1(df1)
print(df1)
# column1 column2 column3
#0 User1 details user 1(1) more details user 1(1)
#1 User1 details user 1(2) more details user 1(2)
#2 User1 details user 1(3) more details user 1(3)
#3 User1 details user 1(4) more details user 1(4)
#4 User2 details user 2(1) more details user 2(1)
#5 User2 details user 2(2) more details user 2(3)
#6 User2 details user 2(3) more details user 2(3)
#7 User2 details user 2(4) more details user 2(4)
df1 = df1.groupby(['column1'], as_index = False, sort = False).agg(', '.join)
print(df1)
# column1 column2 column3
# 0 User1 details user 1(1), details user 1(2), details ... more details user 1(1), more details user 1(2)...
# 1 User2 details user 2(1), details user 2(2), details ... more details user 2(1), more details user 2(3)...
df2 = pd.DataFrame({'column1': ['User1','User2'],
'new_c2' : [0,0],
'new_c3' : [0,0]})
print(df2)
# column1 new_c2 new_c3
#0 User1 0 0
#1 User2 0 0
df3 = pd.merge(df1, df2, on = 'column1', how = 'left')
print(df3)
# column1 column2 column3 new_c2 new_c3
# 0 User1 details user 1(1), details user 1(2), details ... more details user 1(1), more details user 1(2)... 0 0
# 1 User2 details user 2(1), details user 2(2), details ... more details user 2(1), more details user 2(3)... 0 0
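As a possible shortcut, the hand-written fill loop in this answer could be replaced with the built-in `ffill`, and the blanks could be restored afterwards if the sparse layout is wanted back for the spreadsheet. This is only a sketch on a shortened, made-up frame; the names `filled`, `is_first` and `restored` are illustrative, and it assumes each user's rows sit in one contiguous block.

```python
import numpy as np
import pandas as pd

# Shortened, made-up version of df1 from the answer above.
df1 = pd.DataFrame({
    "column1": ["User1", np.nan, np.nan, "User2", np.nan],
    "column2": ["a", "b", "c", "d", "e"],
})

# Built-in forward fill: same result as filling by hand in a loop.
filled = df1.assign(column1=df1["column1"].ffill())

# True on the first row of each consecutive run of the same user.
is_first = filled["column1"] != filled["column1"].shift()

# Keep the identifier only on those first rows; the rest become NaN again.
restored = filled.assign(column1=filled["column1"].where(is_first))

print(filled)
print(restored)
```

Any per-row processing or merging can be done on `filled`, and `restored` gives back the original look before the data is republished.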