我正在尝试集成多个数据源,并且我发现简要地合并我的列有点棘手。我的数据框看起来像这样:
df = pd.DataFrame([['Address Data','City data','State Data', 'Zip Data', np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
[ np.nan,np.nan,np.nan,np.nan,'Address Data','City data','State Data', 'Zip Data',np.nan,np.nan,np.nan,np.nan],
[ np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,'Address Data','City data','State Data', 'Zip Data']],
columns = ['Address1','City1','State1','Zip1','Address2','City2','State2','Zip2','Address3','City3','State3','Zip3'])
print df
Address1 City1 State1 Zip1 Address2 City2/
0 Address Data City Data State Data Zip Data NaN NaN
1 NaN NaN NaN NaN Address Data City Data
2 NaN NaN NaN NaN NaN NaN
State2 Zip2 Address3 City3 State3 Zip3
0 NaN NaN NaN NaN NaN NaN
1 State Data Zip Data NaN NaN NaN NaN
2 NaN NaN Address Data City Data State Data Zip Data
我试图结合这样的事情:
#run code here to combine multiple versions of the columns
print df
Address City State Zip
0 Address Data City data State Data Zip Data
1 Address Data City data State Data Zip Data
2 Address Data City data State Data Zip Data
我希望我可以使用这样的行一次设置多个列:
df.loc[df['State1'].notnull(),['Address','city','state','State','Zip']] = df.loc[df['State1'].notnull(),['Address1','City1','State1','Zip1']].values
但是,.loc不允许您一次创建多个列。任何关于如何组合数据的想法都将非常感激!
根据以下解决方案进行修改
该解决方案的问题是我的数据实际上可能看起来更像这样:
df = pd.DataFrame([['Address Data','City data','State Data', 'Zip Data', 'Address Data','City data','State Data', 'Zip Data',np.nan,np.nan,np.nan,np.nan],
[ np.nan,np.nan,np.nan,np.nan,'Address Data','City data','State Data', 'Zip Data',np.nan,np.nan,np.nan,np.nan],
[ np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,'Address Data','City data','State Data', 'Zip Data']],
columns = ['Address1','City1','State1','Zip1','Address2','City2','State2','Zip2','Address3','City3','State3','Zip3'])
在这种情况下,下面的解决方案给了我4行,当我只想要原始的3.我要说的是#34;如果第1部分有数据,请使用第1部分,否则如果第2部分有数据使用第2节,否则如果第3节有数据,则使用第3和第34节。每行都需要保持唯一,因为我没有合并行的许多其他属性。谢谢!
答案 0 :(得分:1)
IIUC您可以先按list
理解选择列名,然后使用pd.lreshape
:
a = ([col for col in df.columns if col.startswith('Address')])
c = ([col for col in df.columns if col.startswith('City')])
s = ([col for col in df.columns if col.startswith('State')])
z = ([col for col in df.columns if col.startswith('Zip')])
print (a)
print (c)
print (s)
print (z)
['Address1', 'Address2', 'Address3']
['City1', 'City2', 'City3']
['State1', 'State2', 'State3']
['Zip1', 'Zip2', 'Zip3']
df1 = pd.lreshape(df, {'Address':a, 'City' :c, 'State':s, 'Zip' :z})
print (df1)
Address State City Zip
0 Address Data State Data City data Zip Data
1 Address Data State Data City data Zip Data
2 Address Data State Data City data Zip Data
编辑:
如果您需要 NOT 删除NaN
个值的行,请添加参数dropna=False
:
df1 = pd.lreshape(df, {'Address':a, 'City' :c, 'State':s, 'Zip' :z}, dropna=False)
print (df1)
Address State City Zip
0 Address Data State Data City data Zip Data
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 Address Data State Data City data Zip Data
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 Address Data State Data City data Zip Data
另一个numpy解决方案,但数据可以很好地排序(样本是,实际上可能不是):
print (pd.DataFrame((df.values.reshape(9,4)), columns=['Address','City','State','Zip'])
.dropna())
Address City State Zip
0 Address Data City data State Data Zip Data
4 Address Data City data State Data Zip Data
8 Address Data City data State Data Zip Data
答案 1 :(得分:1)
我认为解决方案完全不同,所以我决定创建新答案:
import pandas as pd
import numpy as np
#random dataframe
np.random.seed(1)
df1 = pd.DataFrame(np.random.randint(10, size=(3,9)))
df1.columns = ['a1','b1','c1','a2','b2','c2','a3','b3','c3']
df1.loc[[1,2],['a1','b1','c1']] = np.nan
print (df1)
a1 b1 c1 a2 b2 c2 a3 b3 c3
0 5.0 8.0 9.0 5 0 0 1 7 6
1 NaN NaN NaN 5 2 4 2 4 7
2 NaN NaN NaN 7 0 6 9 9 7
#stack dataframe and extract string and numbers from column e
df = df1.stack().reset_index()
df.columns= ['d','e','f']
df[['g','h']] = df.e.str.extract(r'([a-zA-Z]+)([0-9]+)', expand=True)
#append 1 to d, because index starts from 1 and compare with h
df = df[df.d + 1 == df.h.astype(int)]
#remove columns h, e
df = df.drop(['h', 'e'], axis=1)
#reshaping
df = df.pivot(index='d', columns='g', values='f')
#remove index and columns names (pandas 0.18.0+)
df = df.rename_axis(None).rename_axis(None, axis=1)
print (df)
a b c
0 5.0 8.0 9.0
1 5.0 2.0 4.0
2 9.0 9.0 7.0
编辑:我尝试修改你的样本:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([['Address Data1','City data1','State Data1', 'Zip Data1', 'Address Data2','City data2','State Data2', 'Zip Data2',np.nan,np.nan,np.nan,np.nan],
[ np.nan,np.nan,np.nan,np.nan,'Address Data3','City data3','State Data3', 'Zip Data3',np.nan,np.nan,np.nan,np.nan],
[ np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,'Address Data4','City data4','State Data4', 'Zip Data4']],
columns = ['Address1','City1','State1','Zip1','Address2','City2','State2','Zip2','Address3','City3','State3','Zip3'])
print (df1)
Address1 City1 State1 Zip1 Address2 \
0 Address Data1 City data1 State Data1 Zip Data1 Address Data2
1 NaN NaN NaN NaN Address Data3
2 NaN NaN NaN NaN NaN
City2 State2 Zip2 Address3 City3 State3 \
0 City data2 State Data2 Zip Data2 NaN NaN NaN
1 City data3 State Data3 Zip Data3 NaN NaN NaN
2 NaN NaN NaN Address Data4 City data4 State Data4
Zip3
0 NaN
1 NaN
2 Zip Data4
#stack dataframe and extract string and numbers from column e
df = df1.stack().reset_index()
df.columns= ['d','e','f']
df[['g','h']] = df.e.str.extract(r'([a-zA-Z]+)([0-9]+)', expand=True)
#append 1 to d, because index starts from 1 and compare with h
df = df[df.d + 1 == df.h.astype(int)]
#remove columns h, e
df = df.drop(['h', 'e'], axis=1)
#reshaping
df = df.pivot(index='d', columns='g', values='f')
df = df.rename_axis(None).rename_axis(None, axis=1)
print (df)
Address City State Zip
0 Address Data1 City data1 State Data1 Zip Data1
1 Address Data3 City data3 State Data3 Zip Data3
2 Address Data4 City data4 State Data4 Zip Data4
答案 2 :(得分:0)
这是怎么回事?
uniq_cols = df.columns.str.replace(r'\d+$', '').unique()
new = pd.DataFrame(columns=uniq_cols)
for col in uniq_cols:
new[col] = df.filter(like=col).apply(lambda x: x.dropna()[0], axis=1)
输出:
In [282]: new
Out[282]:
Address City State Zip
0 Address Data City data State Data Zip Data
1 Address Data City data State Data Zip Data
2 Address Data City data State Data Zip Data
说明:
In [283]: uniq_cols
Out[283]: array(['Address', 'City', 'State', 'Zip'], dtype=object)
In [284]: df.filter(like='Address').apply(lambda x: x.dropna()[0], axis=1)
Out[284]:
0 Address Data
1 Address Data
2 Address Data
dtype: object