我需要在数据框中的列之间映射值。我的数据框中的值类型为list。这是我的数据框的具体示例:
date tablenameFrom tablenameJoin attribute
1 01-03-2019 [item] [] [itemID, itemName]
2 02-03-2019 [order as o] [customer as c, payment as p] [o.oderID, p.paymentID,c.firstName]
3 03-03-2019 [customer c] [order o] [c.customerID, o.orderID]
这是我尝试过的:How to match the value of the column between columns in a dataframe, if there is an empty column。但是解决方案无法处理数据框的第一行,即该行具有多个属性名称。
这是我尝试过的:
import numpy as np
def unnesting(df, explode):
idx = df.index.repeat(df[explode[0]].str.len())
df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x
in explode], axis=1)
df1.index = idx
return df1.join(df.drop(explode, 1), how='left')
df['tablename']=df['tablenameFrom']+df['tablenameJoin']
yourdf=unnesting(df[['date','tablename','attribute']], ['tablename','attribute'])
yourdf['tablename']=yourdf['tablename'].str.split().str[0]# fix the output
yourdf['attribute']=yourdf['attribute'].str.split(r'[''|.]').str[-1]
yourdf
我收到错误消息:ValueError: Length mismatch: Expected axis has 561 elements, new values have 412 elements
我需要将属性映射到表名。我的预期结果:
date tablename attributename
1 01-03-2019 item itemID
2 01-03-2019 item itemName
3 02-03-2019 order orderID
4 30-03-2019 customer firstName
5 30-03-2019 payment paymentID
6 31-03-2019 customer customerID
7 31-03-2019 order orderID
我想知道是否有人可以给我一些解决方案。非常感谢。
答案 0 :(得分:1)
将zip_longest
与None
一起用于缺失值,最后每组分别按ffill
和bfill
替换它们:
d = {'date': ['29-03-2019', '30-03-2019', '31-03-2019'],
'tablenameFrom': [['customer'], ['order as o'], ['customer']],
'tablenameJoin': [[], ['customer as c', 'payment as p'], ['order']],
'attribute': [['customerID', 'another'], ['o.oderID', 'p.paymentID', 'c.firstName'],
['customerID', 'orderID']]}
df = pd.DataFrame(d, index=[1,2,3])
print (df)
date tablenameFrom tablenameJoin \
1 29-03-2019 [customer] []
2 30-03-2019 [order as o] [customer as c, payment as p]
3 31-03-2019 [customer] [order]
attribute
1 [customerID, another]
2 [o.oderID, p.paymentID, c.firstName]
3 [customerID, orderID]
from itertools import zip_longest
x = df['attribute']
y = df['tablenameFrom'] + df['tablenameJoin']
a = [(m, l, k) for m, (i, j)
in enumerate(zip_longest(x, y, fillvalue=[None]))
for k, l
in zip_longest(i, j, fillvalue=None)]
#print (a)
df1 = pd.DataFrame(a, columns=['date','tablename','attributename'])
df1['date'] = df['date'].values[df1['date'].values]
df1['tablename'] = df1.groupby('date')['tablename'].apply(lambda x: x.ffill().bfill())
使用map
可获得正确的匹配值:
df2 = df1['tablename'].str.split(' as ', expand=True)
s = df2.dropna().drop_duplicates(1).set_index(1)[0]
print (s)
1
o order
c customer
p payment
Name: 0, dtype: object
df1['attributename'] = df2[1].map(s).fillna(df1['attributename'])
df1['tablename'] = df2[0]
print (df1)
date tablename attributename
0 29-03-2019 customer customerID
1 29-03-2019 customer another
2 30-03-2019 order order
3 30-03-2019 customer customer
4 30-03-2019 payment payment
5 31-03-2019 customer customerID
6 31-03-2019 order orderID
答案 1 :(得分:0)
d = {'date': ['29-03-2019', '30-03-2019', '31-03-2019'],
'tablenameFrom': [['item'], ['order as o'], ['customer']],
'tablenameJoin': [[], ['customer as c', 'payment as p'], ['order']],
'attribute': [['customerID', 'another'], ['o.oderID', 'p.paymentID', 'c.firstName'],
['customerID', 'orderID']]}
d = pd.DataFrame(d, index=[1,2,3])
>>> d
date tablenameFrom tablenameJoin attribute
1 29-03-2019 [item] [] [customerID, another]
2 30-03-2019 [order as o] [customer as c, payment as p] [o.oderID, p.paymentID, c.firstName]
3 31-03-2019 [customer] [order] [customerID, orderID]
df_list = []
cols = d.columns
for col in d.columns:
df_ = d[col].apply(pd.Series).stack().reset_index().set_index('level_0')
df_ = df_.drop(columns='level_1')
df_list.append(df_)
nw_df = df_list[0]
for df_ in df_list[1:]:
nw_df = pd.merge(nw_df,df_,on='level_0',how='outer')
nw_df.columns = cols
>>> nw_df