Suppose I have a table like this:
+----------+------------+----------+------------+----------+------------+-------+
| a_name_0 | id_qname_0 | a_name_1 | id_qname_1 | a_name_2 | id_qname_2 | count |
+----------+------------+----------+------------+----------+------------+-------+
| country | 1 | NAN | NAN | NAN | NAN | 100 |
+----------+------------+----------+------------+----------+------------+-------+
| region | 2 | city | NAN | NAN | NAN | 20 |
+----------+------------+----------+------------+----------+------------+-------+
| region | 2 | city | NAN | NAN | NAN | 80 |
+----------+------------+----------+------------+----------+------------+-------+
| region | 3 | age | 4 | sex | 6 | 40 |
+----------+------------+----------+------------+----------+------------+-------+
| region | 3 | age | 5 | sex | 7 | 60 |
+----------+------------+----------+------------+----------+------------+-------+
and I want to LEFT JOIN it in pandas with the following table, on its a_name column:
+----+---------+-------+-------+-------+
| id | a_name | c01 | c02 | c03 |
+----+---------+-------+-------+-------+
| 1 | country | dtr1 | dtr2 | dtr3 |
+----+---------+-------+-------+-------+
| 2 | region | dtc1 | dtc2 | dtc3 |
+----+---------+-------+-------+-------+
| 3 | city | dta1 | dta2 | dta3 |
+----+---------+-------+-------+-------+
| 4 | age | dtCo1 | dtCo2 | dtCo3 |
+----+---------+-------+-------+-------+
| 5 | sex | dts1 | dts2 | dts3 |
+----+---------+-------+-------+-------+
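I want to add the c01, c02 and c03 columns to the first table for every value (country, region, city, age, sex) that appears in its a_name_0, a_name_1 and a_name_2 columns.
Obviously, I need to add three new columns for each of the a_name_0, a_name_1 and a_name_2 columns, otherwise my table will end up with a different number of rows. The remaining row values should be empty, NA, or NaN, whichever.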
Expected output:
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| a_name_0 | c01_0 | c02_0 | c03_0 | id_qname_0 | a_name_1 | c01_1 | c02_1 | c03_1 | id_qname_1 | a_name_2 | c01_2 | c02_2 | c03_2 | id_qname_2 | count |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| country | dtCo1 | dtCo2 | dtCo3 | 1 | NAN | NAN | NAN | NAN | NAN | NAN | NAN | NAN | NAN | NAN | 70 |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region | dtr1 | dtr2 | dtr2 | 2 | city | dtc1 | dtc2 | dtc3 | NAN | NAN | NAN | NAN | NAN | NAN | 20 |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region | | | | 2 | city | | | | NAN | NAN | | | | NAN | 20 |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region | | | | 3 | age | | | | 4 | sex | | | | 6 | 40 |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region | | | | 3 | age | | | | 5 | sex | | | | 7 | 60 |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
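Explanation:
I am building a data warehouse table that will be used for data analysis. The first table (the quote table) should be populated with the information about the various items (table 2) that needs to be presented visually.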
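To make the question reproducible, here is a minimal sketch that builds the two tables as DataFrames and runs a single left join on a_name_0. The variable names df1 and df2 and the reading of the NAN cells as missing values are assumptions, not something the question states.

```python
import numpy as np
import pandas as pd

# First table (assumption: the NAN cells are missing values)
df1 = pd.DataFrame({
    "a_name_0": ["country", "region", "region", "region", "region"],
    "id_qname_0": [1, 2, 2, 3, 3],
    "a_name_1": [np.nan, "city", "city", "age", "age"],
    "id_qname_1": [np.nan, np.nan, np.nan, 4, 5],
    "a_name_2": [np.nan, np.nan, np.nan, "sex", "sex"],
    "id_qname_2": [np.nan, np.nan, np.nan, 6, 7],
    "count": [100, 20, 80, 40, 60],
})

# Second (lookup) table
df2 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "a_name": ["country", "region", "city", "age", "sex"],
    "c01": ["dtr1", "dtc1", "dta1", "dtCo1", "dts1"],
    "c02": ["dtr2", "dtc2", "dta2", "dtCo2", "dts2"],
    "c03": ["dtr3", "dtc3", "dta3", "dtCo3", "dts3"],
})

# Left join for one name column only: suffix the lookup columns with _0
# so that a_name becomes a_name_0 and c01..c03 become c01_0..c03_0.
lookup_0 = df2.drop(columns="id").add_suffix("_0")
joined = df1.merge(lookup_0, on="a_name_0", how="left")
print(joined.shape)
```

Because a_name is unique in the lookup table, the left join keeps the five rows of df1 and only adds columns.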
Answer 0: (score: 1)
Merge the dataframes with an outer join, specifying the column from each dataframe on which to join the tables.
# Sample data
>>> A
name_1 name_2 values
0 a b 1
1 b c 2
2 c b 3
3 d a 4
>>> B
name values
0 a 1
1 b 2
2 c 3
>>> C
name values
0 a 10
1 b 20
2 c 30
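With the merge() method you can specify the columns on which to merge the dataframes. Setting the how parameter to outer specifies an outer join, which fills unmatched data points with NaN.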
# Merging
>>> merge1 = A.merge(B, left_on='name_1', right_on='name', how='outer')
>>> merge1
name_1 name_2 values_x name values_y
0 a b 1 a 1.0
1 b c 2 b 2.0
2 c b 3 c 3.0
3 d a 4 NaN NaN
>>> merge = merge1.merge(C, left_on='name_2', right_on='name', how='outer')
>>> merge
name_1 name_2 values_x name_x values_y name_y values
0 a b 1 a 1.0 b 20
1 c b 3 c 3.0 b 20
2 b c 2 b 2.0 c 30
3 d a 4 NaN NaN a 10
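Since the question asks for a LEFT JOIN, the same sample data can also be merged with how='left'. The sketch below rebuilds A and B as they are printed above. The suffixes and indicator arguments are additions for illustration and are not part of the answer.

```python
import pandas as pd

# Sample frames, rebuilt from the printed output above
A = pd.DataFrame({
    "name_1": ["a", "b", "c", "d"],
    "name_2": ["b", "c", "b", "a"],
    "values": [1, 2, 3, 4],
})
B = pd.DataFrame({"name": ["a", "b", "c"], "values": [1, 2, 3]})

# A left join keeps every row of A, in A's order.
# suffixes names the clashing "values" columns explicitly;
# indicator=True adds a _merge column showing where each row came from.
left = A.merge(
    B,
    left_on="name_1",
    right_on="name",
    how="left",
    suffixes=("_A", "_B"),
    indicator=True,
)
print(left)
```

The row with name_1 == 'd' has no partner in B, so its name and values_B cells are NaN and its _merge value is left_only.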
Answer 1: (score: 1)
Use:
import numpy as np
import pandas as pd

# move count into the index so that only the a_name_*/id_qname_* columns are grouped
df1 = df1.set_index('count')
# group key: the part after the last underscore ('0', '1', '2')
c = df1.columns.str.rsplit('_').str[-1]
# drop the id column of df2, which is not needed for the join
df2 = df2.drop('id', axis=1)
# list of per-group DataFrames
dfs = []
# iterate over the groups (boolean column selection instead of
# groupby(..., axis=1), which is deprecated in recent pandas versions)
for i in c.unique():
    x = df1.loc[:, c == i]
    # suffix the lookup columns so the key matches a_name_{i} and names stay unique;
    # a fresh copy per iteration avoids accumulating suffixes such as c01_0_1_2
    right = df2.add_suffix(f'_{i}')
    # left join
    x = x.merge(right, on=f'a_name_{i}', how='left')
    # set the joined values to NaN on rows that repeat an a_name_{i} value
    m = x.duplicated(subset=[x.columns[0]])
    x.iloc[m.to_numpy(), 2:] = np.nan
    # move the id_qname_{i} column to the end of the group
    x[f'id_qname_{i}'] = x.pop(f'id_qname_{i}')
    # append to the list
    dfs.append(x)
# join the groups side by side and restore the count column from the index
df = pd.concat(dfs, axis=1).assign(count=df1.index)
print (df)
  a_name_0 c01_0 c02_0 c03_0  id_qname_0 a_name_1 c01_1 c02_1 c03_1  \
0  country  dtr1  dtr2  dtr3           1      NaN   NaN   NaN   NaN
1   region  dtc1  dtc2  dtc3           2     city  dta1  dta2  dta3
2   region   NaN   NaN   NaN           2     city   NaN   NaN   NaN
3   region   NaN   NaN   NaN           3      age dtCo1 dtCo2 dtCo3
4   region   NaN   NaN   NaN           3      age   NaN   NaN   NaN
   id_qname_1 a_name_2 c01_2 c02_2 c03_2  id_qname_2  count
0         NaN      NaN   NaN   NaN   NaN         NaN    100
1         NaN      NaN   NaN   NaN   NaN         NaN     20
2         NaN      NaN   NaN   NaN   NaN         NaN     80
3         4.0      sex  dts1  dts2  dts3         6.0     40
4         5.0      sex   NaN   NaN   NaN         7.0     60
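The step that blanks out repeated rows relies on DataFrame.duplicated, which marks every occurrence of a key after the first one. A small isolated sketch of that step follows. The demo frame is hypothetical, with values borrowed from the a_name_1 group, and it uses label-based .loc instead of the positional .iloc from the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical frame shaped like one group right after the left join
demo = pd.DataFrame({
    "a_name_1": [np.nan, "city", "city", "age", "age"],
    "c01_1": [np.nan, "dta1", "dta1", "dtCo1", "dtCo1"],
})

# True for every row whose a_name_1 value was already seen further up
dup = demo.duplicated(subset=["a_name_1"])

# Keep the joined value only on the first occurrence of each name
demo.loc[dup, "c01_1"] = np.nan
print(demo)
```

Note that duplicated treats missing keys as equal to each other, so in a column that is mostly NaN (such as a_name_2) every NaN row after the first is also flagged. That is harmless here because those rows carry no joined values anyway.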