Python大熊猫,将一个表中的多列与另一表中的一列合并

时间:2019-12-22 12:45:20

标签: python pandas

假设我有这样的表:

+----------+------------+----------+------------+----------+------------+-------+
| a_name_0 | id_qname_0 | a_name_1 | id_qname_1 | a_name_2 | id_qname_2 | count |
+----------+------------+----------+------------+----------+------------+-------+
| country  | 1          | NAN      | NAN        | NAN      | NAN        | 100   |
+----------+------------+----------+------------+----------+------------+-------+
| region   | 2          | city     | NAN        | NAN      | NAN        | 20    |
+----------+------------+----------+------------+----------+------------+-------+
| region   | 2          | city     | NAN        | NAN      | NAN        | 80    |
+----------+------------+----------+------------+----------+------------+-------+
| region   | 3          | age      | 4          | sex      | 6          | 40    |
+----------+------------+----------+------------+----------+------------+-------+
| region   | 3          | age      | 5          | sex      | 7          | 60    |
+----------+------------+----------+------------+----------+------------+-------+

并且我想LEFT JOIN并在panadas的a_name列上使用下表:

+----+---------+-------+-------+-------+
| id | a_name  | c01   | c02   | c03   |
+----+---------+-------+-------+-------+
| 1  | country | dtr1  | dtr2  | dtr3  |
+----+---------+-------+-------+-------+
| 2  | region  | dtc1  | dtc2  | dtc3  |
+----+---------+-------+-------+-------+
| 3  | city    | dta1  | dta2  | dta3  |
+----+---------+-------+-------+-------+
| 4  | age     | dtCo1 | dtCo2 | dtCo3 |
+----+---------+-------+-------+-------+
| 5  | sex     | dts1  | dts2  | dts3  |
+----+---------+-------+-------+-------+

我想向第一个表的c01, c02 and c03列中出现的每个值(country ,region, city, age,sex)添加a_name_0, a_name_1 and a_name_2列。

显然,我需要为出现在a_name_0, a_name_1 and a_name_2列中的每个值添加三个新列,否则我的表将具有不同数量的行。其余的行值应为空,或者为NA或NAN ..无论如何。

预期输出:

+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| a_name_0 | c01_0 | c01_0 | c01_0 | id_qname_0 | a_name_1 | c01_1 | c01_1 | c01_1 | id_qname_1 | a_name_2 | c01_2 | c01_2 | c01_2 | id_qname_2 | count |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| country  | dtCo1 | dtCo2 | dtCo3 | 1          | NAN      | NAN   | NAN   | NAN   | NAN        | NAN      | NAN   | NAN   | NAN   | NAN        | 70    |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region   | dtr1  | dtr2  | dtr2  | 2          | city     | dtc1  | dtc2  | dtc3  | NAN        | NAN      | NAN   | NAN   | NAN   | NAN        | 20    |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region   |       |       |       | 2          | city     |       |       |       | NAN        | NAN      |       |       |       | NAN        | 20    |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region   |       |       |       | 3          | age      |       |       |       | 4          | sex      |       |       |       | 6          | 40    |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region   |       |       |       | 3          | age      |       |       |       | 5          | sex      |       |       |       | 7          | 60    |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+

说明:

我正在构建数据仓库表,该表将用于数据分析目的。报价表(第一个表)应填充有需要直观表示的各种项目报价信息(表2)。

2 个答案:

答案 0 :(得分:1)

使用外部联接合并数据框,并指定要在其上联接表的列(来自每个数据框)。

# Sample data
>>> A
  name_1 name_2  values
0      a      b       1
1      b      c       2
2      c      b       3
3      d      a       4

>>> B
  name  values
0    a       1
1    b       2
2    c       3

>>> C
  name  values
0    a      10
1    b      20
2    c      30

使用merge()方法,您可以指定要合并其数据框的列。将how参数设置为outer将指定外部联接,这将用NaN填充不匹配的数据点。

# Merging
>>> merge1 = A.merge(B, left_on='name_1', right_on='name', how='outer')
>>> merge1
  name_1 name_2  values_x name  values_y
0      a      b         1    a       1.0
1      b      c         2    b       2.0
2      c      b         3    c       3.0
3      d      a         4  NaN       NaN

>>> merge = merge1.merge(C, left_on='name_2', right_on='name', how='outer')
>>> merge
  name_1 name_2  values_x name_x  values_y name_y  values
0      a      b         1      a       1.0      b      20
1      c      b         3      c       3.0      b      20
2      b      c         2      b       2.0      c      30
3      d      a         4    NaN       NaN      a      10

答案 1 :(得分:1)

使用:

#convert count column to index for possible processing all another cols by groups
df1 = df1.set_index('count')
#groups by last value after last _
c = df1.columns.str.rsplit('_').str[-1]
#removed unnecessary id column from df2
df2 = df2.drop('id', axis=1)

#for list of DataFrames
dfs = []
#iterate groups
for i, x in df1.groupby(c, axis=1):
    #change columns names for match and for avoid duplicated columns names
    df2.columns = [ f'a_name_{i}'] + (df2.columns + f'_{i}').tolist()[1:]
    #left join
    x = x.merge(df2, on=f'a_name_{i}', how='left')
    #convert duplicates by a_name columns to NaNs
    m = x.duplicated(subset=[x.columns[0]])
    x.iloc[m.to_numpy(), 2:] = np.nan
    #convert id_qname columns to end
    x[f'id_qname_{i}'] = x.pop(f'id_qname_{i}')
    #append to list
    dfs.append(x)

#join together and last add count column from index 
df = pd.concat(dfs, axis=1).assign(count=df1.index)

print (df)
  a_name_0 c01_0 c02_0 c03_0  id_qname_0 a_name_1 c01_0_1 c02_0_1 c03_0_1  \
0  country  dtr1  dtr2  dtr3           1      NaN     NaN     NaN     NaN   
1   region  dtc1  dtc2  dtc3           2     city    dta1    dta2    dta3   
2   region   NaN   NaN   NaN           2     city     NaN     NaN     NaN   
3   region   NaN   NaN   NaN           3      age   dtCo1   dtCo2   dtCo3   
4   region   NaN   NaN   NaN           3      age     NaN     NaN     NaN   

   id_qname_1 a_name_2 c01_0_1_2 c02_0_1_2 c03_0_1_2  id_qname_2  count  
0         NaN      NaN       NaN       NaN       NaN         NaN    100  
1         NaN      NaN       NaN       NaN       NaN         NaN     20  
2         NaN      NaN       NaN       NaN       NaN         NaN     80  
3         4.0      sex      dts1      dts2      dts3         6.0     40  
4         5.0      sex       NaN       NaN       NaN         7.0     60