Question

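I have two pandas DataFrames. One contains the actual data; the second contains the row indexes whose values I need to replace with some value.

Df1: input records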

    A         B      record_id   record_type
0  12342345  10         011           H
1  65767454  20         012           I
2  78545343  30         013           I
3  43455467  40         014           I

Df2: information about which row indexes need to be changed (for example, to # here)

   Column1  Column2  Column3  record_id
0        1        2        4     011
1        1        2        None  012
2        1        2        4     013
3        1        2        None  014

Output result:

   A          B         record_id  record_type
0  #          #         011           #
1  #          #         012           I
2  #          #         013           #
3  #          #         014           I

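So I want to look up by record_id and change the values at the corresponding row indexes.

Here, the row (1 2 4 011) in Df2 says that, for the Df1 record whose id is 011, we want to modify the first, second, and fourth row index.

So, in the output, the values at row indexes 1, 2, 4 of the record with id 011 are replaced with #.

Please suggest any other way to do the same in pandas.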

Answer 1

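First, you can do some preprocessing to make life easier. Set the index to record_id in both frames, then rename Column3 in df2 to record_type. Both DataFrames then share the same index and the record_type column name, so pandas can align them automatically.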

import numpy as np

df1 = df1.set_index('record_id')
df2 = df2.set_index('record_id')
df2 = df2.rename(columns={'Column3': 'record_type'})
df2 = df2.replace('None', np.nan)  # turn the 'None' strings into real missing values

Then we can fill the missing values of df2 from df1, and turn every value that was originally non-missing into '#'.

df2.fillna(df1).where(df2.isnull()).fillna('#')

          Column1 Column2 record_type
record_id                            
11              #       #           #
12              #       #           I
13              #       #           #
14              #       #           I
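
The result above keeps df2's labels (Column1, Column2) rather than df1's A and B. Below is a minimal sketch of one way to get the layout shown in the question. It rests on an interpretation the question does not state outright: that the numbers in Df2 are 1-based column positions in Df1 (1 = A, 2 = B, 4 = record_type). The sample frames are rebuilt by hand, and real missing values are used in Column3 instead of the 'None' strings.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (record_id kept as strings).
df1 = pd.DataFrame({
    "A": [12342345, 65767454, 78545343, 43455467],
    "B": [10, 20, 30, 40],
    "record_id": ["011", "012", "013", "014"],
    "record_type": ["H", "I", "I", "I"],
})
df2 = pd.DataFrame({
    "Column1": [1, 1, 1, 1],
    "Column2": [2, 2, 2, 2],
    "Column3": [4, None, 4, None],  # assumption: missing values, not 'None' strings
    "record_id": ["011", "012", "013", "014"],
})

# Long format: one row per (record_id, position) pair, missing positions dropped.
pairs = df2.melt(id_vars="record_id", value_name="pos").dropna(subset=["pos"])

# Translate record_id labels to row positions in df1 and
# 1-based column positions (assumed) to 0-based ones.
rows = pd.Index(df1["record_id"]).get_indexer(pairs["record_id"])
cols = pairs["pos"].astype(int).to_numpy() - 1

# Boolean mask with True at every cell that should become '#'.
mask = np.zeros(df1.shape, dtype=bool)
mask[rows, cols] = True

# Cast to object first so numeric columns can hold the '#' string.
result = df1.astype(object).mask(mask, "#")
print(result)
```

Building the mask once and applying it with `DataFrame.mask` avoids looping over records in Python; whether the position interpretation matches the asker's intent is left open.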

Answer 2

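"Here, the row (1 2 4 011) in Df2 says that, for the Df1 record whose id is 011, we want to modify the first, second, and fourth row index."

This doesn't make sense to me: the row with record_id = 011 does not itself contain further rows (you seem to want to select a first, second, and fourth row). Please complete the output with the exact values you expect.

Anyway, I ran into the same problem as in the title and solved it like this: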

Assume you have a DataFrame df and three equally long vectors rsel, csel (row/column selectors) and val (say, of length N), and would like to do the equivalent of

df.lookup(rsel, csel) = val

Then the following code works (at least) for python 3.6 and pandas v.0.23, assuming rsel does not contain duplicates!

Warning: this is not suited for large datasets, because it initialises a full square matrix of shape (N, N)!

import pandas as pd
import numpy as np
from functools import reduce

def coalesce(df, ltr=True):
    if not ltr:
        df = df.iloc[:, ::-1]  # flip left to right
    # use iloc as safeguard against non-unique column names
    list_of_series = [df.iloc[:, i] for i in range(len(df.columns))]
    # this is like a SQL coalesce
    return reduce(lambda interm, x: interm.combine_first(x), list_of_series)

# column names generally not unique!
square = pd.DataFrame(np.diag(val), index=rsel, columns=csel)
# np.diag creates 0s everywhere off-diagonal; set them to nan
square = square.where(np.diag([True] * len(rsel)))
# assuming no duplicates in rsel; this is empty
upd = pd.DataFrame(index=rsel, columns=sorted(csel.unique()))
# collapse square into upd
upd = upd.apply(lambda col: coalesce(square[square.columns == col.name]))
# actually update values
df.update(upd)

PS. If you know that you only have strings as column names, then square.filter(regex=col.name) is much faster than square[square.columns == col.name].
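
The pairwise assignment this answer is after (set the cell at row rsel[i], column csel[i] to val[i]) can also be sketched without an N x N intermediate. The version below is a swapped-in technique, not the answer's method: it translates labels to integer positions with `Index.get_indexer` and writes through NumPy integer indexing. The helper name `set_pairs` is hypothetical, and the sketch assumes unique row and column labels in df and a single dtype across df (a mixed-dtype frame would come back with object columns).

```python
import numpy as np
import pandas as pd

def set_pairs(df, rsel, csel, val):
    """Return a copy of df where the cell (rsel[i], csel[i]) holds val[i]."""
    # Label -> integer position; -1 marks a label that is not present.
    rows = df.index.get_indexer(rsel)
    cols = df.columns.get_indexer(csel)
    if (rows < 0).any() or (cols < 0).any():
        raise KeyError("rsel/csel contain labels that are not in df")
    # Work on a copy of the underlying array, then assign all pairs at once.
    arr = df.to_numpy(copy=True)
    arr[rows, cols] = np.asarray(val)
    return pd.DataFrame(arr, index=df.index, columns=df.columns)

# Usage on a small, made-up frame.
df = pd.DataFrame(np.zeros((3, 3)), index=["a", "b", "c"], columns=["x", "y", "z"])
out = set_pairs(df, ["a", "c", "a"], ["x", "y", "z"], [1.0, 2.0, 3.0])
print(out)
```

Memory use here grows with the size of df plus the number of pairs rather than with N squared, and repeated row labels in rsel are fine as long as each (row, column) pair is distinct.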
