Question

For a given data frame...

data = pd.DataFrame([[1., 6.5], [1., np.nan],[5, 3], [6.5, 3.], [2, np.nan]])

...which looks like this...

    0       1
0   1.0     6.5
1   1.0     NaN
2   5.0     3.0
3   6.5     3.0
4   2.0     NaN

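...I want to create a third column in which every missing value in the second column is replaced with a consecutive number. So the result should look like this: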

    0       1     2
0   1.0     6.5   NaN
1   1.0     NaN   1
2   5.0     3.0   NaN
3   6.5     3.0   NaN
4   2.0     NaN   2

(My data frame has many more rows, so imagine 70 missing values in the second column, so that the last number in the third column is 70.)

How can I create the third column?

Answer 1

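You can do it like this. I took the liberty of renaming the columns to avoid confusion about what I am selecting; you can do the same with your data frame using: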

data = data.rename(columns={0:'a',1:'b'})

In [41]:

data.merge(pd.DataFrame({'c':range(1,len(data[data.b.isnull()]) + 1)}, index=data[data.b.isnull()].index),how='left', left_index=True, right_index=True)
Out[41]:
     a    b   c
0  1.0  6.5 NaN
1  1.0  NaN   1
2  5.0  3.0 NaN
3  6.5  3.0 NaN
4  2.0  NaN   2

[5 rows x 3 columns]

Here is an explanation of the one-liner:

# we want just the rows where column 'b' is null:
data[data.b.isnull()]

# now construct a dataset of the length of this dataframe starting from 1:
range(1,len(data[data.b.isnull()]) + 1) # note we have to add a 1 at the end

# construct a new dataframe from this and crucially use the index of the null values:
pd.DataFrame({'c':range(1,len(data[data.b.isnull()]) + 1)}, index=data[data.b.isnull()].index)

# now perform a merge and tell it we want to perform a left merge and use both sides indices, I've removed the verbose dataframe construction and replaced with new_df here but you get the point
data.merge(new_df,how='left', left_index=True, right_index=True)
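For anyone who wants to run the steps above end to end, here is a minimal sketch that rebuilds the sample frame from the question and names each intermediate result. The variable names missing, new_df and merged are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame from the question, with the columns renamed to 'a' and 'b'
data = pd.DataFrame([[1., 6.5], [1., np.nan], [5, 3], [6.5, 3.], [2, np.nan]])
data = data.rename(columns={0: 'a', 1: 'b'})

# Rows where column 'b' is missing
missing = data[data['b'].isnull()]

# One consecutive number per missing row, stored under the original index labels
new_df = pd.DataFrame({'c': range(1, len(missing) + 1)}, index=missing.index)

# Left merge on the index: rows with no match in new_df get NaN in 'c'
merged = data.merge(new_df, how='left', left_index=True, right_index=True)
print(merged)
```

The key point is that new_df reuses the index labels of the missing rows, so the index-on-index merge puts each number back on the right row.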

Edit

You can also do this another way, using @Karl.D's suggestion:

In [56]:

data['c'] = data['b'].isnull().cumsum().where(data['b'].isnull())
data
Out[56]:
     a    b   c
0  1.0  6.5 NaN
1  1.0  NaN   1
2  5.0  3.0 NaN
3  6.5  3.0 NaN
4  2.0  NaN   2

[5 rows x 3 columns]
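To see why the cumsum/where expression works, it may help to split it into its parts. This is only a sketch on the sample data; is_missing and running are names I made up for the intermediate results.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [1., 1., 5., 6.5, 2.],
                     'b': [6.5, np.nan, 3., 3., np.nan]})

# Boolean mask: True where 'b' is missing
is_missing = data['b'].isnull()

# Running count of missing values seen so far (True counts as 1)
running = is_missing.cumsum()

# Keep the running count only on the missing rows; other rows become NaN
data['c'] = running.where(is_missing)
print(data)
```

The running count only increases on rows where 'b' is missing, so masking it with the same condition leaves exactly the consecutive numbers 1, 2, ... on those rows.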

Timings also suggest that Karl's method will be faster for larger datasets, but I would profile this:

In [57]:

%timeit data.merge(pd.DataFrame({'c':range(1,len(data[data.b.isnull()]) + 1)}, index=data[data.b.isnull()].index),how='left', left_index=True, right_index=True)
%timeit data['c'] = data['b'].isnull().cumsum().where(data['b'].isnull())
1000 loops, best of 3: 1.31 ms per loop
1000 loops, best of 3: 501 µs per loop
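If you want to profile this on something bigger than five rows, a rough sketch using the standard timeit module could look like the following. The frame size, the share of missing values and the helper names with_merge and with_cumsum are all assumptions for illustration, and the numbers will vary with your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger frame: 100,000 rows, roughly 30% of 'b' missing
rng = np.random.default_rng(0)
n = 100_000
b = rng.random(n)
b[rng.random(n) < 0.3] = np.nan
big = pd.DataFrame({'a': rng.random(n), 'b': b})

def with_merge(df):
    # Merge approach: number the missing rows, then left-merge on the index
    mask = df['b'].isnull()
    new_df = pd.DataFrame({'c': range(1, int(mask.sum()) + 1)},
                          index=df.index[mask.to_numpy()])
    return df.merge(new_df, how='left', left_index=True, right_index=True)

def with_cumsum(df):
    # cumsum/where approach, returning a new frame instead of mutating df
    mask = df['b'].isnull()
    return df.assign(c=mask.cumsum().where(mask))

# Sanity check: both approaches should produce the same column
same = np.array_equal(with_merge(big)['c'].to_numpy(),
                      with_cumsum(big)['c'].to_numpy(), equal_nan=True)
print('same result:', same)

# Total seconds for 5 runs of each approach
print('merge :', timeit.timeit(lambda: with_merge(big), number=5))
print('cumsum:', timeit.timeit(lambda: with_cumsum(big), number=5))
```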
