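A simple operation like df.A = sr (assigning a pandas.Series to a column of a pandas.DataFrame) looks harmless, but it has many subtleties. For someone like me who is just starting to learn pandas, it brings both a lot of convenience and a lot of confusion.
Here is a simple example/challenge: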
df:
+----+-----+
| | A |
|----+-----|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
+----+-----+
l = [777, 666, 555, 444, 333]
sr:
+----+-----+
| | 0 |
|----+-----|
| 7 | 777 |
| 6 | 666 |
| 5 | 555 |
| 4 | 444 |
| 3 | 333 |
+----+-----+
What does df look like after df.A = sr?
Or what does df look like after df.A = l?
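Based on my current understanding, I have broken down all the implicit operations in df.A = sr below. Please correct, confirm, or expand on them; for example, I am not sure I am using the right terminology.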
# [0] A column in a DataFrame is a Series, which behaves like a dictionary
#     mapping index labels to values. All cell-to-cell transfers are
#     key-lookup based; an individual element of an index is called a
#     "label" for a reason.
# [1] If sr lacks some of the index labels in df.A's index,
#     the old values in those cells of df.A get WIPED!
df.loc[~df.index.isin(sr.index), 'A'] = np.nan
# [2] Values are transferred from sr cells into df cells with common index labels,
#     as expected.
df.loc[df.index.isin(sr.index), 'A'] = \
    sr.loc[[idx for idx in sr.index if idx in df.index]]
# [3] sr's cells whose index labels are not found in df.index are ignored and
#     never get assigned into df.
sr.loc[~sr.index.isin(df.index)]  # goes nowhere.
# [4] With all the wiping and ignoring in the steps above,
#     there are no error messages or warnings,
#     so mistakes can slip through:
"""
df = pd.DataFrame(0, columns=['A'], index=np.arange(5))
df.loc[ df.index.isin( ['A', 'B']), 'A'] = sr
print(df)
df = pd.DataFrame(0, columns=['A'], index=[])
df.A = sr
print(df)
"""
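The decomposition above can be checked against a single explicit call. The following is a minimal sketch, assuming the setup from this question; it uses Series.reindex to spell out the label alignment and compares it with the implicit assignment. The names aligned, assigned, and explicit are only illustrative, and bracket assignment is used in place of df.A = sr, which behaves the same way for an existing column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(0, columns=["A"], index=np.arange(5))
sr = pd.Series([777, 666, 555, 444, 333], index=[7, 6, 5, 4, 3])

# Align sr to df's index explicitly: labels missing from sr become NaN,
# and labels that exist only in sr (5, 6, 7) are dropped.
aligned = sr.reindex(df.index)

assigned = df.copy()
assigned["A"] = sr        # implicit alignment on index labels

explicit = df.copy()
explicit["A"] = aligned   # the same alignment, spelled out

# Series.equals treats NaN values in the same positions as equal.
print(assigned["A"].equals(explicit["A"]))
```

Thinking of the assignment as a reindex followed by a positional copy covers steps [1] through [3] in one operation, which may be easier to remember than three separate rules.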
SPOILER. Setup and results:
df = pd.DataFrame(0, columns=['A'], index=np.arange(5))
l = [777, 666, 555, 444, 333]
sr = pd.Series(l, index=[7, 6, 5, 4, 3])
RESULTS:
df.A = sr
df:
+----+-----+
| | A |
|----+-----|
| 0 | nan |
| 1 | nan |
| 2 | nan |
| 3 | 333 |
| 4 | 444 |
+----+-----+
df.A = l
df:
+----+-----+
| | A |
|----+-----|
| 0 | 777 |
| 1 | 666 |
| 2 | 555 |
| 3 | 444 |
| 4 | 333 |
+----+-----+
Answer 0 (score: 2)
The result you are seeing comes from this line:
sr = pd.Series(l, index=[7, 6, 5, 4, 3])
You explicitly assigned the index labels [7, 6, 5, 4, 3] to the values of l.
When you then do:
df.A = sr
the Series keeps its index labels. And when you defined df:
df = pd.DataFrame(0, columns=['A'], index=np.arange(5))
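you made sure that its highest index label is 4 (index=np.arange(5)).
The assignment aligns sr to df's index by label, so only the labels the two share, 3 and 4, receive values in column A. Rows 0, 1, and 2 have no matching label in sr and become NaN.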
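Because the alignment raises no error or warning, it can help to inspect the label overlap before assigning. The following is a small sketch, assuming the setup from the question; the variable names shared, wiped, and ignored are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(0, columns=["A"], index=np.arange(5))
sr = pd.Series([777, 666, 555, 444, 333], index=[7, 6, 5, 4, 3])

shared = df.index.intersection(sr.index)   # labels that receive values
wiped = df.index.difference(sr.index)      # labels in df that become NaN
ignored = sr.index.difference(df.index)    # labels in sr that are dropped

print(sorted(shared), sorted(wiped), sorted(ignored))
```

If wiped or ignored is unexpectedly non-empty, the two indexes probably do not describe the same rows, and the assignment would silently lose data.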
When you do:
df.A = l
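you simply assign the values of l to column A by position, because a list has no index to align on, so all the values appear. If you change sr = pd.Series(l, index=[7, 6, 5, 4, 3]) to sr = pd.Series(l) and then set df.A = sr, you end up with exactly the same result as df.A = l.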
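The three routes to a positional result can be compared side by side. This is a minimal sketch, assuming the setup from the question; the last variant, which uses Series.to_numpy to discard the labels of the original sr, is an addition that does not appear in the answer above.

```python
import numpy as np
import pandas as pd

l = [777, 666, 555, 444, 333]
df = pd.DataFrame(0, columns=["A"], index=np.arange(5))
sr = pd.Series(l, index=[7, 6, 5, 4, 3])

by_list = df.copy()
by_list["A"] = l                        # a list carries no labels

by_default_index = df.copy()
by_default_index["A"] = pd.Series(l)    # default index 0..4 matches df.index

by_position = df.copy()
by_position["A"] = sr.to_numpy()        # labels dropped, values copied in order

print(by_list["A"].tolist())
print(by_default_index["A"].tolist())
print(by_position["A"].tolist())
```

Note that the positional forms require the number of values to match the number of rows in df, whereas the label-aligned form accepts a Series of any length.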