Question

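I have two Series that are indexed in the same format. Below are snippets of both (I won't show the whole set because of the size of the data):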

>>> s1
Out[52]: 
parameter_id  parameter_type_cs_id
4959          1                        -0.2664122
4960          1                      -0.004289398
4961          1                      -0.006652875
4966          1                      -0.004208685
4967          1                       -0.02268688
4968          1                       -0.05958452
4969          1                       -0.01133198
4970          1                       -0.01968251
4972          1                       -0.05860331
4974          1                       -0.08260008
4975          1                       -0.05402012
4979          1                        -0.0308407
4980          1                       -0.02232495
4987          1                        -0.2315813
4990          1                       -0.02171027
...
727241        1                            -0.00156766
727242        1                          -0.0009964491
727243        1                           -0.007068732
727244        1                           -0.003500738
727245        1                           -0.006572505
727246        1                          -0.0005814131
728060        1                             -0.0144799
728062        1                             -0.0418521
728063        1                            -0.01367948
728065        1                            -0.03625054
728066        1                            -0.06806824
728068        1                           -0.007910916
728071        1                           -0.005482052
728073        1                           -0.005845178
intercept                             [-11.4551819018]
Name: coef, Length: 1529, dtype: object

>>> s2
Out[53]: 
parameter_id  parameter_type_cs_id
4958          1                       -0.001683882
4959          1                          -1.009859
4960          1                      -0.0004456379
4961          1                       -0.005564386
4963          1                         -0.9145955
4964          1                      -0.0009077246
4965          1                      -0.0003179153
4966          1                      -0.0006907124
4967          1                        -0.02125838
4968          1                        -0.02443978
4969          1                       -0.002665334
4970          1                       -0.003135213
4971          1                      -0.0003539563
4972          1                        -0.03684852
4973          1                      -0.0001203596
...
728044        1                          -0.0003084855
728060        1                              -0.925618
728061        1                           -0.001192743
728062        1                             -0.9203911
728063        1                           -0.002522615
728064        1                          -0.0003572484
728065        1                           -0.003475959
728066        1                            -0.02329697
728068        1                           -0.001412785
728069        1                           -0.002095895
728070        1                          -9.790675e-05
728071        1                          -0.0003013977
728072        1                          -0.0003369116
728073        1                           -0.000249748
intercept                             [-12.1281459287]
Name: coef, Length: 1898, dtype: object

The index formats are the same, so I tried to put them into a DataFrame as follows:

d = {'s1': s1, 's2': s2}
df = pd.DataFrame(d)

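But I noticed that the output is almost all NaN, which I found alarming. I looked at the indexes of the individual Series and noticed that the DataFrame holds the index values as strings rather than in the same format as the Series: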

>>> s1.index.values
Out[54]: 
array([(4959, 1), (4960, 1), (4961, 1), ..., (728071, 1), (728073, 1),
       ('intercept', '')], dtype=object)

>>> s2.index.values
Out[55]: 
array([(4958, 1), (4959, 1), (4960, 1), ..., (728072, 1), (728073, 1),
       ('intercept', '')], dtype=object)

But the DataFrame has strings:

>>> df.index.values
Out[56]: 
array([('4959', '1'), ('4960', '1'), ('4961', '1'), ..., ('8666', '1'),
       ('9638', '1'), ('intercept', '')], dtype=object)

Why does it change the type and cause my problem?

Even stranger to me, if I try the same thing on a smaller subset, I see the behavior I expect (not all NaN, and the index is not converted):

s1_ = s1[:15]
s2_ = s2[:15]
d_ = {'s1': s1_, 's2': s2_}
df_ = pd.DataFrame(d_) #<---- This has the behavior I would expect

Edit: I found an approach that works, but I'm not sure why it works. If I convert the two Series to DataFrames and then join them, it behaves as expected:

df_1 = pd.DataFrame({'s1': s1})
df_2 = pd.DataFrame({'s2': s2})
new_df = df_1.join(df_2) #WHY DOES THIS WAY WORK!?!?

Answer 1

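I don't have your data, but here is a small example showing that pandas builds the DataFrame as expected (using pandas 0.15.1 and Python 3.4). As expected, NaN is introduced where the indexes don't match.

The last row of your data is ('intercept', ''), whereas all the other rows are numeric. So ('intercept', '') goes into the index of each Series, which may cause the values in the index to be "promoted" to strings.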

>>> s1 = pd.Series([1,2,3], index=pd.MultiIndex.from_tuples([(1,1),(1,2),(1,3)], names=['a','b']))
>>> s1
a  b
1  1    1
   2    2
   3    3
dtype: int64
>>> s2 = pd.Series([100,200,300], index=pd.MultiIndex.from_tuples([(1,2),(1,3),(1,4)], names=['a','b']))
>>> 
>>> s2
a  b
1  2    100
   3    200
   4    300
dtype: int64
>>> df = pd.DataFrame({'s1':s1, 's2':s2})
>>> df
     s1   s2
a b         
1 1   1  NaN
  2   2  100
  3   3  200
  4 NaN  300
>>> df.index.values
array([(1, 1), (1, 2), (1, 3), (1, 4)], dtype=object)
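
To check whether an index really mixes numbers and strings, one option is to list the Python types found in each index level. This is only a sketch on made-up data: the values and the `level_types` name are assumptions for illustration, and only the level names are taken from the question.

```python
import pandas as pd

# Hypothetical miniature of the data in the question: numeric ids plus one
# ('intercept', '') entry in the same MultiIndex. The values are made up.
idx = pd.MultiIndex.from_tuples(
    [(4959, 1), (4960, 1), ("intercept", "")],
    names=["parameter_id", "parameter_type_cs_id"],
)
s = pd.Series([-0.27, -0.004, -11.46], index=idx, name="coef")

# Collect the names of the Python types present in each index level.
level_types = {
    name: {type(value).__name__ for value in s.index.get_level_values(name)}
    for name in s.index.names
}

for name, types in level_types.items():
    print(name, sorted(types))
```

A level that reports more than one type, for example both an integer type and `str`, is a candidate for the kind of promotion described above.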

Answer 2

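The index is converted to strings because the last index entry

intercept                             [-11.4551819018]

in the Series data is a string. The pandas DataFrame documentation states that when a DataFrame is constructed from a Series, the DataFrame keeps the same index as the Series, which results in everything being converted to strings because of that last row in the data.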

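Your solution of creating two DataFrames and then joining them works because the indexing stays consistent: you are using the same data structure (DataFrames) throughout rather than converting from one data structure (Series) to another (DataFrame). This seems to be a pandas-specific quirk. I would stick with your solution.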
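
If you would rather keep the index purely numeric, another possibility is to take the string entry out before combining the Series. The sketch below uses made-up values, and the helpers `make_series` and `split_intercept` are hypothetical names introduced only for this illustration. It assumes the intercept can be stored separately from the other coefficients.

```python
import pandas as pd

NAMES = ["parameter_id", "parameter_type_cs_id"]

def make_series(pairs, intercept):
    """Build a toy Series shaped like the ones in the question (made-up values)."""
    keys = [key for key, _ in pairs] + [("intercept", "")]
    vals = [val for _, val in pairs] + [intercept]
    index = pd.MultiIndex.from_tuples(keys, names=NAMES)
    return pd.Series(vals, index=index, name="coef")

def split_intercept(s):
    """Return (coefficients with a numeric-only index, intercept value)."""
    items = list(zip(s.index, s.tolist()))
    coef_items = [(key, val) for key, val in items if key[0] != "intercept"]
    intercepts = [val for key, val in items if key[0] == "intercept"]
    # Rebuild the index from the remaining tuples so no string entry is left.
    index = pd.MultiIndex.from_tuples(
        [key for key, _ in coef_items], names=list(s.index.names)
    )
    coefs = pd.Series([val for _, val in coef_items], index=index, name=s.name)
    return coefs, intercepts[0]

s1 = make_series([((4959, 1), -0.27), ((4960, 1), -0.004)], -11.46)
s2 = make_series([((4958, 1), -0.002), ((4959, 1), -1.01)], -12.13)

c1, b1 = split_intercept(s1)
c2, b2 = split_intercept(s2)

# With numeric-only indexes the Series align on their shared keys, and NaN
# appears only where a key is missing from one of them.
df = pd.DataFrame({"s1": c1, "s2": c2})
intercepts = pd.Series({"s1": b1, "s2": b2})

print(df)
print(intercepts)
```

The trade-off is that the intercepts end up in a separate object, so this only helps if that is acceptable for the later steps.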

Creating a pandas DataFrame from Series changes the index
