Question

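I'm seeing some odd behavior with a pandas DataFrame built from sparse Series. I can build the DataFrame from dictionaries of fill values and dtypes without any problem, but when I try to subset that DataFrame I get some very strange results. What I can reproducibly show is that, when subsetting a DataFrame created from sparse Series, any column that holds the same value throughout (i.e. every value matches the fill value) is converted to NaN in the subset DataFrame, and its dtype becomes float64. Here is a piece of test code: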

import random
import numpy as np
import multiprocessing as mp
import pandas as pd


TEST_LINES = 10


samezeroint = [0 for i in range(TEST_LINES)]
sameoneint = [1 for i in range(TEST_LINES)]
samezerofloat = [0.0 for i in range(TEST_LINES)]
sameonefloat = [1.0 for i in range(TEST_LINES)]
indexone = [i for i in range(TEST_LINES)]

randomint = []
randomfloat = []

for i in range(TEST_LINES):
    randomint.append(random.randint(0,100))
    randomfloat.append(random.random())

testdict = {'indexone': indexone, "samezeroint": samezeroint, 'sameoneint': sameoneint, 'samezerofloat': samezerofloat, 'sameonefloat': sameonefloat, 'randomint': randomint, 'randomfloat': randomfloat}
filldict = {'indexone': 0, "samezeroint": 0, 'sameoneint': 1, 'samezerofloat': 0.0,
            'sameonefloat': 1.0, 'randomint': random.randint(0,100), 'randomfloat': random.random()}
dtypedict = {'indexone': np.int8, "samezeroint": np.int8, 'sameoneint': np.int8, 'samezerofloat': np.float64,
            'sameonefloat': np.float64, 'randomint': np.int8, 'randomfloat': np.float64}


dospar = {}
for l in testdict:
    try:
        fill = filldict[l]
    except KeyError:
        fill = None
    try:
        datatype = dtypedict[l]
    except KeyError:
        datatype = str
    if fill is None:
        sparr = pd.Series(pd.array(testdict[l], dtype=datatype))
    else:
        sparr = pd.Series(pd.arrays.SparseArray(testdict[l], fill_value=fill, dtype=datatype))
    dospar[l] = sparr
testdf = pd.DataFrame.from_dict(dospar, orient='columns')

# Test a single series

print("\n\nSeries: All zeroes")
samezerointseries = pd.Series(pd.arrays.SparseArray(testdict['samezeroint'], fill_value=0, dtype=np.int8))
print("\nOriginal")
print(samezerointseries)
samezero = samezerointseries.isin([0])
samezerozero = samezerointseries[samezero]
print("\nFiltered: should be identical to above")
print(samezerozero)
sameone = samezerointseries.isin([1])
samezeroone = samezerointseries[sameone]
print("\nFiltered: should be empty")
print(samezeroone)


print("\n\nDataframe:")
with pd.option_context('display.max_rows', None, 'display.max_columns',
                       None):  # more options can be specified also
    print(testdf)
    print(testdf.dtypes)

print("\n\nDataframe: should be identical to above")
intone = testdf.loc[:, 'sameoneint'].isin([int(1)])
print(intone)
onedf = testdf[intone]
with pd.option_context('display.max_rows', None, 'display.max_columns',
                       None):  # more options can be specified also
    print(onedf)
    print(onedf.dtypes)

When I run this test, I get the following output:



Series: All zeroes

Original
0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
dtype: Sparse[int8, 0]

Filtered: should be identical to above
0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
dtype: Sparse[int8, 0]

Filtered: should be empty
Series([], dtype: Sparse[int8, 0])


Dataframe:
   indexone  samezeroint  sameoneint  samezerofloat  sameonefloat  randomint  \
0         0            0           1            0.0           1.0         60   
1         1            0           1            0.0           1.0         68   
2         2            0           1            0.0           1.0         65   
3         3            0           1            0.0           1.0        100   
4         4            0           1            0.0           1.0         53   
5         5            0           1            0.0           1.0         26   
6         6            0           1            0.0           1.0         16   
7         7            0           1            0.0           1.0         97   
8         8            0           1            0.0           1.0         50   
9         9            0           1            0.0           1.0         71   

   randomfloat  
0     0.417370  
1     0.970567  
2     0.836402  
3     0.029296  
4     0.179799  
5     0.928002  
6     0.354385  
7     0.646790  
8     0.191453  
9     0.088505  
indexone                             Sparse[int8, 0]
samezeroint                          Sparse[int8, 0]
sameoneint                           Sparse[int8, 1]
samezerofloat                   Sparse[float64, 0.0]
sameonefloat                    Sparse[float64, 1.0]
randomint                           Sparse[int8, 49]
randomfloat      Sparse[float64, 0.8838354729582943]
dtype: object


Dataframe: should be identical to above
0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
8    True
9    True
Name: sameoneint, dtype: bool
   indexone  samezeroint  sameoneint  samezerofloat  sameonefloat  randomint  \
0         0          NaN         NaN            NaN           NaN         60   
1         1          NaN         NaN            NaN           NaN         68   
2         2          NaN         NaN            NaN           NaN         65   
3         3          NaN         NaN            NaN           NaN        100   
4         4          NaN         NaN            NaN           NaN         53   
5         5          NaN         NaN            NaN           NaN         26   
6         6          NaN         NaN            NaN           NaN         16   
7         7          NaN         NaN            NaN           NaN         97   
8         8          NaN         NaN            NaN           NaN         50   
9         9          NaN         NaN            NaN           NaN         71   

   randomfloat  
0     0.417370  
1     0.970567  
2     0.836402  
3     0.029296  
4     0.179799  
5     0.928002  
6     0.354385  
7     0.646790  
8     0.191453  
9     0.088505  
indexone                            Sparse[int64, 0]
samezeroint                       Sparse[float64, 0]
sameoneint                        Sparse[float64, 1]
samezerofloat                   Sparse[float64, 0.0]
sameonefloat                    Sparse[float64, 1.0]
randomint                           Sparse[int8, 49]
randomfloat      Sparse[float64, 0.8838354729582943]

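I'm using the latest versions of all the imported modules. As you can hopefully see, since I am selecting every row whose value in a constant column equals that constant (which is all of the rows), I should get back an identical DataFrame. Instead, only the columns that have some variation in their values come through intact.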
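To make the expectation above checkable, here is a minimal sketch that subsets a sparse frame and its dense equivalent with the same mask and reports which columns differ. The helper names `build_frames` and `columns_changed_by_subsetting` are hypothetical, the two-column data is a cut-down version of the test frame, and `pd.arrays.SparseArray` is assumed to be available (it replaces the older top-level `pd.SparseArray`). Whether any column is reported depends on the installed pandas version.

```python
import numpy as np
import pandas as pd


def build_frames(n_rows=10):
    """Return (dense, sparse) frames holding the same illustrative values."""
    dense = pd.DataFrame({
        "indexone": np.arange(n_rows, dtype=np.int8),
        "sameoneint": np.ones(n_rows, dtype=np.int8),
    })
    fills = {"indexone": 0, "sameoneint": 1}  # same fill values as in the question
    sparse = pd.DataFrame({
        name: pd.arrays.SparseArray(dense[name].to_numpy(), fill_value=fills[name])
        for name in dense.columns
    })
    return dense, sparse


def columns_changed_by_subsetting(dense, other, mask):
    """Names of columns whose values differ between dense[mask] and other[mask]."""
    expected = dense[mask]
    actual = other[mask]
    changed = []
    for name in dense.columns:
        # Densify both sides and compare as floats so NaN shows up as a mismatch.
        want = np.asarray(expected[name]).astype(float)
        got = np.asarray(actual[name]).astype(float)
        if not np.array_equal(want, got):
            changed.append(name)
    return changed


dense_df, sparse_df = build_frames()
always_true = dense_df["sameoneint"] == 1  # mask computed on the dense column
print(columns_changed_by_subsetting(dense_df, sparse_df, always_true))
```

An empty list means the sparse subset matched the dense one. A non-empty list names the columns that were altered, which on the setup described in this question would be expected to include the constant column.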

For the subsetting command, I have tried every variant I could find:

newdf = testdf.loc[testdf['sameoneint'] == 1]

newdf = testdf.query('sameoneint == 1')

isone = testdf.loc[:, 'sameoneint'].isin([1])

newdf = testdf[isone]

None of these work any better, and some of them emit a warning about calling to_dense.
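Since every variant above filters the sparse frame directly, one possible workaround is to filter a dense copy and then restore the original sparse dtypes. The sketch below is an untested suggestion rather than a confirmed fix for this issue: `subset_sparse_frame` is a hypothetical helper, and it trades the memory savings of sparse storage for a temporary dense copy.

```python
import pandas as pd


def subset_sparse_frame(df, mask):
    """Boolean-subset a frame with sparse columns by going through dense data."""
    # Remember each sparse column's dtype (value type plus fill value).
    sparse_dtypes = {
        name: dtype
        for name, dtype in df.dtypes.items()
        if isinstance(dtype, pd.SparseDtype)
    }
    dense = df.copy()
    for name in sparse_dtypes:
        dense[name] = df[name].sparse.to_dense()
    subset = dense[mask]
    # Convert the filtered columns back to their original sparse dtypes.
    return subset.astype(sparse_dtypes) if sparse_dtypes else subset
```

With the frame from the question, this would be called as `subset_sparse_frame(testdf, testdf['sameoneint'].sparse.to_dense() == 1)`, building the mask from a dense column so that the comparison itself does not go through sparse code paths.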

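So, am I missing something in the way I've coded this, or is there something about how pandas works that I haven't figured out yet? Advice is most welcome!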

Pandas DataFrame subsetting returns NaN when an entire column matches the fill value

0 Answers: