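I'm seeing some odd behavior with a pandas DataFrame composed of sparse Series. I can build the DF from dictionaries of values, fill values, and dtypes without any problem, but when I try to subset that DF I get some very strange results. What I can reproducibly show is that when subsetting a DF built from sparse Series, if a column happens to be the same throughout (i.e., every value matches the fill value), the subset DF converts those columns to NaN and the dtype changes to float64. Here is a piece of test code: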
import random
import numpy as np
import multiprocessing as mp
import pandas as pd
TEST_LINES = 10
samezeroint = [0 for i in range(TEST_LINES)]
sameoneint = [1 for i in range(TEST_LINES)]
samezerofloat = [0.0 for i in range(TEST_LINES)]
sameonefloat = [1.0 for i in range(TEST_LINES)]
indexone = [i for i in range(TEST_LINES)]
randomint = []
randomfloat = []
for i in range(TEST_LINES):
    randomint.append(random.randint(0,100))
    randomfloat.append(random.random())
testdict = {'indexone': indexone, "samezeroint": samezeroint, 'sameoneint': sameoneint, 'samezerofloat': samezerofloat, 'sameonefloat': sameonefloat, 'randomint': randomint, 'randomfloat': randomfloat}
filldict = {'indexone': 0, "samezeroint": 0, 'sameoneint': 1, 'samezerofloat': 0.0,
'sameonefloat': 1.0, 'randomint': random.randint(0,100), 'randomfloat': random.random()}
dtypedict = {'indexone': np.int8, "samezeroint": np.int8, 'sameoneint': np.int8, 'samezerofloat': np.float,
'sameonefloat': np.float, 'randomint': np.int8, 'randomfloat': np.float}
dospar = {}
for l in testdict:
    try:
        fill = filldict[l]
    except KeyError:
        fill = None
    try:
        datatype = dtypedict[l]
    except KeyError:
        datatype = np.str
    if fill is None:
        sparr = pd.Series(pd.array(testdict[l], dtype=datatype))
    else:
        sparr = pd.Series(pd.SparseArray(testdict[l], fill_value=fill, dtype=datatype))
    dospar[l] = sparr
testdf = pd.DataFrame.from_dict(dospar, orient='columns')
# Test a single series
print("\n\nSeries: All zeroes")
samezerointseries = pd.Series(pd.SparseArray(testdict['samezeroint'], fill_value=0, dtype=np.int8))
print("\nOriginal")
print(samezerointseries)
samezero = samezerointseries.isin([0])
samezerozero = samezerointseries[samezero]
print("\nFiltered: should be identical to above")
print(samezerozero)
sameone = samezerointseries.isin([1])
samezeroone = samezerointseries[sameone]
print("\nFiltered: should be empty")
print(samezeroone)
print("\n\nDataframe:")
with pd.option_context('display.max_rows', None, 'display.max_columns',
                       None):  # more options can be specified also
    print(testdf)
print(testdf.dtypes)
print("\n\nDataframe: should be identical to above")
intone = testdf.loc[:, 'sameoneint'].isin([int(1)])
print(intone)
onedf = testdf[intone]
with pd.option_context('display.max_rows', None, 'display.max_columns',
                       None):  # more options can be specified also
    print(onedf)
print(onedf.dtypes)
When I run this test, I get the following results:
Series: All zeroes
Original
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
dtype: Sparse[int8, 0]
Filtered: should be identical to above
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
dtype: Sparse[int8, 0]
Filtered: should be empty
Series([], dtype: Sparse[int8, 0])
Dataframe:
indexone samezeroint sameoneint samezerofloat sameonefloat randomint \
0 0 0 1 0.0 1.0 60
1 1 0 1 0.0 1.0 68
2 2 0 1 0.0 1.0 65
3 3 0 1 0.0 1.0 100
4 4 0 1 0.0 1.0 53
5 5 0 1 0.0 1.0 26
6 6 0 1 0.0 1.0 16
7 7 0 1 0.0 1.0 97
8 8 0 1 0.0 1.0 50
9 9 0 1 0.0 1.0 71
randomfloat
0 0.417370
1 0.970567
2 0.836402
3 0.029296
4 0.179799
5 0.928002
6 0.354385
7 0.646790
8 0.191453
9 0.088505
indexone Sparse[int8, 0]
samezeroint Sparse[int8, 0]
sameoneint Sparse[int8, 1]
samezerofloat Sparse[float64, 0.0]
sameonefloat Sparse[float64, 1.0]
randomint Sparse[int8, 49]
randomfloat Sparse[float64, 0.8838354729582943]
dtype: object
Dataframe: should be identical to above
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
Name: sameoneint, dtype: bool
indexone samezeroint sameoneint samezerofloat sameonefloat randomint \
0 0 NaN NaN NaN NaN 60
1 1 NaN NaN NaN NaN 68
2 2 NaN NaN NaN NaN 65
3 3 NaN NaN NaN NaN 100
4 4 NaN NaN NaN NaN 53
5 5 NaN NaN NaN NaN 26
6 6 NaN NaN NaN NaN 16
7 7 NaN NaN NaN NaN 97
8 8 NaN NaN NaN NaN 50
9 9 NaN NaN NaN NaN 71
randomfloat
0 0.417370
1 0.970567
2 0.836402
3 0.029296
4 0.179799
5 0.928002
6 0.354385
7 0.646790
8 0.191453
9 0.088505
indexone Sparse[int64, 0]
samezeroint Sparse[float64, 0]
sameoneint Sparse[float64, 1]
samezerofloat Sparse[float64, 0.0]
sameonefloat Sparse[float64, 1.0]
randomint Sparse[int8, 49]
randomfloat Sparse[float64, 0.8838354729582943]
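I am using the latest versions of all the imported modules. As you can hopefully see, since I am selecting all rows where the all-ones column equals 1, I should get an identical DF. Instead, only the columns with some variation in them come through intact.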
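To make that expectation concrete, a small check along the following lines could list which columns change dtype between the original frame and the subset. This is only a sketch: `changed_dtypes` is a hypothetical helper name of my own, not a pandas function, and the two tiny dense frames are made-up demo data rather than the frames from the test above.

```python
import numpy as np
import pandas as pd


def changed_dtypes(before, after):
    """Return {column: (old_dtype, new_dtype)} for columns whose dtype differs.

    Hypothetical helper for illustration; not part of pandas.
    """
    return {
        col: (before.dtypes[col], after.dtypes[col])
        for col in before.columns
        if col in after.columns and before.dtypes[col] != after.dtypes[col]
    }


# Made-up demo data: column "a" is upcast to float64, column "b" is untouched.
before = pd.DataFrame({
    "a": np.array([1, 1, 1], dtype=np.int8),
    "b": [0.5, 0.25, 0.75],
})
after = before.astype({"a": np.float64})
print(changed_dtypes(before, after))
```

Called as `changed_dtypes(testdf, onedf)` on the frames from the test, I would expect it to report the columns whose dtype lines differ between the two printouts above, such as `samezeroint` and `sameoneint`.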
For the subsetting command, I have tried every variant I could find:
newdf = testdf.loc[testdf['sameoneint'] == 1]
newdf = testdf.query('sameoneint == 1')
isone = testdf.loc[:, 'sameoneint'].isin([1])
newdf = testdf[isone]
None of these works any better, and some emit a warning about calling to_dense.
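So, am I missing something in the way I coded this, or is there something about how pandas works that I haven't figured out yet? Advice is most welcome!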
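For reference, one workaround that might be worth trying is to filter a dense copy and then cast each column back to its original sparse dtype. The sketch below assumes the data fits in memory when dense, and I have not confirmed that it avoids the problem on every pandas version. `filter_sparse_frame` is a hypothetical helper name, not a pandas function, and the demo frame is made-up data. It uses `pd.arrays.SparseArray`, which is the spelling in pandas 1.0 and later, rather than the `pd.SparseArray` used in the test above.

```python
import numpy as np
import pandas as pd


def filter_sparse_frame(df, mask):
    """Filter rows on a dense copy, then restore each column's original dtype.

    Hypothetical helper for illustration; not part of pandas.
    """
    original_dtypes = df.dtypes.to_dict()
    # np.asarray materializes every column, sparse or not, as a plain ndarray.
    dense = pd.DataFrame(
        {col: np.asarray(df[col]) for col in df.columns}, index=df.index
    )
    # Densify the mask too, in case it was computed from a sparse column.
    keep = np.asarray(mask, dtype=bool)
    # Cast back so sparse columns keep their subtype and fill value.
    return dense[keep].astype(original_dtypes)


# Made-up demo data: one constant column and one column with some variation.
demo = pd.DataFrame({
    "sameoneint": pd.arrays.SparseArray([1] * 5, fill_value=1, dtype=np.int8),
    "randomint": pd.arrays.SparseArray([3, 7, 3, 9, 3], fill_value=3, dtype=np.int8),
})
result = filter_sparse_frame(demo, demo["sameoneint"] == 1)
print(result)
print(result.dtypes)
```

The trade-off is that the whole frame is held dense for a moment, which gives up the memory savings of sparse storage during the filter step.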