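I have a large dataset (4 million rows, 50 columns) stored via pandas store.append. When I use store.select or read_hdf to query for 2 columns greater than some value (i.e. "(a > 10) & (b > 1)"), I get about 15,000 rows back.
When I read the entire table into, say, df, and do df[(df.a > 10) & (df.b > 1)], I get 30,000 rows. I narrowed the problem down: when I read in the whole table and run df.query("(a > 10) & (b > 1)"), I get the same 15,000 rows, but when I set the engine to python, i.e. df.query("(a > 10) & (b > 1)", engine='python'), I get the 30,000 rows.
I suspect it has something to do with the eval / numexpr machinery used for querying in both HDF and the query method.
The dtypes of columns a and b are float64, and even if I query with floats (i.e. 1. instead of 1), the problem persists.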
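The comparison described above can be sketched roughly as follows. This is only an illustration on a tiny made-up frame, not the actual data: the helper name count_matches and the values in demo are assumptions, and on a current pandas all paths are expected to agree, so the sketch shows how to run the check rather than reproduce the discrepancy.

```python
import numpy as np
import pandas as pd

EXPR = "(a > 10) & (b > 1)"


def count_matches(df):
    """Count rows matching the filter via each evaluation path (hypothetical helper)."""
    counts = {
        # Plain boolean indexing: comparisons against NaN evaluate to False.
        "mask": int(((df["a"] > 10) & (df["b"] > 1)).sum()),
        # DataFrame.query evaluated by the pure-Python engine.
        "python": len(df.query(EXPR, engine="python")),
    }
    try:
        # DataFrame.query evaluated by numexpr, when it is installed.
        counts["numexpr"] = len(df.query(EXPR, engine="numexpr"))
    except ImportError:
        counts["numexpr"] = None
    return counts


# Made-up data with NaN in both filtered columns.
demo = pd.DataFrame({
    "a": [11.0, 5.0, np.nan, 20.0, 34.0],
    "b": [2.0, 2.0, 2.0, np.nan, 1.5],
})
counts = count_matches(demo)
print(counts)
```

If the counts differ between engines on the real table, that points at the evaluation engine rather than at the stored data.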
I would appreciate any feedback, or to hear whether anyone else has run into the same problem, so that we can get it resolved.
Regards, Neil
========================
INSTALLED VERSIONS
commit: None
python: 2.7.6.final.0
python-bits: 32
OS: Darwin
OS-release: 13.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.14.1
nose: 1.3.3
Cython: None
numpy: 1.8.0
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 1.2.1
sphinx: 1.2.2
patsy: 0.2.0
scikits.timeseries: 0.91.3
dateutil: 2.2
pytz: 2013.8
bottleneck: 0.7.0
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.3.1
openpyxl: 2.0.3
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: None
html5lib: 0.95-dev
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None
Int64Index: 15533 entries, 67302 to 142465
Data columns (total 47 columns):
date 15533 non-null datetime64[ns]
text 15533 non-null object
date2 1090 non-null datetime64[ns]
x1 15533 non-null float64
x2 15533 non-null float64
x3 15533 non-null float64
x4 15533 non-null float64
x5 15533 non-null float64
x6 15533 non-null float64
x7 15533 non-null float64
x8 15533 non-null float64
x9 15533 non-null float64
x10 15533 non-null float64
x11 15533 non-null float64
x12 15533 non-null float64
x13 15533 non-null float64
x14 15533 non-null float64
x15 15533 non-null float64
x16 15533 non-null float64
x17 15533 non-null float64
x18 15533 non-null float64
a 15533 non-null float64
x19 15533 non-null float64
x20 15533 non-null float64
x21 15533 non-null float64
x22 15533 non-null float64
x23 15533 non-null float64
x24 15533 non-null float64
b 15533 non-null float64
x25 15533 non-null float64
x26 15533 non-null float64
x27 15533 non-null float64
x28 15533 non-null float64
x29 15533 non-null float64
x30 15533 non-null float64
x31 15497 non-null float64
x32 15497 non-null float64
x33 15497 non-null float64
x34 15497 non-null float64
x35 15533 non-null int64
x36 15533 non-null int64
x37 15533 non-null int64
x38 15533 non-null int64
x39 15533 non-null int64
x40 15533 non-null int64
x41 15533 non-null int64
x42 15533 non-null int64
dtypes: datetime64[ns](2), float64(36), int64(8), object(1)
/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.1',
TITLE := '',
VERSION := '1.0']
/MKT (Group) ''
/MKT._v_attrs (AttributeSet), 14 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := ['date', 'text', 'a', 'x20', 'x23', 'x24', 'b', 'x25', 'x26', 'x35', 'x36', 'x37', 'x38', 'x39', 'x40', 'x41', 'x42'],
encoding := None,
index_cols := [(0, 'index')],
info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
levels := 1,
nan_rep := 'nan',
non_index_axes := [(1, ['date', 'text', 'date2', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x13', 'x14', 'x15', 'x16', 'x17', 'x18', 'a', 'x19', 'x20', 'x21', 'x22', 'x23', 'x24', 'b', 'x25', 'x26', 'x27', 'x28', 'x29', 'x30', 'x31', 'x32', 'x33', 'x34', 'x35', 'x36', 'x37', 'x38', 'x39', 'x40', 'x41', 'x42'])],
pandas_type := 'frame_table',
pandas_version := '0.10.1',
table_type := 'appendable_frame',
values_cols := ['values_block_0', 'values_block_1', 'date', 'text', 'a', 'x20', 'x23', 'x24', 'b', 'x25', 'x26', 'x35', 'x36', 'x37', 'x38', 'x39', 'x40', 'x41', 'x42']]
/MKT/table (Table(3637597,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Int64Col(shape=(1,), dflt=0, pos=1),
"values_block_1": Float64Col(shape=(29,), dflt=0.0, pos=2),
"date": Int64Col(shape=(), dflt=0, pos=3),
"text": StringCol(itemsize=30, shape=(), dflt='', pos=4),
"a": Float64Col(shape=(), dflt=0.0, pos=5),
"x20": Float64Col(shape=(), dflt=0.0, pos=6),
"x23": Float64Col(shape=(), dflt=0.0, pos=7),
"x24": Float64Col(shape=(), dflt=0.0, pos=8),
"b": Float64Col(shape=(), dflt=0.0, pos=9),
"x25": Float64Col(shape=(), dflt=0.0, pos=10),
"x26": Float64Col(shape=(), dflt=0.0, pos=11),
"x35": Int64Col(shape=(), dflt=0, pos=12),
"x36": Int64Col(shape=(), dflt=0, pos=13),
"x37": Int64Col(shape=(), dflt=0, pos=14),
"x38": Int64Col(shape=(), dflt=0, pos=15),
"x39": Int64Col(shape=(), dflt=0, pos=16),
"x40": Int64Col(shape=(), dflt=0, pos=17),
"x41": Int64Col(shape=(), dflt=0, pos=18),
"x42": Int64Col(shape=(), dflt=0, pos=19)}
byteorder := 'little'
chunkshape := (322,)
autoindex := True
colindexes := {
"x41": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x20": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x37": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x42": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x26": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x38": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x40": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"date": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x36": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"text": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x23": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x39": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x25": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x24": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"a": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x35": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"b": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/MKT/table._v_attrs (AttributeSet), 83 attributes:
[CLASS := 'TABLE',
x23_dtype := 'float64',
x23_kind := ['x23'],
x20_dtype := 'float64',
x20_kind := ['x20'],
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_10_FILL := 0.0,
FIELD_10_NAME := 'x25',
FIELD_11_FILL := 0.0,
FIELD_11_NAME := 'x26',
FIELD_12_FILL := 0,
FIELD_12_NAME := 'x35',
FIELD_13_FILL := 0,
FIELD_13_NAME := 'x36',
FIELD_14_FILL := 0,
FIELD_14_NAME := 'x37',
FIELD_15_FILL := 0,
FIELD_15_NAME := 'x38',
FIELD_16_FILL := 0,
FIELD_16_NAME := 'x39',
FIELD_17_FILL := 0,
FIELD_17_NAME := 'x40',
FIELD_18_FILL := 0,
FIELD_18_NAME := 'x41',
FIELD_19_FILL := 0,
FIELD_19_NAME := 'x42',
FIELD_1_FILL := 0,
FIELD_1_NAME := 'values_block_0',
FIELD_2_FILL := 0.0,
FIELD_2_NAME := 'values_block_1',
FIELD_3_FILL := 0,
FIELD_3_NAME := 'date',
FIELD_4_FILL := '',
FIELD_4_NAME := 'text',
FIELD_5_FILL := 0.0,
FIELD_5_NAME := 'a',
FIELD_6_FILL := 0.0,
FIELD_6_NAME := 'x20',
FIELD_7_FILL := 0.0,
FIELD_7_NAME := 'x23',
FIELD_8_FILL := 0.0,
FIELD_8_NAME := 'x24',
FIELD_9_FILL := 0.0,
FIELD_9_NAME := 'b',
a_dtype := 'float64',
a_kind := ['a'],
NROWS := 3637597,
TITLE := '',
VERSION := '2.7',
x24_dtype := 'float64',
x24_kind := ['x24'],
b_dtype := 'float64',
b_kind := ['b'],
x25_dtype := 'float64',
x25_kind := ['x25'],
x26_dtype := 'float64',
x26_kind := ['x26'],
date_dtype := 'datetime64',
date_kind := ['date'],
x39_dtype := 'int64',
x39_kind := ['x39'],
x37_dtype := 'int64',
x37_kind := ['x37'],
x41_dtype := 'int64',
x41_kind := ['x41'],
x35_dtype := 'int64',
x35_kind := ['x35'],
x40_dtype := 'int64',
x40_kind := ['x40'],
x38_dtype := 'int64',
x38_kind := ['x38'],
x42_dtype := 'int64',
x42_kind := ['x42'],
x36_dtype := 'int64',
x36_kind := ['x36'],
index_kind := 'integer',
text_dtype := 'string240',
text_kind := ['text'],
values_block_0_dtype := 'datetime64',
values_block_0_kind := ['date2'],
values_block_1_dtype := 'float64',
values_block_1_kind := ['x22', 'x18', 'x21', 'x16', 'x19', 'x17', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x29', 'x30', 'x28', 'x2', 'x1', 'x3', 'x10', 'x27', 'x11', 'x12', 'x13', 'x14', 'x15', 'x33', 'x32', 'x34', 'x31']]
import pandas as pd
df = pd.DataFrame()
store = pd.HDFStore('/Users/neil/MKT.h5')
df = store.select('MKT', "(a > 10) & (b > 1)")
store.close()
store = pd.HDFStore('/Users/neil/MKT.h5')
listofsearchablevars = ['date', 'text', 'a', 'x20', 'x23', 'x24', 'b', 'x25', 'x26', 'x35', 'x36', 'x37', 'x38', 'x39', 'x40', 'x41', 'x42']
df = .....
store.append('MKT', df, data_columns = listofsearchablevars, nan_rep = 'nan', chunksize=500000, min_itemsize = {'values': 30})
store.close()
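EDIT: in response to the request for some sample data....
For clarity, let's call the 15,000 results "INCORRECT", the 30,000 results "CORRECT", and the items that are in CORRECT but not in INCORRECT "ONLY IN CORRECT".
I have confirmed that every row/item in INCORRECT is also found in CORRECT.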
9869 9870
date 2001-08-10 00:00:00 2001-08-17 00:00:00
text DCR DCR
date2 NaN NaN
x19 1.9 1.8396
x18 1.98 1.9
x20 1.8 1.8
x9 2.54 2.54
x10 5.25 5.125
x11 9.625 9.625
x12 1.61 1.7
x13 1.05 1.05
x14 1.05 1.05
x21 75700 64800
x23 140992.7 116948.9
x24 0.0008284454 0.0007097211
x25 0.002580505 0.002630241
x26 0.001540047 0.001440302
x27 0.001850877 0.001832468
x5 17.915 17.915
x8 17.915 17.915
x2 34.0379 32.9563
a 34.0385 32.95643
x6 -42.80079 -42.80079
x7 -8.762288 -9.844354
x4 0 0
x1 -0.0003349149 -0.0003349149
x3 -0.0003349149 -0.0003349149
x28 1.579e+07 1.579e+07
b 1.261029 1.302433
x29 1.284075 1.326236
x30 1.488814 1.537697
x22 -0.2891579 -0.3205045
x17 0.31 0.31
x15 0.84 0.84
x16 2.5937 2.5937
x34 6.895 7.105
x32 -1.29055 -1.35055
x31 -0.77 -0.63
x33 -0.665 -0.49
x38 1 1
x42 0 0
x36 0 0
x40 0 0
x35 0 0
x39 0 0
x37 0 0
x41 0 0
153641 153642
date 2008-08-22 00:00:00 2008-08-29 00:00:00
text PRL PRL
date2 NaN NaN
x19 1.9 1.88
x18 1.95 1.94
x20 1.85 1.87
x9 2.07 2.07
x10 2.23 2.23
x11 2.94 2.94
x12 1.75 1.75
x13 1.71 1.71
x14 1.69 1.69
x21 133549 73525
x23 254119.1 140764.5
x24 0.001485416 0.0008315729
x25 0.001227271 0.001204803
x26 0.001006876 0.001048327
x27 0.0009764919 0.0009638125
x5 18.008 18.008
x8 18.058 18.058
x2 34.2152 33.855
a 34.3102 33.94904
x6 -35.07229 -35.07229
x7 -0.7620911 -1.123251
x4 0 0
x1 0.0111308 0.0111308
x3 0.0111308 0.0111308
x28 1.5488e+08 1.5488e+08
b 1.251983 1.265302
x29 1.272828 1.286369
x30 1.247996 1.261273
x22 0.1368421 0.1489362
x17 0.16 0.16
x15 0.2 0.2
x16 0.47 0.47
x34 2.25 2.34
x32 1.395 1.365
x31 1.25 1.31
x33 1.175 1.25
x38 1 1
x42 0 0
x36 0 0
x40 0 0
x35 0 0
x39 0 0
x37 0 0
x41 0 0
99723 99725
date 2009-11-27 00:00:00 2009-12-11 00:00:00
text ACL ACL
date2 NaN NaN
x19 1.17 1.2
x18 1.22 1.39
x20 1.11 1.14
x9 1.76 1.76
x10 1.76 1.76
x11 1.76 1.76
x12 0.63 0.74
x13 0.36 0.36
x14 0.17 0.17
x21 285474 709374
x23 333678.1 868999.7
x24 0.0005489386 0.001393863
x25 0.002350057 0.002279827
x26 0.002160912 0.002111369
x27 0.002428953 0.002244943
x5 103.908 103.908
x8 103.908 103.908
x2 121.5721 124.6894
a 121.5724 124.6896
x6 92.16074 92.16074
x7 213.7331 216.8503
x4 0 0
x1 -0.008266928 -0.008266928
x3 -0.008266928 -0.008266928
x28 0.02743141 0.02703708
b 1.037747 1.011804
x29 1.421532 1.385994
x30 1.52714 1.488961
x22 1.213675 1.7
x17 0.47 0.47
x15 0.48 0.48
x16 0.48 0.48
x34 0.32 0.32
x32 1.04 1.04
x31 -0.6 -0.6
x33 -0.5901 -0.479
x38 0 0
x42 0 0
x36 0 0
x40 0 0
x35 0 0
x39 0 0
x37 0 0
x41 0 0
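The subset check and the "ONLY IN CORRECT" set described in the edit could be computed along these lines. This is a minimal sketch that assumes the index labels are unique and that both results come from the same table; compare_results and the small frames below are made up for illustration.

```python
import pandas as pd


def compare_results(correct, incorrect):
    """Return (is_subset, only_in_correct) for two query results (hypothetical helper).

    Assumes index labels are unique within each frame.
    """
    # True when every index label of the short result also appears in the full one.
    is_subset = bool(incorrect.index.isin(correct.index).all())
    # Rows present in the full result but missing from the short one.
    only_in_correct = correct.loc[correct.index.difference(incorrect.index)]
    return is_subset, only_in_correct


# Made-up stand-ins for the two results.
correct = pd.DataFrame(
    {"a": [34.0, 33.0, 121.5, 124.7], "b": [1.26, 1.30, 1.04, 1.01]},
    index=[10, 11, 20, 21],
)
incorrect = correct.loc[[11, 21]]

is_subset, only_in_correct = compare_results(correct, incorrect)
print(is_subset)
print(only_in_correct)
```

Inspecting only_in_correct for NaN or other unusual values in the queried columns is one way to look for what the missing rows have in common.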
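Answer 0 (score: 0)
SUCCESS! I filled in all the NaNs in the data, and read_hdf now returns the correct 30,000 rows. Column a had NaNs (it is one of the data_columns used in the query, a > 10). Man, that was painful. FYI: out of paranoia, and to rule out any scenario where this could recur, I did a fillna(0) on the entire table, because I cannot risk drawing conclusions from this analysis when a query against the table may be incorrect or incomplete. It was definitely a NaN problem.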
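A rough sketch of that workaround, checking the queried data columns for NaN and filling before the frame is appended to the store. The frame below is made up, and isna assumes a reasonably recent pandas; on a version as old as 0.14.1 the equivalent method is isnull.

```python
import numpy as np
import pandas as pd

# Columns used in the where clause (from the question's query).
query_cols = ["a", "b"]

# Made-up frame standing in for the real table.
df = pd.DataFrame({
    "a": [11.0, np.nan, 20.0],
    "b": [2.0, 3.0, np.nan],
    "x1": [np.nan, 1.0, 2.0],
})

# Count NaN per queried column before writing to the store.
nan_counts = df[query_cols].isna().sum()
print(nan_counts)

# Fill every NaN with 0, as described in the answer.
filled = df.fillna(0)
matches = filled[(filled["a"] > 10) & (filled["b"] > 1)]
print(len(matches))
```

Filling with 0 changes the stored values, so it is only safe when 0 cannot be mistaken for a real observation in later analysis.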