How do I store large columnar text + numeric data in Python?

Time: 2016-11-11 09:41:43

Tags: pandas pickle blaze

To save the data on disk without building a columnar database, do the following:

public class DigitCounter {

    // Returns the number of decimal digits in num (the sign is ignored).
    public static int countDigits(int num) {
        // Widen to long first so Math.abs cannot overflow on Integer.MIN_VALUE.
        String numF = String.valueOf(Math.abs((long) num));
        int count = 0;
        // Every character of the string form is one digit.
        for (int i = 0; i < numF.length(); i++) {
            if (Character.isDigit(numF.charAt(i))) {
                count++;
            }
        }
        return count;
    }
}

Just wondering which one is the most efficient in terms of speed? Thanks.

1 answer:

Answer 0: (score: 1)

I would consider Feather or HDF5. MySQL or PostgreSQL may also be an option, depending on how you are going to query the data...
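Which format is fastest tends to depend on the data and the access pattern, so it may be worth timing the candidates on a sample of your own data. Below is a minimal sketch that times one write and one read with pickle (one of the question's tags). The helper `time_round_trip`, the file name, and the shape of the sample DataFrame are my own assumptions for illustration, not something from the question.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def time_round_trip(write, read):
    """Hypothetical helper: time one write and one read.

    Returns (write_seconds, read_seconds, object_returned_by_read).
    """
    start = time.perf_counter()
    write()
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    result = read()
    read_s = time.perf_counter() - start
    return write_s, read_s, result


# Assumed sample data: three integer columns plus one long text column.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10**6, (10**4, 3)), columns=list("abc"))
df["txt"] = "X" * 300

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.pkl")
    write_s, read_s, loaded = time_round_trip(
        lambda: df.to_pickle(path),
        lambda: pd.read_pickle(path),
    )

print(f"pickle: write {write_s:.4f}s, read {read_s:.4f}s")
```

The same helper can be pointed at `df.to_feather` / `pd.read_feather` (which needs pyarrow installed) or `df.to_hdf` / `pd.read_hdf` (which needs PyTables). A single run is noisy, so repeat the measurement a few times before drawing conclusions.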

Here is a demo for HDF5:

In [33]: df = pd.DataFrame(np.random.randint(0, 10**6, (10**4, 3)), columns=list('abc'))

In [34]: df['txt'] = 'X' * 300

In [35]: df
Out[35]:
           a       b       c                                                txt
0     689347  129498  770470  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
1     954132   97912  783288  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
2      40548  938326  861212  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
3     869895   39293  242473  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
4     938918  487643  362942  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
...

In [37]: df.to_hdf('c:/temp/test_str.h5', 'test', format='t', data_columns=['a', 'c'])

In [38]: store = pd.HDFStore('c:/temp/test_str.h5')

In [39]: store.get_storer('test').table
Out[39]:
/test/table (Table(10000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Int32Col(shape=(1,), dflt=0, pos=1),
  "values_block_1": StringCol(itemsize=300, shape=(1,), dflt=b'', pos=2),  # <---- NOTE
  "a": Int32Col(shape=(), dflt=0, pos=3),
  "c": Int32Col(shape=(), dflt=0, pos=4)}
  byteorder := 'little'
  chunkshape := (204,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "a": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "c": Index(6, medium, shuffle, zlib(1)).is_csi=False}
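Because `a` and `c` were declared as `data_columns` in the demo above, they are stored as separate indexed columns and can be filtered on disk at read time. The following is a rough sketch of such a query, assuming PyTables is installed. The temporary file path and the threshold values are my own choices for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Rebuild data shaped like the demo: three integer columns and a 300-character text column.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10**6, (10**4, 3)), columns=list("abc"))
df["txt"] = "X" * 300

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test_str.h5")  # assumed location, not the demo's path

    # format="table" makes the dataset queryable; data_columns lists the columns
    # that can appear in a `where` clause.
    df.to_hdf(path, key="test", format="table", data_columns=["a", "c"])

    # Only the matching rows are loaded into memory.
    result = pd.read_hdf(path, key="test", where="a > 500000 & c < 500000")

# The same filter applied in memory, for comparison.
expected = df[(df["a"] > 500000) & (df["c"] < 500000)]
print(len(result), len(expected))
```

Column `b` is not a data column here, so it cannot be used in `where`; add it to `data_columns` if you need to filter on it. Also note the `StringCol(itemsize=300)` in the table description: the text is stored at a fixed width, so each of the 10,000 rows reserves 300 bytes for `txt` (about 3 MB before compression).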