Question

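In my work I regularly receive large CSV files with no information about their contents or format. I am trying to develop a workflow that automatically infers each column's data type, plus the maximum string length for columns with object dtype, with the end goal of storing the formatted dataset in an HDFStore. I'm looking for help coming up with best practices for this situation. I have something that works, but it seems inefficient:

The data for this example can be found here: http://www.kaggle.com/c/loan-default-prediction/data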

from functools import reduce  # reduce is no longer a builtin in Python 3

import pandas as pd

# first pass to determine file formats using pd.read_csv inference
fmts = []
chunker = pd.read_csv('../data/train.csv', chunksize=10000)

for chunk in chunker:
    fmts.append(chunk.dtypes)

fmts = reduce(lambda x,y: x.combine(y, max), fmts)

The code above accumulates the inferred dtypes of each chunk, then reduces them by taking the maximum:

In[1]:fmts[:10]
Out[1]: 
id      int64
f1      int64
f2      int64
f3    float64
f4      int64
f5      int64
f6      int64
f7    float64
f8    float64
f9    float64
dtype: object
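To make the reduction step concrete, here is a minimal sketch of how per-chunk dtypes can be widened into one dtype per column. It is not the code above: it swaps `max` for `np.promote_types` and uses a plain dict instead of `Series.combine`. The in-memory CSV and the column names `a` and `b` are made up for illustration, and the sketch assumes every inferred dtype is a NumPy dtype (pandas extension dtypes would need separate handling).

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data: column "a" looks like integers in the first chunk
# and like floats in the second; column "b" is integer throughout.
csv_text = "a,b\n1,10\n2,20\n3.5,30\n4,40\n"

combined = {}
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    for col, dt in chunk.dtypes.items():
        # promote_types returns the smallest dtype both inputs can be cast to
        combined[col] = np.promote_types(combined.get(col, dt), dt)

print(combined)
```

Column `a` ends up as float64 because one chunk contained a fractional value, while `b` stays int64. The resulting dict can be passed as `dtype=` to a later `read_csv` call.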

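So the first step is done. I have a set of dtypes that can be passed to read_csv on subsequent runs. Next, I find the maximum string length of each object column, which the HDFStore needs: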

# second pass now get max lengths of objects
objs = fmts[fmts == 'object'].index
cnvt = {obj : str for obj in objs}
lens = []

chunker = pd.read_csv('../data/train.csv', chunksize=10000,
                      converters=cnvt, usecols=objs)
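for chunk in chunker:
    # one Series per chunk: the longest string in each column
    # (chunk.apply already visits every column, so no inner loop is needed)
    lens.append(chunk.apply(lambda x: max(x.apply(len))))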

# reduce the lens into one
lens = dict(reduce(lambda x,y: x.combine(y, max), lens))

I now have a dictionary whose keys are the object-dtype columns and whose values are the maximum cell length across all chunks:

In[2]:lens
Out[2]: 
{'f137': 20,
 'f138': 26,
 'f206': 20,
 'f207': 27,
 'f276': 20,
 'f277': 27,
 'f338': 26,
 'f390': 32,
 'f391': 42,
 'f419': 20,
 'f420': 26,
 'f466': 19,
 'f469': 27,
 'f472': 35,
 'f534': 27,
 'f537': 35,
 'f626': 32,
 'f627': 42,
 'f695': 22,
 'f698': 22}
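As a rough sketch of the same length scan on a small scale, the snippet below tracks a running maximum per column with the vectorized `.str.len()` accessor rather than a Python-level `len` per cell. The sample data and column names are hypothetical. Reading with `dtype=str` and `keep_default_na=False` is an assumption made here so that every cell, including empty ones, arrives as a string.

```python
import io

import pandas as pd

# Hypothetical data with two text columns.
csv_text = "name,city\nann,Oslo\nbartholomew,Rio\ncy,Copenhagen\n"

max_lens = {}
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2,
                     dtype=str, keep_default_na=False)
for chunk in reader:
    # longest string in each column of this chunk
    chunk_lens = chunk.apply(lambda col: col.str.len().max())
    for col, n in chunk_lens.items():
        # keep the largest length seen so far for each column
        max_lens[col] = max(max_lens.get(col, 0), int(n))

print(max_lens)
```

These are character counts. If the data contains non-ASCII text, the encoded byte length can be larger than the character count, so it may be worth leaving some headroom when using the result as `min_itemsize`.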

My last step is to store everything in an HDFStore table using the inferred formats and lengths:

# Lastly loop through once more to append to an HDFStore table!
store = pd.HDFStore("../data/train.h5")

chunker = pd.read_csv('../data/train.csv', chunksize=10000, dtype=dict(fmts))
for chunk in chunker:
    store.append('train', chunk, min_itemsize=lens)

# close the file so everything is flushed to disk
store.close()

Does this workflow make sense? How do other people handle large datasets that don't fit in memory and need to be stored on disk in an HDFStore?

Iteratively and automatically inferring dtypes and min_itemsize on a large dataset

0 answers: