Question

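I have a large file (2GB) of categorical data (mostly "NaN", but populated here and there with actual values) that is too large to read into a single data frame. I had a rather difficult time coming up with an object to store all the unique values for each column (which is my goal; eventually I need to factorize this for modeling).

What I ended up doing was reading the file in chunks into a data frame, then getting the unique values of each column and storing them in a list of lists. My solution works, but it seems most un-Pythonic. Is there a cleaner way to accomplish this in Python (version 3.5)? I do know the number of columns (~2100).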

import pandas as pd
#large file of csv separated text data
data=pd.read_csv("./myratherlargefile.csv",chunksize=100000, dtype=str)

collist=[]
master=[]
i=0
initialize=0
for chunk in data:
    #so the first time through I have to make the "master" list
    if initialize==0:
        for col in chunk:
            #thinking about this, i should have just dropped this col
            if col=='Id':
                continue
            else:
                #use pd.unique as a built-in solution to get unique values
                collist=chunk[col][chunk[col].notnull()].unique().tolist()
                master.append(collist)
                i=i+1
    #but after first loop just append to the master-list at
    #each master-list element
    if initialize==1:
        for col in chunk:
            if col=='Id':
                continue
            else:
                collist=chunk[col][chunk[col].notnull()].unique().tolist()
                #add this chunk's unique values to the column's running list once
                master[i]=master[i]+collist
                i=i+1
    initialize=1
    i=0

After that, my final task on all the unique values is as follows:

i=0
names=chunk.columns.tolist()
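for item in master:
    #deduplicate, then put the column name in front of the unique values
    #(names[i+1] assumes 'Id' is the first column, which was skipped above)
    master[i]=[names[i+1]]+list(set(item))
    i=i+1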

So master[i] gives me the column name followed by a list of unique values. Crude, but it does work. My main concern is doing it in a "better" way if possible.

Answer 1

I would suggest using a collections.defaultdict(set) instead of a list of lists.

Say you start with

import collections
uniques = collections.defaultdict(set)

Now the loop can become something like this:

for chunk in data:
    for col in chunk:
        uniques[col] = uniques[col].union(chunk[col].unique())

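Note that:

defaultdict always has a set for uniques[col] (that is what it is there for), so you can skip initialize and the related bookkeeping.

For a given col, you simply update the entry with the union of the current set (which is initially empty, but that does not matter) and the new unique elements.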
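To make the two notes above concrete, here is a minimal sketch of the chunked loop on a tiny in-memory CSV. The sample data, the collect_uniques helper, and its skip parameter are assumptions made up for illustration; the dropna() call is also an addition, mirroring the notnull() filter in the question, because chunk[col].unique() on its own would keep NaN as a value.

```python
import collections
import io

import pandas as pd

# Hypothetical stand-in for the large CSV file; empty fields are read as NaN.
csv_text = "Id,color,size\n1,red,\n2,,L\n3,red,S\n4,blue,\n"


def collect_uniques(chunks, skip=("Id",)):
    """Return {column name: set of non-null values} gathered over all chunks."""
    uniques = collections.defaultdict(set)
    for chunk in chunks:
        for col in chunk:
            if col in skip:
                continue
            # The first access to uniques[col] creates an empty set,
            # so no "first chunk" special case is needed.
            uniques[col] = uniques[col].union(chunk[col].dropna().unique())
    return uniques


# chunksize=2 forces several chunks even for this tiny sample.
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2, dtype=str)
uniques = collect_uniques(chunks)
print({col: sorted(vals) for col, vals in uniques.items()})
# → {'color': ['blue', 'red'], 'size': ['L', 'S']}
```

Because the dictionary is keyed by column name, there is no need to keep a separate index counter or to attach the column names afterwards.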

Edit

As Raymond Hettinger notes (thanks!), it is better to use

uniques[col].update(chunk[col].unique())
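As a rough illustration of why update is preferable here, the sketch below compares it with union on plain Python sets. The batches list is a made-up stand-in for the per-chunk unique values. set.update mutates the existing set in place, while set.union builds a new set on every call, which means repeated copying as the set grows.

```python
import collections

uniques = collections.defaultdict(set)

# Hypothetical per-chunk unique values for a single column.
batches = [["a", "b"], ["b", "c"], ["c", "d"]]

for batch in batches:
    before = uniques["col"]
    uniques["col"].update(batch)     # adds the new items to the existing set
    assert uniques["col"] is before  # still the same object, nothing was copied

via_union = set()
for batch in batches:
    via_union = via_union.union(batch)  # creates a new set on each iteration

print(sorted(uniques["col"]))
# → ['a', 'b', 'c', 'd']
```

Both loops end with the same contents; the difference is only in how much intermediate copying happens along the way.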
