Speed up my data reading in Python?

Asked: 2016-11-21 22:26:34

Tags: python pandas

My current code is as follows:

import pandas as pd
import csv
import matplotlib.pyplot as plt

def data_reader(filename, rowname):
    with open(filename, newline='') as fp:
        yield from (row[1:] for row in csv.reader(fp, skipinitialspace=True)
            if row[0] == rowname)
File = 'data.csv'
ASA = pd.DataFrame.from_records(data_reader(File, 'ASA'))
GDS = pd.DataFrame.from_records(data_reader(File, 'GDS'))
SCD = pd.DataFrame.from_records(data_reader(File, 'SCD'))
ASF = pd.DataFrame.from_records(data_reader(File, 'ASF'))
ADC = pd.DataFrame.from_records(data_reader(File, 'ADC'))
DFS = pd.DataFrame.from_records(data_reader(File, 'DFS'))
DCS = pd.DataFrame.from_records(data_reader(File, 'DCS'))
DFDS = pd.DataFrame.from_records(data_reader(File, 'DFDS'))

It reads data that looks like this:

legend, useless data, useless data, DCS, useless data, sped, air, xds, sas, dac
legend, useless data, useless data, GDS, useless data, sped, air
Legend, useless data, useless data, ASA, useless data, sped, air, gnd 
ASA, 231, 123, 12
GDS, 12, 1
DCS, 13, 12, 123, 12, 4
ASA, 123, 132, 12
and so on for couple of millions....

I am trying to write an if statement that looks something like this:

pd.DataFrame.from_records(data_reader(
    if rowname = 'ASA'
        ASA.append(row)
    elif rowname = 'GDS'
        GDS.append(row)

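And so on. Would this be faster? Currently it takes about 1 minute to run my code and draw one plot. I expect it will take longer once I have about 10-15 plots to make. I have tried different ways of writing the if/elif statements, but had no luck.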

2 Answers:

Answer 0 (score: 0)

You should be able to do something like this:

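import pandas as pd
# No header row; names pads the ragged rows (10 = widest row in the sample data).
df = pd.read_csv('data.csv', header=None, names=list(range(10)), skipinitialspace=True)
ASA = df.loc[df[0] == "ASA"]  # .ix was removed in pandas 1.0; use .loc
# etc ...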
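To avoid repeating the filter for every row type, the split can also be done in one pass after a single read. The following is only a sketch under stated assumptions: the names `sample`, `data` and `frames` are illustrative, the inline text stands in for `data.csv`, and the width of 10 is taken from the widest row in the question's sample.

```python
import io
import pandas as pd

# Inline stand-in for data.csv, copied from the question's sample.
sample = io.StringIO(
    "legend, useless data, useless data, DCS, useless data, sped, air, xds, sas, dac\n"
    "legend, useless data, useless data, GDS, useless data, sped, air\n"
    "Legend, useless data, useless data, ASA, useless data, sped, air, gnd \n"
    "ASA, 231, 123, 12\n"
    "GDS, 12, 1\n"
    "DCS, 13, 12, 123, 12, 4\n"
    "ASA, 123, 132, 12\n"
)

# Assumption: no row has more than 10 fields; shorter rows are padded with NaN.
df = pd.read_csv(sample, header=None, names=list(range(10)), skipinitialspace=True)

# Drop the legend rows, then split the remaining rows by their first field.
data = df[df[0].str.upper() != "LEGEND"]
frames = {
    key: grp.dropna(axis=1, how="all").iloc[:, 1:]  # drop padding and the key column
    for key, grp in data.groupby(data[0])
}

print(frames["ASA"].astype(int))
```

Because the legend rows put text into the same columns as the numbers, the values come back as strings, hence the `astype(int)` before use.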

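Answer 1 (score: 0)

Reading from disk is the bottleneck here, so we should try to avoid reading the file more than once. If you have enough memory to parse the whole CSV into a dict of lists, then you could use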

import csv
import collections
import pandas as pd

def data_reader(filename):
    dfs = collections.defaultdict(list)
    columns = dict()
    with open(filename, newline='') as fp:
        for row in csv.reader(fp, skipinitialspace=True):
            key = row[0].upper()
            if key == 'LEGEND':
                name = row[3]
                columns[name] = row
            else:
                dfs[key].append(row[1:])

    for key in dfs:
        num_cols = max(len(row) for row in dfs[key])
        dfs[key] = pd.DataFrame(dfs[key], columns=columns[key][-num_cols:])
    return dfs

filename = 'data.csv'
dfs = data_reader(filename)

for key in dfs:
    print(dfs[key])

The loop

for row in csv.reader(fp, skipinitialspace=True):
    key = row[0].upper()
    ...
    dfs[key].append(row[1:])

loads the CSV into the dict dfs. The dict keys are strings such as 'ASA', 'GDS' and 'DCS'. The dict values are lists of lists.
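As a small illustration of that grouping step, here is a sketch that runs the same pattern on a few data rows from the question. The names `sample` and `groups` are illustrative and not part of the answer's code, and the legend rows are left out for brevity.

```python
import collections
import csv
import io

# A few data rows from the question, used in place of the real file.
sample = io.StringIO(
    "ASA, 231, 123, 12\n"
    "GDS, 12, 1\n"
    "DCS, 13, 12, 123, 12, 4\n"
    "ASA, 123, 132, 12\n"
)

groups = collections.defaultdict(list)
for row in csv.reader(sample, skipinitialspace=True):
    # row[0] is the row type; the remaining fields are the values (still strings).
    groups[row[0].upper()].append(row[1:])

print(dict(groups))
```

The file is read once, and each row costs one dict lookup plus one list append, regardless of how many row types there are.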

The other loop

for key in dfs:
    ...
    dfs[key] = pd.DataFrame(dfs[key], columns=columns[key][-num_cols:])

converts the lists of lists into DataFrames.

The if-statement

if key == 'LEGEND':
    name = row[3]
    columns[name] = row
else:
    dfs[key].append(row[1:])

records the row in the columns dict if the row begins with LEGEND (in any capitalization); otherwise it records the row in the dfs dict.

Later, in the for-loop

for key in dfs:
    num_cols = max(len(row) for row in dfs[key])
    dfs[key] = pd.DataFrame(dfs[key], columns=columns[key][-num_cols:])

the keys are strings such as 'ASA'. For each key, the number of columns is obtained by finding the maximum length of the rows in dfs[key].

columns[key] returns the legend row that corresponds to key. columns[key][-num_cols:] returns the last num_cols values of that row.
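The slicing can be checked by hand on the 'ASA' legend row. This is a sketch with illustrative names (`legend`, `rows`, `header`); the values are what csv.reader would produce for the question's sample.

```python
# Legend row for 'ASA' as parsed by csv.reader with skipinitialspace=True.
# Only leading spaces are stripped, so 'gnd ' keeps its trailing space.
legend = ['Legend', 'useless data', 'useless data', 'ASA',
          'useless data', 'sped', 'air', 'gnd ']
# The 'ASA' data rows with the leading key already removed.
rows = [['231', '123', '12'], ['123', '132', '12']]

num_cols = max(len(row) for row in rows)  # widest data row
header = legend[-num_cols:]               # last num_cols legend entries

print(num_cols, header)
```

Note the position of the colon: legend[:-num_cols] would instead return everything except the last num_cols entries.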

The result returned by data_reader is a dict of DataFrames:

In [211]: dfs['ASA']
Out[211]: 
  sped  air gnd 
0  231  123   12
1  123  132   12

In [212]: dfs['GDS']
Out[212]: 
  sped air
0   12   1

In [213]: dfs['DCS']
Out[213]: 
  sped air  xds sas dac
0   13  12  123  12   4
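One caveat before plotting: csv.reader yields strings, so the cells in these DataFrames are strings rather than numbers, and a column name such as 'gnd ' keeps its trailing space. A possible cleanup step, sketched here on a frame built by hand to look like dfs['ASA'] (the name `asa` is illustrative):

```python
import pandas as pd

# A frame shaped like dfs['ASA']: every cell is a string.
asa = pd.DataFrame(
    [['231', '123', '12'], ['123', '132', '12']],
    columns=['sped', 'air', 'gnd '],
)

asa.columns = asa.columns.str.strip()  # 'gnd ' -> 'gnd'
asa = asa.apply(pd.to_numeric)         # convert each column to a numeric dtype

print(asa.dtypes)
```

The same two lines could be applied to each DataFrame in the dict before plotting.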