My current code is as follows:

    import pandas as pd
    import csv
    import matplotlib.pyplot as plt

    def data_reader(filename, rowname):
        with open(filename, newline='') as fp:
            yield from (row[1:] for row in csv.reader(fp, skipinitialspace=True)
                        if row[0] == rowname)

    File = 'data.csv'
    ASA = pd.DataFrame.from_records(data_reader(File, 'ASA'))
    GDS = pd.DataFrame.from_records(data_reader(File, 'GDS'))
    SCD = pd.DataFrame.from_records(data_reader(File, 'SCD'))
    ASF = pd.DataFrame.from_records(data_reader(File, 'ASF'))
    ADC = pd.DataFrame.from_records(data_reader(File, 'ADC'))
    DFS = pd.DataFrame.from_records(data_reader(File, 'DFS'))
    DCS = pd.DataFrame.from_records(data_reader(File, 'DCS'))
    DFDS = pd.DataFrame.from_records(data_reader(File, 'DFDS'))

It reads data that looks like this:

    legend, useless data, useless data, DCS, useless data, sped, air, xds, sas, dac
    legend, useless data, useless data, GDS, useless data, sped, air
    Legend, useless data, useless data, ASA, useless data, sped, air, gnd
    ASA, 231, 123, 12
    GDS, 12, 1
    DCS, 13, 12, 123, 12, 4
    ASA, 123, 132, 12
    and so on for a couple of million rows...

I am trying to write an if statement that looks something like this:

    pd.DataFrame.from_records(data_reader(
        if rowname = 'ASA'
            ASA.append(row)
        elif rowname = 'GDS'
            GDS.append(row)

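And so on. Would this be faster? Currently it takes about 1 minute to run my code and draw a single plot, and I expect it to take much longer once I have about 10-15 plots to make. I have tried different ways of writing the if/elif statements, but I have had no luck.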
Answer 0: (score: 0)
You should be able to do something like this:
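
    # header=None so that the first column is labelled 0
    df = pd.read_csv('data.csv', header=None, skipinitialspace=True)
    # .ix was removed in pandas 1.0; .loc performs the same boolean selection
    ASA = df.loc[df[0] == "ASA"]
    # etc ...
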
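A minimal sketch of how this could look on the sample data from the question, splitting the rows by their first field with a single groupby instead of one boolean mask per name. The inline sample, the `names=list(range(10))` column count (taken from the widest sample row) and the `groups` dict are assumptions for illustration, not part of the original answer:

```python
import io

import pandas as pd

# Inline copy of the sample rows from the question (stand-in for data.csv).
sample = """\
legend, useless data, useless data, DCS, useless data, sped, air, xds, sas, dac
legend, useless data, useless data, GDS, useless data, sped, air
Legend, useless data, useless data, ASA, useless data, sped, air, gnd
ASA, 231, 123, 12
GDS, 12, 1
DCS, 13, 12, 123, 12, 4
ASA, 123, 132, 12
"""

# Assumption: no row has more than 10 fields, so 10 integer column labels suffice.
df = pd.read_csv(io.StringIO(sample), header=None,
                 names=list(range(10)), skipinitialspace=True)

# Drop the legend rows, whatever their capitalization.
data = df[df[0].str.lower() != 'legend']

# One pass over the frame: one sub-frame per distinct value in column 0,
# without the key column and without the all-empty padding columns.
groups = {key: group.drop(columns=0).dropna(axis=1, how='all')
          for key, group in data.groupby(0)}

print(groups['ASA'])
```

Because the legend rows share columns with the data, every value in this sketch is still a string; it only illustrates the grouping step.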
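Answer 1: (score: 0)
Reading from disk is the bottleneck here, so we should try to avoid reading the file more than once. If you have enough memory to parse the entire CSV into a dict of lists, then you could use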

    import csv
    import collections
    import pandas as pd

    def data_reader(filename):
        dfs = collections.defaultdict(list)
        columns = dict()
        with open(filename, newline='') as fp:
            for row in csv.reader(fp, skipinitialspace=True):
                key = row[0].upper()
                if key == 'LEGEND':
                    name = row[3]
                    columns[name] = row
                else:
                    dfs[key].append(row[1:])
        for key in dfs:
            num_cols = max(len(row) for row in dfs[key])
            dfs[key] = pd.DataFrame(dfs[key], columns=columns[key][-num_cols:])
        return dfs

    filename = 'data.csv'
    dfs = data_reader(filename)
    for key in dfs:
        print(dfs[key])

The loop

    for row in csv.reader(fp, skipinitialspace=True):
        key = row[0].upper()
        ...
        dfs[key].append(row[1:])

loads the CSV into the dict dfs. The dict keys are strings such as 'ASA', 'GDS' and 'DCS'. The dict values are lists of lists.
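To make the shape of that intermediate dict concrete, here is a small sketch that uses only the standard library. The inline `sample` text and the name `groups` are stand-ins chosen for illustration; the legend rows are left out so that only the grouping step is shown:

```python
import collections
import csv
import io

# A few data rows from the question, held in memory instead of data.csv.
sample = io.StringIO(
    "ASA, 231, 123, 12\n"
    "GDS, 12, 1\n"
    "DCS, 13, 12, 123, 12, 4\n"
    "ASA, 123, 132, 12\n"
)

# defaultdict(list) creates an empty list the first time a key is seen,
# so no if/elif chain per row name is needed.
groups = collections.defaultdict(list)
for row in csv.reader(sample, skipinitialspace=True):
    key = row[0].upper()
    groups[key].append(row[1:])

for key, rows in groups.items():
    print(key, rows)
```

The file is traversed once, and each row is appended to the list that belongs to its first field.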
The other loop

    for key in dfs:
        ...
        dfs[key] = pd.DataFrame(dfs[key], columns=columns[key][-num_cols:])

converts the lists of lists into DataFrames.
The if-statement:

    if key == 'LEGEND':
        name = row[3]
        columns[name] = row
    else:
        dfs[key].append(row[1:])

records the row in the columns dict if the row begins with LEGEND (with or without capitalization); otherwise it records the row in the dfs dict.
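Later, in the for-loop:

    for key in dfs:
        num_cols = max(len(row) for row in dfs[key])
        dfs[key] = pd.DataFrame(dfs[key], columns=columns[key][-num_cols:])

key is a string such as 'ASA'. For each key, the number of columns is obtained by finding the maximum length of the rows in dfs[key]. columns[key] returns the corresponding legend row for key. columns[key][-num_cols:] returns the last num_cols values from that row.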
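The slicing step can be checked in isolation with plain lists. In this sketch `legend_row` and `rows` are hypothetical stand-ins for columns['ASA'] and dfs['ASA'], copied from the sample data in the question:

```python
# Stand-in for columns['ASA']: the full legend row, including the leading fields.
legend_row = ['Legend', 'useless data', 'useless data', 'ASA',
              'useless data', 'sped', 'air', 'gnd']

# Stand-in for dfs['ASA']: the data rows with the 'ASA' key already removed.
rows = [['231', '123', '12'], ['123', '132', '12']]

# The widest data row decides how many column names are needed.
num_cols = max(len(row) for row in rows)

# A negative start index counts from the end, so this keeps the last num_cols names.
header = legend_row[-num_cols:]
print(header)
```

This relies on the column names always being the trailing fields of the legend row, as they are in the sample data.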
The result returned by data_reader is a dict of DataFrames:

    In [211]: dfs['ASA']
    Out[211]:
      sped  air gnd
    0  231  123  12
    1  123  132  12

    In [212]: dfs['GDS']
    Out[212]:
      sped air
    0   12   1

    In [213]: dfs['DCS']
    Out[213]:
      sped air  xds sas dac
    0   13  12  123  12   4
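
One caveat for plotting: csv.reader yields every field as a string, so the DataFrames built this way hold text rather than numbers. A possible follow-up step, sketched here with a hypothetical `asa` frame standing in for dfs['ASA'], is to convert the columns with pd.to_numeric before plotting:

```python
import pandas as pd

# Hypothetical stand-in for dfs['ASA']: values are strings, as csv.reader produces them.
asa = pd.DataFrame([['231', '123', '12'], ['123', '132', '12']],
                   columns=['sped', 'air', 'gnd'])

# Apply pd.to_numeric column by column; it raises if a value cannot be parsed.
asa_numeric = asa.apply(pd.to_numeric)

print(asa_numeric.dtypes)
```

Whether this conversion is needed depends on how the frames are used afterwards; it is an assumption about the plotting step, which the answer does not show.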