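I want to read a CSV file with 1000 rows, so I decided to read it in chunks, but I am running into a problem while reading it.
On the first iteration I want to read the first 10 records and convert specific columns into a Python dictionary; on the second iteration I want to skip those first 10 records and read the next 10, and so on.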
Input.csv -
time,line_id,high,low,avg,total,split_counts
1468332421098000,206,50879,50879,50879,2,"[50000,2]"
1468332421195000,206,39556,39556,39556,2,"[30000,2]"
1468332421383000,206,61636,61636,61636,2,"[60000,2]"
1468332423568000,206,47315,38931,43123,4,"[30000,2][40000,2]"
1468332423489000,206,38514,38445,38475,6,"[30000,6]"
1468332421672000,206,60079,60079,60079,2,"[60000,2]"
1468332421818000,206,44664,44664,44664,2,"[40000,2]"
1468332422164000,206,48500,48500,48500,2,"[40000,2]"
1468332423490000,206,39469,37894,38206,12,"[30000,12]"
1468332422538000,206,44023,44023,44023,2,"[40000,2]"
1468332423491000,206,38813,38813,38813,2,"[30000,2]"
1468332423528000,206,75970,75970,75970,2,"[70000,2]"
1468332423533000,206,42546,42470,42508,4,"[40000,4]"
1468332423536000,206,41065,40888,40976,4,"[40000,4]"
1468332423566000,206,66401,62453,64549,6,"[60000,6]"
Program code -
import pandas

if __name__ == '__main__':
    s = 0
    while True:
        n = 10
        df = pandas.read_csv('Input.csv', skiprows=s, nrows=n)
        d = dict(zip(df.time, df.split_counts))
        print(d)
        s += n
I am getting this error -
AttributeError: 'DataFrame' object has no attribute 'time'
I understand that on the second iteration it cannot recognize the time and split_counts attributes, but is there a way to do what I want?
Answer 0: (score: 1)
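The first iteration should work fine, but any further iteration is problematic.
read_csv has a header kwarg whose default value is 'infer' (which basically means 0). This means the first row of the parsed CSV is used as the column names of the DataFrame. Once skiprows is greater than 0, the real header line is skipped and a data row ends up being used as the column names instead, which is why df.time no longer exists.
read_csv also has another kwarg, names.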
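header : int or list of ints, default 'infer'. Row number(s) to use as the column names, and the start of the data. Default behavior is as if set to 0 if no names passed, otherwise None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns, e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
names : array-like, default None. List of column names to use. If the file contains no header row, then you should explicitly pass header=None.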
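You should pass header=None and names=['time', 'line_id', 'high', 'low', 'avg', 'total', 'split_counts'] to read_csv. Note that with header=None the file's own header line is read as data, so it has to be skipped as well (for example by starting skiprows at 1 instead of 0).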
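A minimal sketch of this approach, under a few assumptions: the helper name read_block, the COLUMNS list, and the shortened in-memory SAMPLE are introduced here purely for illustration and stand in for Input.csv. The sketch passes skiprows=start + 1 so that the header line is never parsed as a data row, and it stops once a block comes back short or empty.

```python
import io
import pandas as pd

# Column names taken from the header line of Input.csv in the question.
COLUMNS = ['time', 'line_id', 'high', 'low', 'avg', 'total', 'split_counts']

# Shortened stand-in for Input.csv (illustration only).
SAMPLE = u'''time,line_id,high,low,avg,total,split_counts
1468332421098000,206,50879,50879,50879,2,"[50000,2]"
1468332421195000,206,39556,39556,39556,2,"[30000,2]"
1468332421383000,206,61636,61636,61636,2,"[60000,2]"
1468332423568000,206,47315,38931,43123,4,"[30000,2][40000,2]"
1468332423489000,206,38514,38445,38475,6,"[30000,6]"
'''


def read_block(source, start, n):
    """Hypothetical helper: read n data rows starting at data row `start` (0-based)."""
    # start + 1 also skips the header line, since header=None treats every line as data.
    return pd.read_csv(source, header=None, names=COLUMNS,
                       skiprows=start + 1, nrows=n)


blocks = []
start, n = 0, 2  # n=2 only to keep the demo small; the question uses 10
while True:
    try:
        df = read_block(io.StringIO(SAMPLE), start, n)
    except pd.errors.EmptyDataError:
        break  # nothing left to parse
    if df.empty:
        break
    blocks.append(dict(zip(df['time'], df['split_counts'])))
    if len(df) < n:
        break  # a short block means the end of the file was reached
    start += n

for block in blocks:
    print(block)
```

With a real file, io.StringIO(SAMPLE) would be replaced by the file name. Each call reopens and rescans the file from the top, so for large inputs a single pass over the file is likely to be cheaper.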
Answer 1: (score: 1)
You can use chunksize in read_csv:
import pandas as pd
import io
temp=u'''time,line_id,high,low,avg,total,split_counts
1468332421098000,206,50879,50879,50879,2,"[50000,2]"
1468332421195000,206,39556,39556,39556,2,"[30000,2]"
1468332421383000,206,61636,61636,61636,2,"[60000,2]"
1468332423568000,206,47315,38931,43123,4,"[30000,2][40000,2]"
1468332423489000,206,38514,38445,38475,6,"[30000,6]"
1468332421672000,206,60079,60079,60079,2,"[60000,2]"
1468332421818000,206,44664,44664,44664,2,"[40000,2]"
1468332422164000,206,48500,48500,48500,2,"[40000,2]"
1468332423490000,206,39469,37894,38206,12,"[30000,12]"
1468332422538000,206,44023,44023,44023,2,"[40000,2]"
1468332423491000,206,38813,38813,38813,2,"[30000,2]"
1468332423528000,206,75970,75970,75970,2,"[70000,2]"
1468332423533000,206,42546,42470,42508,4,"[40000,4]"
1468332423536000,206,41065,40888,40976,4,"[40000,4]"
1468332423566000,206,66401,62453,64549,6,"[60000,6]"'''
# after testing, replace io.StringIO(temp) with the filename
# chunksize=2 is used here only for testing
reader = pd.read_csv(io.StringIO(temp), chunksize=2)
print (reader)
<pandas.io.parsers.TextFileReader object at 0x000000000AD1CD68>
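The TextFileReader shown above is an iterator that yields one DataFrame per chunk, each with the column names from the header line. One possible way to continue, sketched here under assumptions, is to loop over it and build a dictionary per chunk. The names sample and chunk_dicts are introduced only for this illustration, and a shortened in-memory sample stands in for Input.csv.

```python
import io
import pandas as pd

# Shortened stand-in for Input.csv (illustration only).
sample = u'''time,line_id,high,low,avg,total,split_counts
1468332421098000,206,50879,50879,50879,2,"[50000,2]"
1468332421195000,206,39556,39556,39556,2,"[30000,2]"
1468332421383000,206,61636,61636,61636,2,"[60000,2]"
1468332423568000,206,47315,38931,43123,4,"[30000,2][40000,2]"
1468332423489000,206,38514,38445,38475,6,"[30000,6]"
'''

# chunksize=2 keeps the demo small; use 10 for the case in the question.
reader = pd.read_csv(io.StringIO(sample), chunksize=2)

chunk_dicts = []
for chunk in reader:
    # Every chunk keeps the header's column names, so both columns are available.
    chunk_dicts.append(dict(zip(chunk['time'], chunk['split_counts'])))

for d in chunk_dicts:
    print(d)
```

Because the file is read in a single pass, there is no need to track skiprows manually, and the last chunk simply contains whatever rows remain.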