Pandas根据标题选择列和数据

时间:2016-08-25 16:08:11

标签: python python-2.7 csv pandas columnheader

我有一个很大的.csv文件。我想只选择带有时间/日期的列和我通过标题知道的20个其他列。

作为测试,我尝试仅使用标题为“TIMESTAMP”的列,我知道这是 .csv中长4207823行,它只包含日期和时间。下面的代码选择TIMESTAMP列,但也继续从其他列获取值,如下所示:

import csv
import numpy as np
import pandas

low_memory=False
f = pandas.read_csv('C:\Users\mmso2\Google Drive\MABL Wind\_Semester 2 2016\Wind Farm Info\DataB\DataB - NaN2.csv', dtype = object)#convert file to variable so it can be edited

time = f[['TIMESTAMP']]
time = time[0:4207823]#test to see if this stops time taking other data
print time

输出

                  TIMESTAMP
0       2007-08-15 21:10:00
1       2007-08-15 21:20:00
2       2007-08-15 21:30:00
3       2007-08-15 21:40:00
4       2007-08-15 21:50:00
5       2007-08-15 22:00:00
6       2007-08-15 22:10:00
7       2007-08-15 22:20:00
8       2007-08-15 22:30:00
9       2007-08-15 22:40:00
10      2007-08-15 22:50:00
11      2007-08-15 23:00:00
12      2007-08-15 23:10:00
13      2007-08-15 23:20:00
14      2007-08-15 23:30:00
15      2007-08-15 23:40:00
16      2007-08-15 23:50:00
17      2007-08-16 00:00:00
18      2007-08-16 00:10:00
19      2007-08-16 00:20:00
20      2007-08-16 00:30:00
21      2007-08-16 00:40:00
22      2007-08-16 00:50:00
23      2007-08-16 01:00:00
24      2007-08-16 01:10:00
25      2007-08-16 01:20:00
26      2007-08-16 01:30:00
27      2007-08-16 01:40:00
28      2007-08-16 01:50:00
29      2007-08-16 02:00:00 #these are from the TIMESTAMP column
...                     ...
679302              221.484 #This is from another column 
679303                  NaN
679304  2015-09-23 06:40:00
679305                  NaN
679306                  NaN
679307  2015-09-23 06:50:00
679308                  NaN
679309                  NaN
679310  2015-09-23 07:00:00

1 个答案:

答案 0 :(得分:0)

问题是由于输入文件中的错误导致usecolspandas.read_csv的使用变得简单。

下面的

代码演示了几列数据的选择

import csv
import pandas

low_memory=False



    #read only the selected columns
    df = pandas.read_csv('DataB - Copy - Copy.csv',delimiter=',', dtype = object,
    usecols=['TIMESTAMP', 'igmmx_U_77m', 'igmmx_U_58m', ])
    print df # see what the data looks like
    outfile = open('DataB_GreaterGabbardOnly.csv','wb')#somewhere to write the data to
    df.to_csv(outfile)#save selection to the blank .csv created above