I have a file.csv
with ~15k rows that looks like this:
SAMPLE_TIME, POS, OFF, HISTOGRAM
2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,
2015-07-15 16:42:55, 0-0-0-0-3, 1, 0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,
2015-07-15 16:43:55, 0-0-0-0-3, 1, 0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,
2015-07-15 16:44:56, 0-0-0-0-3, 1, 2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0
I would like to import it into a pandas.DataFrame
and give arbitrary names to the columns that have no header, like this:
SAMPLE_TIME, POS, OFF, HISTOGRAM 1 2 3 4 5 6
2015-07-15 16:41:56, 0-0-0-0-3, 1, 2, 0, 5, 59, 4, 0, 0,
2015-07-15 16:42:55, 0-0-0-0-3, 1, 0, 0, 5, 0, 6, 0, nan
2015-07-15 16:43:55, 0-0-0-0-3, 1, 0, 0, 5, 0, 7, nan nan
2015-07-15 16:44:56, 0-0-0-0-3, 1, 2, 0, 5, 0, 0, 2, nan
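I have not been able to import this. I tried different solutions, such as giving a specific header, but still no joy. The only way I could get it to work was by changing the .csv file itself, which rather defeats the purpose of automating this!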
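Then I tried this solution:
lines = list(csv.reader(open('file.csv')))
header, values = lines[0], lines[1:]
This reads the file correctly and gives me a list, values, of ~15k elements. Each element is a list of strings, and each string is a correctly parsed data field from the file. But when I try this:
data = {h: v for h, v in zip(header, zip(*values))}
df = pd.DataFrame.from_dict(data)
the columns without a header disappear and the column order is completely mixed up. Any ideas for a possible solution?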
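Answer 0 (score: 6)
You can create the columns based on the length of the first actual row:
import pandas as pd

with open("out.txt") as f:
    # Read the header line, then count the fields in the first data row
    h, ln = next(f), len(next(f).split(","))
    header = h.strip().split(",")
    # Rewind and skip the header line so pandas starts at the first data row
    f.seek(0)
    next(f)
    # Append numeric names; this yields more names than fields, so the surplus columns are all NaN
    header += range(ln)
    print(pd.read_csv(f, names=header))
Which will give you: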
SAMPLE_TIME POS OFF HISTOGRAM 0 1 2 3 \
0 2015-07-15 16:41:56 0-0-0-0-3 1 2 0 5 59 0
1 2015-07-15 16:42:55 0-0-0-0-3 1 0 0 5 9 0
2 2015-07-15 16:43:55 0-0-0-0-3 1 0 0 5 5 0
3 2015-07-15 16:44:56 0-0-0-0-3 1 2 0 5 0 0
4 5 ... 13 14 15 16 17 18 19 20 21 22
0 0 0 ... 0 0 0 0 0 NaN NaN NaN NaN NaN
1 0 0 ... 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 0 0 ... 4 0 0 0 NaN NaN NaN NaN NaN NaN
3 0 0 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
[4 rows x 27 columns]
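The snippet above sizes the header from the first data row, which works when that row is the widest one. As a minimal sketch, assuming the whole file fits in memory, the width could instead be taken from the widest row. The helper name read_ragged_csv and the shortened SAMPLE text are made up for illustration, and skipinitialspace=True is used so the spaces after the commas do not end up in the values:

```python
import io

import pandas as pd

# Hypothetical, shortened stand-in for file.csv
SAMPLE = """SAMPLE_TIME, POS, OFF, HISTOGRAM
2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0,5,59,0,
2015-07-15 16:42:55, 0-0-0-0-3, 1, 0,0,5,
2015-07-15 16:43:55, 0-0-0-0-3, 1, 0,0,5,5,0,0,4
"""


def read_ragged_csv(handle):
    """Read a CSV whose data rows are wider than its header line."""
    lines = handle.read().splitlines()
    header = [name.strip() for name in lines[0].split(",")]
    # Width of the widest data row, counting a trailing empty field
    width = max(len(line.split(",")) for line in lines[1:])
    # Name the unnamed columns "1", "2", ...
    extra = [str(i) for i in range(1, width - len(header) + 1)]
    body = io.StringIO("\n".join(lines[1:]))
    return pd.read_csv(body, names=header + extra, skipinitialspace=True)


df = read_ragged_csv(io.StringIO(SAMPLE))
print(df)
```

With a real file, the same function could be called as read_ragged_csv(open("file.csv")). Rows shorter than the widest one are padded with NaN.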
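Or you can clean the file before passing it to pandas:
import pandas as pd
from tempfile import TemporaryFile

with open("in.csv") as f, TemporaryFile("w+") as t:
    # Copy the file with every space removed (this also removes the space inside the timestamp)
    for line in f:
        t.write(line.replace(" ", ""))
    t.seek(0)
    # Count the fields in the last line that was read
    ln = len(line.strip().split(","))
    header = t.readline().strip().split(",")
    header += range(ln)
    print(pd.read_csv(t, names=header))
This gives you: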
SAMPLE_TIME POS OFF HISTOGRAM 0 1 2 3 4 5 ... 11 \
0 2015-07-1516:41:56 0-0-0-0-3 1 2 0 5 59 0 0 0 ... 0
1 2015-07-1516:42:55 0-0-0-0-3 1 0 0 5 9 0 0 0 ... 0
2 2015-07-1516:43:55 0-0-0-0-3 1 0 0 5 5 0 0 0 ... 0
3 2015-07-1516:44:56 0-0-0-0-3 1 2 0 5 0 0 0 0 ... 0
12 13 14 15 16 17 18 19 20
0 0 0 0 0 0 0 NaN NaN NaN
1 50 0 NaN NaN NaN NaN NaN NaN NaN
2 0 4 0 0 0 NaN NaN NaN NaN
3 6 0 0 0 0 NaN NaN NaN NaN
[4 rows x 25 columns]
Or drop the columns that are entirely NaN:
print(pd.read_csv(f, names=header).dropna(axis=1, how="all"))
Giving you:
SAMPLE_TIME POS OFF HISTOGRAM 0 1 2 3 \
0 2015-07-15 16:41:56 0-0-0-0-3 1 2 0 5 59 0
1 2015-07-15 16:42:55 0-0-0-0-3 1 0 0 5 9 0
2 2015-07-15 16:43:55 0-0-0-0-3 1 0 0 5 5 0
3 2015-07-15 16:44:56 0-0-0-0-3 1 2 0 5 0 0
4 5 ... 8 9 10 11 12 13 14 15 16 17
0 0 0 ... 2 0 0 0 0 0 0 0 0 0
1 0 0 ... 2 0 0 0 50 0 NaN NaN NaN NaN
2 0 0 ... 2 0 0 0 0 4 0 0 0 NaN
3 0 0 ... 2 0 0 0 6 0 0 0 0 NaN
[4 rows x 22 columns]
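To make the effect of how="all" concrete, here is a small sketch on a made-up three-column frame (the column names a, b and c are purely illustrative). Only the column that is NaN in every row is dropped, whereas how="any" would also drop a column with a single missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "b" is partly NaN, "c" is entirely NaN
df = pd.DataFrame({"a": [1, 2], "b": [np.nan, 3], "c": [np.nan, np.nan]})

# Drops only the columns where every value is NaN
all_nan_dropped = df.dropna(axis=1, how="all")

# Drops every column that contains at least one NaN
any_nan_dropped = df.dropna(axis=1, how="any")

print(list(all_nan_dropped.columns))
print(list(any_nan_dropped.columns))
```

For the histogram data, how="all" is the safer choice, since the shorter rows legitimately leave NaN in the later columns.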
Answer 1 (score: 3)
You can split the HISTOGRAM column into a new DataFrame and then concat it to the original.
print(df)
SAMPLE_TIME, POS, OFF, \
0 2015-07-15 16:41:56 0-0-0-0-3, 1,
1 2015-07-15 16:42:55 0-0-0-0-3, 1,
2 2015-07-15 16:43:55 0-0-0-0-3, 1,
3 2015-07-15 16:44:56 0-0-0-0-3, 1,
HISTOGRAM
0 2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,
1 0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,
2 0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,
3 2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0
# create a new DataFrame from the HISTOGRAM column
h = pd.DataFrame([x.split(',') for x in df['HISTOGRAM'].tolist()])
print(h)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 2 0 5 59 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0
1 0 0 5 9 0 0 0 0 0 2 0 0 0 50 0 None None None None
2 0 0 5 5 0 0 0 0 0 2 0 0 0 0 4 0 0 0 None
3 2 0 5 0 0 0 0 0 0 2 0 0 0 6 0 0 0 0 None None
# append to the original and rename column 0 (the original HISTOGRAM string column is kept as well)
df = pd.concat([df, h], axis=1).rename(columns={0: 'HISTOGRAM'})
print(df)
HISTOGRAM HISTOGRAM 1 2 3 4 5 ... 10 \
0 2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0, 2 0 5 59 0 0 ... 0
1 0,0,5,9,0,0,0,0,0,2,0,0,0,50,0, 0 0 5 9 0 0 ... 0
2 0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0, 0 0 5 5 0 0 ... 0
3 2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0 2 0 5 0 0 0 ... 0
11 12 13 14 15 16 17 18 19
0 0 0 0 0 0 0 0 0
1 0 0 50 0 None None None None
2 0 0 0 4 0 0 0 None
3 0 0 6 0 0 0 0 None None
[4 rows x 24 columns]
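A possible variation on the same idea, sketched under the assumption that HISTOGRAM has already been read as a single string column: Series.str.split with expand=True builds the new columns directly, and pd.to_numeric turns the strings into numbers. The two-row frame below is a made-up stand-in for the df printed above, and the names "1", "2", ... are arbitrary:

```python
import pandas as pd

# Hypothetical stand-in for the df shown above
df = pd.DataFrame({
    "SAMPLE_TIME": ["2015-07-15 16:41:56", "2015-07-15 16:42:55"],
    "HISTOGRAM": ["2,0,5,59,", "0,0,5"],
})

# Strip the trailing comma, then split into one column per value
parts = df["HISTOGRAM"].str.rstrip(",").str.split(",", expand=True)

# Convert the strings to numbers; missing values become NaN
parts = parts.apply(pd.to_numeric)

# Name the first new column HISTOGRAM and number the rest
parts.columns = ["HISTOGRAM"] + [str(i) for i in range(1, parts.shape[1])]

# Replace the original string column with the expanded columns
result = pd.concat([df.drop(columns="HISTOGRAM"), parts], axis=1)
print(result)
```

Unlike the concat above, this drops the original string column, so the result has only one column named HISTOGRAM.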
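Answer 2 (score: -1)
How about this. I made a csv from your sample data.
When importing the rows:
import csv

with open('test.csv', newline='') as f:
    lines = list(csv.reader(f))
headers, values = lines[0], lines[1:]
To generate nice header names, use the following line:
headers = [i or ind for ind, i in enumerate(headers)]
Because of (I assume) how csv works, the header should contain a bunch of empty-string values. An empty string evaluates to False, so this comprehension returns a numbered name for every column that has no header.
Then make a df:
df = pd.DataFrame(values,columns=headers)
Which looks like:
SAMPLE_TIME POS OFF HISTOGRAM 4 5 6 7 8 9 \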
0 15/07/2015 16:41 0-0-0-0-3 1 2 0 5 59 0 0 0
1 15/07/2015 16:42 0-0-0-0-3 1 0 0 5 9 0 0 0
2 15/07/2015 16:43 0-0-0-0-3 1 0 0 5 5 0 0 0
3 15/07/2015 16:44 0-0-0-0-3 1 2 0 5 0 0 0 0
... 12 13 14 15 16 17 18 19 20 21
0 ... 2 0 0 0 0 0 0 0 0 0
1 ... 2 0 0 0 50 0
2 ... 2 0 0 0 0 4 0 0 0
3 ... 2 0 0 0 6 0 0 0 0
[4 rows x 22 columns]
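The header only contains empty strings if the csv was saved with padded rows. As a sketch for the unpadded case, the header and the short rows could be padded by hand before applying the same `name or index` trick. This uses only the standard library; load_rows and the shortened SAMPLE text are hypothetical:

```python
import csv
import io

# Hypothetical, shortened stand-in for test.csv without a padded header
SAMPLE = """SAMPLE_TIME, POS, OFF, HISTOGRAM
2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0,5,
2015-07-15 16:42:55, 0-0-0-0-3, 1, 0,0
"""


def load_rows(handle):
    """Return (headers, values) with every row padded to the same width."""
    rows = list(csv.reader(handle, skipinitialspace=True))
    headers, values = rows[0], rows[1:]
    width = max(len(row) for row in values)
    # Pad the header with empty strings, then number the unnamed columns
    headers = headers + [""] * (width - len(headers))
    headers = [name or index for index, name in enumerate(headers)]
    # Pad short rows with None so each row matches the header length
    values = [row + [None] * (width - len(row)) for row in values]
    return headers, values


headers, values = load_rows(io.StringIO(SAMPLE))
print(headers)
```

The padded values and headers can then be passed to pd.DataFrame(values, columns=headers) as in the answer. The fields are still strings at that point, so a numeric conversion would be a separate step.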
Answer 3 (score: -2)
Assuming your data is in a file called foo.csv, you can do the following. This was tested against Pandas 0.17.
df = pd.read_csv('foo.csv', names=['sample_time', 'pos', 'off', 'histogram', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17'], skiprows=1)
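The long list of names does not have to be typed out. A minimal sketch, assuming the number of unnamed columns is known in advance (N_EXTRA is a made-up constant, and the two-row SAMPLE text stands in for foo.csv), could build the list with a comprehension:

```python
import io

import pandas as pd

# Hypothetical, shortened stand-in for foo.csv
SAMPLE = """SAMPLE_TIME, POS, OFF, HISTOGRAM
2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0,5
2015-07-15 16:42:55, 0-0-0-0-3, 1, 0,0
"""

# Assumed number of unnamed histogram columns in the widest row
N_EXTRA = 2

names = ["sample_time", "pos", "off", "histogram"] + [
    str(i) for i in range(1, N_EXTRA + 1)
]

# skiprows=1 skips the original header line; names supplies the full header
df = pd.read_csv(io.StringIO(SAMPLE), names=names, skiprows=1,
                 skipinitialspace=True)
print(df)
```

The list of names needs to be at least as long as the widest row, counting any empty field produced by a trailing comma, otherwise pandas may not line the columns up as intended.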