I'm a Python "newbie" (I started learning 2 weeks ago), and I'm trying to plot a file that looks like this:
"1stSerie"
2 23
4 12
6 12
"2ndSerie"
2 51
4 90
6 112
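I would like to use any of the following: pandas, matplotlib, and numpy, but I haven't had much success. I searched for examples, but none of them applied to my data format.
Can someone help me figure out how to load this file into a pandas DataFrame, or (even better) show me how to plot it?
Details:
With @Goyo's help, I changed my convert() method to: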
#!/usr/bin/env python3
def convert(in_file, out_file):
    name = ""
    for line in in_file:
        line = line.strip()
        print(line)
        if line == "":
            continue
        if line.startswith('"'):
            name = line.strip('"')
            print("NAME:: " + name)
        else:
            out_file.write("{0}\n".format(','.join([name] + line.split("\t"))))
To plot, I'm using the following code:
import pandas as pd

with open('nro_caribou.dat') as in_file:
    with open('output.txt', 'w+') as out_file:
        convert(in_file, out_file)
df = pd.read_csv('output.txt', header=None, names=['Methods', 'Param', 'Time'], sep=",")
print(df)
df.pivot(values='Time', index='Param', columns='Methods').plot()
My original data: https://gist.github.com/pedro-stanaka/c3eda0aa2191950a8d83
My plot: [image]
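Answer 0 (score: 2)
AFAIK there is no built-in functionality in pandas, matplotlib, or numpy for reading files like that. If you have some control over the data format, I suggest you change it.
If you have no choice but to use that format, you can parse the data yourself with Python's I/O and string manipulation features (I don't think pandas can make this any easier; it is not designed to handle this kind of file).
This function converts the data into a format better suited to pandas: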
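def convert(in_file, out_file):
    name = ''  # series name for data rows that appear before any header
    for line in in_file:
        line = line.rstrip(' \n\r')
        if not line:
            continue  # skip blank lines
        if line.startswith('"'):
            # a quoted line starts a new series
            name = line.strip('"')
        else:
            # data row: write "name,x,y"
            out_file.write('{}\n'.format(','.join([name] + line.split())))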
If your original file is 'input.txt', you would use it like this:
import pandas as pd

with open('input.txt') as in_file:
    with open('output.txt', 'w') as out_file:
        convert(in_file, out_file)
df = pd.read_csv('output.txt', header=None,
                 names=['Series', 'X', 'Y'])
print(df)
     Series  X    Y
0  1stSerie  2   23
1  1stSerie  4   12
2  1stSerie  6   12
3  2ndSerie  2   51
4  2ndSerie  4   90
5  2ndSerie  6  112
df.pivot(index='X', columns='Series', values='Y').plot()
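The intermediate output.txt is probably not strictly required: convert() only iterates over in_file and calls out_file.write(), so any file-like object should work. A minimal sketch, assuming the sample data from the question and using io.StringIO on both sides (the names sample, src, dst and csv_text are illustrative only):

```python
import io

def convert(in_file, out_file):
    """Same idea as convert() above: prefix each data row with its series name."""
    name = ''
    for line in in_file:
        line = line.strip()
        if not line:
            continue
        if line.startswith('"'):
            name = line.strip('"')
        else:
            out_file.write('{}\n'.format(','.join([name] + line.split())))

# assumption: the sample from the question, held in memory instead of a file
sample = '"1stSerie"\n2 23\n4 12\n6 12\n"2ndSerie"\n2 51\n4 90\n6 112\n'

src = io.StringIO(sample)   # stands in for the opened input file
dst = io.StringIO()         # collects the CSV text in memory
convert(src, dst)
csv_text = dst.getvalue()
print(csv_text)
```

After dst.seek(0), the buffer can be passed to pd.read_csv(dst, header=None, names=['Series', 'X', 'Y']) in place of the file name.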
Answer 1 (score: 1)
I think you can call read_csv just once and then post-process the result to build the DataFrame:
import pandas as pd
import io

temp = u""""1stSerie"
2 23
4 12
6 12
"2ndSerie"
2 51
4 90
6 112
"""
# after testing, replace io.StringIO(temp) with the filename
s = pd.read_csv(io.StringIO(temp),
                sep=r"\s+",
                engine='python',  # avoids a ParserWarning
                squeeze=True,     # try to convert the output to a Series (pandas < 2.0 only)
                header=None)
print(s)
"1stSerie" NaN
2 23
4 12
6 12
"2ndSerie" NaN
2 51
4 90
6 112
Name: 0, dtype: float64
df = s.reset_index()
# set column names
df.columns = ['idx', 'val']
# try to convert column idx to numeric; strings become NaN
print(pd.to_numeric(df['idx'], errors='coerce'))
0 NaN
1 2
2 4
3 6
4 NaN
5 2
6 4
7 6
Name: idx, dtype: float64
# find the NaN entries, i.e. the rows where idx is a string
print(pd.isnull(pd.to_numeric(df['idx'], errors='coerce')))
0 True
1 False
2 False
3 False
4 True
5 False
6 False
7 False
Name: idx, dtype: bool
# copy those string values into a new column called names
df.loc[pd.isnull(pd.to_numeric(df['idx'], errors='coerce')), 'names'] = df['idx']
# forward-fill the NaN values
df['names'] = df['names'].ffill()
# drop the rows where column val is NaN
df = df[pd.notnull(df['val'])]
print(df)
  idx  val       names
1   2   23  "1stSerie"
2   4   12  "1stSerie"
3   6   12  "1stSerie"
5   2   51  "2ndSerie"
6   4   90  "2ndSerie"
7   6  112  "2ndSerie"
df.pivot(index='idx', columns='names', values='val').plot()
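The same flag-and-forward-fill idea can also be written without relying on how read_csv treats rows of different lengths. A rough sketch, assuming the file fits in memory and that header lines always start with a double quote; the names raw, lines, is_header, parts and wide are illustrative and not taken from the answer:

```python
import pandas as pd

# assumption: the sample from the question; for a real file, read its text instead
raw = '"1stSerie"\n2 23\n4 12\n6 12\n"2ndSerie"\n2 51\n4 90\n6 112\n'

# one string per line of the file
lines = pd.Series(raw.splitlines())
# True for the quoted header lines, False for data rows
is_header = lines.str.startswith('"')
# keep only the header text, strip the quotes, forward-fill it over the data rows
names = lines.where(is_header).str.strip('"').ffill()
# split the data rows into two integer columns
parts = lines[~is_header].str.split(expand=True).astype(int)

df = pd.DataFrame({'names': names[~is_header],
                   'idx': parts[0],
                   'val': parts[1]})
wide = df.pivot(index='idx', columns='names', values='val')
print(wide)
```

Calling wide.plot() should then draw one line per series, as in the pivot call above.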
Or you can use read_csv and plot. If you need the Series names as the legend, use figure and legend:
import pandas as pd
import matplotlib.pyplot as plt
import io

temp = u""""1stSerie"
2 23
4 12
6 12
"2ndSerie"
2 51
4 90
6 112"""
# after testing, replace io.StringIO(temp) with the filename
s1 = pd.read_csv(io.StringIO(temp),
                 sep=r"\s+",
                 engine='python',  # avoids a ParserWarning
                 nrows=3,          # read only 3 rows of data
                 squeeze=True)     # try to convert the output to a Series (pandas < 2.0 only)
print(s1)
2 23
4 12
6 12
Name: "1stSerie", dtype: int64
# after testing, replace io.StringIO(temp) with the filename
s2 = pd.read_csv(io.StringIO(temp),
                 sep=r"\s+",
                 header=4,  # use row 4 as the header, i.e. the series name
                 engine='python',
                 nrows=3,
                 squeeze=True)
print(s2)
2 51
4 90
6 112
Name: "2ndSerie", dtype: int64
plt.figure()
s1.plot()
ax = s2.plot()
ax.legend(['1stSerie','2ndSerie'])
Or you can read the file only once, cut the Series s into the Series s1, s2 and s3, and then create a DataFrame:
import pandas as pd
import matplotlib.pyplot as plt
import io

temp = u""""1stSerie"
2 23
4 12
6 12
"2ndSerie"
2 51
4 90
6 112
"3rdSerie"
2 51
4 90
6 112
"""
# after testing, replace io.StringIO(temp) with the filename
s = pd.read_csv(io.StringIO(temp),
                sep=r"\s+",
                engine='python',  # avoids a ParserWarning
                squeeze=True)     # try to convert the output to a Series (pandas < 2.0 only)
print(s)
2 23
4 12
6 12
"2ndSerie" NaN
2 51
4 90
6 112
"3rdSerie" NaN
2 51
4 90
6 112
Name: "1stSerie", dtype: float64
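The answer stops before showing the cutting step. One possible way to do it, offered as a sketch rather than the answerer's code, is to locate the header lines and slice between them. Here the text is split in plain Python instead of cutting the Series s, and the names raw, bounds and columns are hypothetical:

```python
import pandas as pd

# assumption: the three-series sample used in the snippet above
raw = ('"1stSerie"\n2 23\n4 12\n6 12\n'
       '"2ndSerie"\n2 51\n4 90\n6 112\n'
       '"3rdSerie"\n2 51\n4 90\n6 112\n')

lines = raw.splitlines()
# positions of the quoted header lines, plus the end of the data as a final boundary
bounds = [i for i, line in enumerate(lines) if line.startswith('"')] + [len(lines)]

columns = {}
for start, stop in zip(bounds, bounds[1:]):
    name = lines[start].strip('"')
    # the rows between this header and the next one belong to this series
    rows = [line.split() for line in lines[start + 1:stop]]
    columns[name] = pd.Series([int(y) for _, y in rows],
                              index=[int(x) for x, _ in rows])

df = pd.DataFrame(columns)   # one column per series, indexed by x
print(df)
```

With the data in this shape, df.plot() should draw one line per series and use the series names as the legend.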
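Answer 2 (score: 1)
You can use itertools.groupby to step through the file. The LastHeader class below checks each line for a sentinel character. If the character is there, the header line is updated and itertools.groupby starts a new group. The only place this runs into trouble with your dataset is that you have two series labelled "CRE". My workaround was to delete the second one from the text file, but you will probably want to do something else.
The upshot is that you can process the data in a single pass, with no need to write an intermediate file and read it back.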
from itertools import groupby
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame, Series


class LastHeader():
    """Checks for new header strings. For use with groupby"""
    def __init__(self, sentinel='#'):
        self.sentinel = sentinel
        self.lastheader = ''
        self.index = 0

    def check(self, line):
        self.index += 1
        if line.startswith(self.sentinel):
            self.lastheader = line
        return self.lastheader


fname = 'dist_caribou.dat'
with open(fname, 'r') as fobj:
    lastheader = LastHeader('"')
    data = []
    for headerline, readlines in groupby(fobj, lastheader.check):
        name = headerline.strip().strip('"')
        thisdat = np.loadtxt(readlines, comments='"')
        data.append(Series(thisdat[:, 1], index=thisdat[:, 0], name=name))
data = pd.concat(data, axis=1)
data.plot().set_yscale('log')
plt.show()
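To see why this grouping works, it may help to run groupby with a stateful key on a few in-memory lines. This is a simplified sketch, not the answer's class: the helper make_key and the sample list are made up for illustration. The key returns the most recent header for every line, so consecutive lines share a key until the next header appears.

```python
from itertools import groupby

def make_key(sentinel='"'):
    """Return a key function that remembers the last header line seen."""
    last = ['']
    def key(line):
        if line.startswith(sentinel):
            last[0] = line.strip()
        return last[0]
    return key

# assumption: a shortened version of the question's sample, as raw file lines
sample = ['"1stSerie"\n', '2 23\n', '4 12\n', '"2ndSerie"\n', '2 51\n']

groups = {}
for header, block in groupby(sample, make_key()):
    # the header line itself is the first item of each block, so skip it here
    # (the answer's code skips it via comments='"' in np.loadtxt)
    rows = [tuple(map(int, line.split())) for line in block
            if not line.startswith('"')]
    groups[header.strip('"')] = rows

print(groups)
```

Each block has to be consumed before the loop advances, because groupby shares one underlying iterator across all groups.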
Answer 3 (score: 0)
Given the appropriate parameters for read_csv in pandas, this is relatively trivial to plot.
s1 = pd.read_csv('series1.txt',
                 index_col=0,
                 sep=" ",
                 squeeze=True,
                 header=0,
                 skipinitialspace=True)
>>> s1
tSerie
2 23
4 12
6 12
Name: Unnamed: 1, dtype: int64
s2 = pd.read_csv('series2.txt',
                 index_col=0,
                 sep=" ",
                 squeeze=True,
                 header=0,
                 skipinitialspace=True)
# If not already enabled.
%matplotlib inline
s1.plot();s2.plot()