Question

I am new to Python (I started learning 2 weeks ago) and I am trying to plot a file that looks like this:

"1stSerie"    
2 23
4 12
6 12

"2ndSerie"
2 51
4 90
6 112

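using any of the following: pandas, matplotlib and numpy. But I am not having much success. I searched for examples, but none applied to my data format.

Can somebody help me find out how to load this file into a pandas DataFrame or (even better) show me how to plot it?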

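Details:

The number of lines in each series differs between my datasets, but within the same dataset (file) the number of rows is the same (as shown in the code excerpt).
There is a blank line between each series, exactly as reproduced in the code excerpt.
The title of a series is the whole string, but it is a single word; I can change the titles if two words/columns are easier to import.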

Update 1:

With help from @Goyo I changed my convert() method to:

#!/usr/bin/env python3
def convert(in_file, out_file):
    name = ""
    for line in in_file:
        line = line.strip()
        print(line)
        if line == "":
            continue
        if line.startswith('"'):
            name = line.strip('"')
            print("NAME:: " + name)
        else:
            out_file.write("{0}\n".format(','.join([name] + line.split("\t"))))

To plot, I am using the following code:

import pandas as pd

with open('nro_caribou.dat') as in_file:
    with open('output.txt', 'w+') as out_file:
        convert(in_file, out_file)
df = pd.read_csv('output.txt', header=None, names=['Methods', 'Param', 'Time'], sep=",")
print(df)
df.pivot(values='Time', index='Param', columns='Methods').plot()

My raw data: https://gist.github.com/pedro-stanaka/c3eda0aa2191950a8d83

My plot:

Answer 1

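AFAIK there are no built-in features in pandas, matplotlib or numpy to read files like that one. If you have some control over the data format, I encourage you to change it.

If you have no option but to use that format, you can parse the data yourself using Python's I/O and string manipulation features (I do not think pandas can make this easier; it is not designed to deal with this kind of file).

This function can convert data from your format to another that is more suitable for pandas: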

def convert(in_file, out_file):
    for line in in_file:
        line = line.rstrip(' \n\r') 
        if not line:
            continue
        if line.startswith('"'):
            name = line.strip('"')
        else:
            out_file.write('{}\n'.format(','.join([name] + line.split())))

If your original file is 'input.txt', you would use it like this:

with open('input.txt') as in_file:
    with open('output.txt', 'w') as out_file:
        convert(in_file, out_file)
df = pd.read_csv('output.txt', header=None,
                 names=['Series', 'X', 'Y'])
print(df)

     Series  X    Y
0  1stSerie  2   23
1  1stSerie  4   12
2  1stSerie  6   12
3  2ndSerie  2   51
4  2ndSerie  4   90
5  2ndSerie  6  112

df.pivot(index='X', columns='Series', values='Y').plot()
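
If the intermediate output.txt is not needed, the same parsing idea can be kept in memory. The following is only a minimal sketch under that assumption: parse_series and sample are hypothetical names introduced for illustration, not part of the answer above, and each data row is assumed to hold exactly two numeric fields.

```python
def parse_series(lines):
    """Return (series_name, x, y) records from an iterable of text lines."""
    records = []
    name = None  # title of the block currently being read
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                    # blank separator between blocks
        if line.startswith('"'):
            name = line.strip('"')      # a quoted title starts a new block
            continue
        if name is None:
            raise ValueError("data row found before any series title")
        x, y = line.split()             # splits on spaces or tabs
        records.append((name, float(x), float(y)))
    return records


# Hypothetical in-memory copy of the sample file from the question.
sample = '''"1stSerie"
2 23
4 12
6 12

"2ndSerie"
2 51
4 90
6 112
'''.splitlines()

records = parse_series(sample)
print(records[0])
```

The resulting records can be passed to pd.DataFrame(records, columns=['Series', 'X', 'Y']) and pivoted exactly as shown above.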

Answer 2

I think you can call read_csv only once and then post-process the result to create the DataFrame:

import pandas as pd
import io

temp=u""""1stSerie"    
2 23
4 12
6 12

"2ndSerie"
2 51
4 90
6 112
"""

s = pd.read_csv(io.StringIO(temp),  # after testing, replace io.StringIO(temp) with the filename
                sep=r"\s+",
                engine='python',    # avoids a ParserWarning
                squeeze=True,       # try to return a Series (removed in pandas 2.0; call .squeeze("columns") on the result there)
                header=None)

print(s)

"1stSerie"    NaN
2              23
4              12
6              12
"2ndSerie"    NaN
2              51
4              90
6             112
Name: 0, dtype: float64

df = s.reset_index()
# set the column names
df.columns = ['idx','val']

# try to convert column idx to numeric; strings become NaN
print(pd.to_numeric(df['idx'], errors='coerce'))
0   NaN
1     2
2     4
3     6
4   NaN
5     2
6     4
7     6
Name: idx, dtype: float64

# find the NaN rows, i.e. the values that are strings
print(pd.isnull(pd.to_numeric(df['idx'], errors='coerce')))
0     True
1    False
2    False
3    False
4     True
5    False
6    False
7    False
Name: idx, dtype: bool

# copy those string values (the series titles) into a new column 'names'
df.loc[pd.isnull(pd.to_numeric(df['idx'], errors='coerce')), 'names'] = df['idx']

# forward fill the NaN values
df['names'] = df['names'].ffill()

# remove the rows where column val is NaN
df = df[pd.notnull(df['val'])]

print(df)
  idx  val       names
1   2   23  "1stSerie"
2   4   12  "1stSerie"
3   6   12  "1stSerie"
5   2   51  "2ndSerie"
6   4   90  "2ndSerie"
7   6  112  "2ndSerie"

df.pivot(index='idx', columns='names', values='val').plot()

Or you can use read_csv and plot. If you need the Series names in the legend, use figure and legend:

import pandas as pd
import matplotlib.pyplot as plt
import io

temp=u""""1stSerie"    
2 23
4 12
6 12

"2ndSerie"
2 51
4 90
6 112"""

s1 = pd.read_csv(io.StringIO(temp),  # after testing, replace io.StringIO(temp) with the filename
                 sep=r"\s+",
                 engine='python',    # avoids a ParserWarning
                 nrows=3,            # read only 3 rows of data
                 squeeze=True)       # try to return a Series

print(s1)
2    23
4    12
6    12
Name: "1stSerie", dtype: int64

# after testing, replace io.StringIO(temp) with the filename
s2 = pd.read_csv(io.StringIO(temp),
                 sep=r"\s+",
                 header=4,           # use row 4 as the header (the series name)
                 engine='python',
                 nrows=3,
                 squeeze=True)

print(s2)
2     51
4     90
6    112
Name: "2ndSerie", dtype: int64

plt.figure()
s1.plot()
ax = s2.plot()
ax.legend(['1stSerie','2ndSerie'])
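
As an alternative to passing the names to legend by hand, matplotlib can build the legend from a label given to each line. This is only a sketch under the assumption that the series are already available as plain lists; the series dictionary below is a hypothetical stand-in for the parsed file.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; drop this line for interactive use
import matplotlib.pyplot as plt

# Hypothetical parsed data: series name -> (x values, y values).
series = {
    "1stSerie": ([2, 4, 6], [23, 12, 12]),
    "2ndSerie": ([2, 4, 6], [51, 90, 112]),
}

fig, ax = plt.subplots()
for name, (x, y) in series.items():
    ax.plot(x, y, marker="o", label=name)  # label feeds the legend
ax.legend()                                # picks up the labels set above
```

Call plt.show() or fig.savefig(...) afterwards, depending on whether the figure should appear on screen or be written to a file.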

Or you can read the file only once, then cut the Series s into the Series s1, s2 and s3, and then create a DataFrame:

import pandas as pd
import matplotlib.pyplot as plt
import io

temp=u""""1stSerie"    
2 23
4 12
6 12

"2ndSerie"
2 51
4 90
6 112

"3rdSerie"
2 51
4 90
6 112
"""

s = pd.read_csv(io.StringIO(temp),  # after testing, replace io.StringIO(temp) with the filename
                sep=r"\s+",
                engine='python',    # avoids a ParserWarning
                squeeze=True)       # try to return a Series

print(s)
2              23
4              12
6              12
"2ndSerie"    NaN
2              51
4              90
6             112
"3rdSerie"    NaN
2              51
4              90
6             112
Name: "1stSerie", dtype: float64
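
The cutting step itself is not shown above. One possible sketch follows, assuming the Series has the shape printed above (the first title as the Series name, later titles as NaN-valued rows). The Series is rebuilt by hand here so the example is self-contained, and the names labels, values, titles, tidy and wide are introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the Series `s` printed above.
s = pd.Series(
    [23, 12, 12, np.nan, 51, 90, 112, np.nan, 51, 90, 112],
    index=["2", "4", "6", '"2ndSerie"', "2", "4", "6",
           '"3rdSerie"', "2", "4", "6"],
    name='"1stSerie"',
)

labels = pd.Series(s.index, dtype=object)  # positional copy of the index labels
values = pd.Series(s.to_numpy())           # positional copy of the values
is_title = values.isna()                   # True on the rows that hold a title

# Keep the label only on title rows, carry it forward over the data rows,
# and give the rows before the first title the Series name.
titles = labels.where(is_title).ffill().fillna(s.name).str.strip('"')

tidy = pd.DataFrame({
    "name": titles[~is_title],
    "x": pd.to_numeric(labels[~is_title]),
    "y": values[~is_title],
})
wide = tidy.pivot(index="x", columns="name", values="y")
print(wide)
```

Calling wide.plot() then draws one line per series, with the titles in the legend.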

Answer 3

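You can step through the file using itertools.groupby. The LastHeader class below checks each line for a sentinel character. If the character is there, the header line is updated and itertools.groupby starts a new segment. The only place this runs into trouble with your dataset is where you have two series labeled "CRE". My workaround was to delete the second one from the text file, but you will probably want to do something else.

The upshot is that you can process the data in a single pass. No writing out and reading back in is needed.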

from itertools import groupby
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from pandas import DataFrame, Series

class LastHeader():
    """Checks for new header strings. For use with groupby"""
    def __init__(self, sentinel='#'):
        self.sentinel = sentinel
        self.lastheader = ''
        self.index=0

    def check(self, line):
        self.index += 1
        if line.startswith(self.sentinel):
            self.lastheader = line
        return self.lastheader

fname = 'dist_caribou.dat'

with open(fname, 'r') as fobj:
    lastheader = LastHeader('"')
    data = []
    for headerline, readlines in groupby(fobj, lastheader.check):
        name = headerline.strip().strip('"')
        thisdat = np.loadtxt(readlines, comments='"')
        data.append(Series(thisdat[:, 1], index=thisdat[:, 0], name=name))
data = pd.concat(data, axis=1)
data.plot().set_yscale('log')
plt.show()
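
To see how the stateful key drives groupby, here is a small sketch on in-memory lines using only the standard library. It illustrates the mechanism rather than replacing the code above: the lines list is a hypothetical stand-in for the file, and this variant of LastHeader is made callable and returns the cleaned title.

```python
from itertools import groupby

# Hypothetical stand-in for the lines of the data file.
lines = ['"1stSerie"\n', "2 23\n", "4 12\n", "6 12\n", "\n",
         '"2ndSerie"\n', "2 51\n", "4 90\n", "6 112\n"]


class LastHeader:
    """Key function that remembers the most recent header line."""

    def __init__(self, sentinel):
        self.sentinel = sentinel
        self.last = ""

    def __call__(self, line):
        if line.startswith(self.sentinel):
            self.last = line.strip().strip(self.sentinel)
        return self.last  # the same key until the next header appears


series = {}
for name, group in groupby(lines, key=LastHeader('"')):
    # Skip the header line itself and any blank separator lines.
    rows = [ln.split() for ln in group
            if ln.strip() and not ln.startswith('"')]
    series[name] = [(int(x), int(y)) for x, y in rows]

print(series)
```

Because groupby only compares consecutive keys, two blocks with the same title that are not adjacent arrive as separate groups, and in this sketch the second would overwrite the first in the dictionary. That is the same kind of duplicate-title problem mentioned for "CRE".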

Answer 4

Given the appropriate parameters for read_csv in pandas, this is relatively trivial to plot.

s1 = pd.read_csv('series1.txt', 
                 index_col=0, 
                 sep=" ", 
                 squeeze=True, 
                 header=0, 
                 skipinitialspace=True)
>>> s1
tSerie
2    23
4    12
6    12
Name: Unnamed: 1, dtype: int64

s2 = pd.read_csv('series2.txt', 
                 index_col=0, 
                 sep=" ", 
                 squeeze=True, 
                 header=0, 
                 skipinitialspace=True)

%matplotlib inline  # If not already enabled.
s1.plot();s2.plot()

Python Matplotlib - multiple series in the same data file
