Trying to parse a .dat file and store it in Pandas

Asked: 2017-10-09 07:05:32

Tags: python pandas parsing

I have an example that parses a file in a similar format.
Sample data (.data):

+ Naoki Abe
- Myriam Abramson
+ David W. Aha
+ Kamal M. Ali
- Eric Allender

This is the Python example that stores the data in a 2D array:

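import pandas as pd

# read in the dataset (no header row; the data has no commas, so each line is one field)
df = pd.read_csv(
    filepath_or_buffer='path/to/.data/file',
    header=None,
    sep=',')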

# separate names from classes
vals = df.loc[:,:].values
names = [n[0][2:] for n in vals]
cls = [n[0][0] for n in vals]

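As I understand it, this Python code reads the data into the variable df and extracts the string associated with each person into vals. It then splits each string in vals into names and cls, so that the i-th person's name is in names[i] and the associated class is in cls[i].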

However, I want to use a similar approach to parse another, similar dataset (.dat):

-1  this is comment1 blah blah blah (it is a big paragraph)
-1  this is comment2 blah blah blah (it is a big paragraph)
-1  this is comment3 blah blah blah (it is a big paragraph)

So I modified the example to:

# read in the dataset
df = pd.read_csv(
    engine='python',
    filepath_or_buffer='data/Pro1/train.dat', 
    header=None, 
    sep='\t+')

# separate names from classes
vals = df.loc[:,:].values
comm = [n[0][2:] for n in vals]
rates = [n[:1][0] for n in vals]  

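I get the error message TypeError: 'long' object has no attribute '__getitem__' on the line comm = [n[0][2:] for n in vals]. I searched for the error message, and the explanation was that I am trying to store an int into a string (?). What I am trying to store is the whole comment paragraph, which is a string. In the example, storing the name strings worked just fine.

My other question: since I have to parse a TAB-delimited file, I guess that what follows -1 is a tab rather than a space, and I am not sure the array ranges I set are correct.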

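My experience: I am not a Python expert, as you may have guessed. I can certainly read code, but I have to do research when writing it. Python is the only option I have right now for this kind of data analysis.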

1 Answer:

Answer 0: (score: 0)

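The first file contains no comma delimiters, so each line of the file yields a single string, for example '+ Naoki Abe'. You can therefore use string slicing to separate the name from the rest of the string.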

>>> import pandas as pd
>>> df = pd.read_csv('temp.csv', header=None, sep=',')
>>> vals = df.loc[:,:].values
>>> vals
array([['+ Naoki Abe'],
       ['- Myriam Abramson'],
       ['+ David W. Aha'],
       ['+ Kamal M. Ali'],
       ['- Eric Allender']], dtype=object)
>>> names = [n[0][2:] for n in vals]
>>> names
['Naoki Abe', 'Myriam Abramson', 'David W. Aha', 'Kamal M. Ali', 'Eric Allender']
>>> cls = [n[0][0] for n in vals]
>>> cls
['+', '-', '+', '+', '-']

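I also suspect that a tab separates -1 from the rest of each line. As a result, pandas splits each line on the tab. The first column is then parsed as an integer, so n[0] is a number rather than a string, which is why n[0][2:] raises the TypeError. In other words, once the tab is declared as the delimiter, you cannot use string slicing.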

>>> df2 = pd.read_csv('temp2.csv', engine='python', header=None, sep='\t')
>>> vals2 = df2.loc[:,:].values
>>> vals2
array([[-1, 'this is comment1 blah blah blah (it is a big paragraph)'],
       [-1, 'this is comment2 blah blah blah (it is a big paragraph)'],
       [-1, 'this is comment3 blah blah blah (it is a big paragraph)']], dtype=object)
>>> first = [val[0] for val in vals2]
>>> first
[-1, -1, -1]
>>> second = [val[1] for val in vals2]
>>> second
['this is comment1 blah blah blah (it is a big paragraph)', 'this is comment2 blah blah blah (it is a big paragraph)', 'this is comment3 blah blah blah (it is a big paragraph)']
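Since the tab already splits each line into two columns, the same result can be obtained by reading the columns directly instead of looping over .values. The following is a minimal sketch, not the asker's actual setup: the inline sample string stands in for the real file, and the column names rate and comment are made up for illustration.

```python
import io
import pandas as pd

# Inline stand-in for the tab-separated file; contents mirror the sample in the question.
sample = (
    "-1\tthis is comment1 blah blah blah (it is a big paragraph)\n"
    "-1\tthis is comment2 blah blah blah (it is a big paragraph)\n"
)

# "rate" and "comment" are hypothetical column names chosen for readability.
df = pd.read_csv(io.StringIO(sample), header=None, sep="\t",
                 names=["rate", "comment"])

rates = df["rate"].tolist()    # integers parsed from the first column
comm = df["comment"].tolist()  # the full comment strings
```

With a real file, the io.StringIO(sample) argument would be replaced by the file path.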

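But don't despair!

There is a way to handle both data files in a similar manner.

Use sep='\s+' so that tabs and spaces are treated alike. pandas then splits each line into a list of whitespace-separated tokens. All you need to do now is take off the first item and reassemble the rest.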

>>> df3 = pd.read_csv('temp2.csv', engine='python', header=None, sep=r'\s+')
>>> vals3 = df3.loc[:,:].values
>>> vals3
array([[-1, 'this', 'is', 'comment1', 'blah', 'blah', 'blah', '(it', 'is',
        'a', 'big', 'paragraph)'],
       [-1, 'this', 'is', 'comment2', 'blah', 'blah', 'blah', '(it', 'is',
        'a', 'big', 'paragraph)'],
       [-1, 'this', 'is', 'comment3', 'blah', 'blah', 'blah', '(it', 'is',
        'a', 'big', 'paragraph)']], dtype=object)
>>> first = [val[0] for val in vals3]
>>> first
[-1, -1, -1]
>>> second = [' '.join(val[1:]) for val in vals3]
>>> second
['this is comment1 blah blah blah (it is a big paragraph)', 'this is comment2 blah blah blah (it is a big paragraph)', 'this is comment3 blah blah blah (it is a big paragraph)']
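Splitting on every run of whitespace and rejoining can be fragile when lines contain different numbers of words, and it collapses repeated spaces inside the text. An alternative is to split each line only once, at the first run of whitespace. The sketch below uses plain Python for the split and then builds a DataFrame; the helper name split_first_field, the sample lines, and the column names are assumptions made for illustration.

```python
import pandas as pd

def split_first_field(line):
    """Return (first token, remainder), splitting on the first whitespace run only."""
    parts = line.strip().split(None, 1)
    if not parts:            # blank line
        return "", ""
    rest = parts[1] if len(parts) > 1 else ""
    return parts[0], rest

# Hypothetical lines in the style of both files: one space-separated, one tab-separated.
lines = [
    "+ Naoki Abe",
    "-1\tthis is comment1 blah blah blah (it is a big paragraph)",
]

pairs = [split_first_field(line) for line in lines]
df = pd.DataFrame(pairs, columns=["label", "text"])
```

The label stays a string here ('+' or '-1'); for the numeric file it can be converted afterwards with int().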

My final remark: I question your use of pandas over the csv module.
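For comparison, here is a rough sketch of what the csv-module route might look like for the tab-separated file. The inline sample again stands in for the real file, and disabling quote handling with csv.QUOTE_NONE is an assumption that free-text comments should be read literally.

```python
import csv
import io

# Inline stand-in for the tab-separated file from the question.
sample = (
    "-1\tthis is comment1 blah blah blah (it is a big paragraph)\n"
    "-1\tthis is comment2 blah blah blah (it is a big paragraph)\n"
)

rates, comm = [], []
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
for row in reader:
    if not row:                  # skip blank lines
        continue
    rates.append(int(row[0]))    # first field: the numeric label
    comm.append(row[1])          # second field: the comment text
```

With a real file, open it with open(path, newline='') and pass the file object to csv.reader in place of the StringIO buffer.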