I have an example that parses a file in a similar format.
Sample data (.data):
+ Naoki Abe
- Myriam Abramson
+ David W. Aha
+ Kamal M. Ali
- Eric Allender
Here is the Python example that loads the data into a 2D array:
import pandas as pd

df = pd.read_csv(
    filepath_or_buffer='path/to/.data/file',
    header=None,
    sep=',')
# separate names from classes
vals = df.loc[:,:].values
names = [n[0][2:] for n in vals]
cls = [n[0][0] for n in vals]
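As I understand it, this code reads the data into df and extracts the string associated with each person into vals. It then splits each string in vals into names and cls, so that the i-th person's name ends up in names[i] and the associated class in cls[i].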
However, I now want to use a similar approach to parse another, similar dataset (.dat):
-1 this is comment1 blah blah blah (it is a big paragraph)
-1 this is comment2 blah blah blah (it is a big paragraph)
-1 this is comment3 blah blah blah (it is a big paragraph)
So I modified the example to:
# read in the dataset
df = pd.read_csv(
    engine='python',
    filepath_or_buffer='data/Pro1/train.dat',
    header=None,
    sep='\t+')
# separate names from classes
vals = df.loc[:,:].values
comm = [n[0][2:] for n in vals]
rates = [n[:1][0] for n in vals]
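When parsing the .dat file, I get the following error on the line comm = [n[0][2:] for n in vals]:
TypeError: 'long' object has no attribute '__getitem__'
I searched for the error message, and the explanation I found says it means I am trying to store an int into a string (?). What I am trying to store is the whole comment paragraph, which is a string. In the example, it stored the strings of names just fine.
My other question: because I have to parse a tab-delimited file, I guess -1 is followed by a tab rather than a space, and I am not sure the index ranges I set for the array are correct.
My experience: I am no Python expert, as you may have guessed. I can definitely read code, but I have to do research when writing it. Python is currently my only option for this kind of data analysis.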
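Answer 0 (score: 0)
There is no comma separator in the first file, so each line of the file yields a single string, for example '+ Naoki Abe'. You can therefore use string slicing to separate the name from the rest of the string.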
>>> import pandas as pd
>>> df = pd.read_csv('temp.csv', header=None, sep=',')
>>> vals = df.loc[:,:].values
>>> vals
array([['+ Naoki Abe'],
['- Myriam Abramson'],
['+ David W. Aha'],
['+ Kamal M. Ali'],
['- Eric Allender']], dtype=object)
>>> names = [n[0][2:] for n in vals]
>>> names
['Naoki Abe', 'Myriam Abramson', 'David W. Aha', 'Kamal M. Ali', 'Eric Allender']
>>> cls = [n[0][0] for n in vals]
>>> cls
['+', '-', '+', '+', '-']
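I also suspect that a tab character separates -1 from the rest of each line, so pandas splits each line at the tab and parses the first field as an integer. Once the tab is declared as the separator, string slicing no longer works: n[0] is the number -1 rather than a string, which is why n[0][2:] raises the TypeError.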
>>> df2 = pd.read_csv('temp2.csv', engine='python', header=None, sep='\t')
>>> vals2 = df2.loc[:,:].values
>>> vals2
array([[-1, 'this is comment1 blah blah blah (it is a big paragraph)'],
[-1, 'this is comment2 blah blah blah (it is a big paragraph)'],
[-1, 'this is comment3 blah blah blah (it is a big paragraph)']], dtype=object)
>>> first = [val[0] for val in vals2]
>>> first
[-1, -1, -1]
>>> second = [val[1] for val in vals2]
>>> second
['this is comment1 blah blah blah (it is a big paragraph)', 'this is comment2 blah blah blah (it is a big paragraph)', 'this is comment3 blah blah blah (it is a big paragraph)']
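To make the cause of the original error concrete, here is a minimal sketch that uses a hand-built row standing in for one row of vals2 (the row literal is an assumption for illustration and is not read from the actual file). Slicing the integer first field raises TypeError; the exact wording of the message differs between Python versions.

```python
# Hypothetical row, shaped like one row of vals2 above:
# an integer label followed by the comment text.
row = [-1, 'this is comment1 blah blah blah (it is a big paragraph)']

try:
    row[0][2:]          # row[0] is the integer -1, which cannot be sliced
    error = None
except TypeError as exc:
    error = exc         # this is the failure the question ran into

label = row[0]          # take the label as-is, no slicing needed
comment = row[1]        # the comment is already a separate field
```

In other words, once the separator has split the line, plain indexing (row[0], row[1]) replaces the slicing used for the first file.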
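But don't despair!
There is a way to handle both data files in a similar manner.
Use sep=r'\s+' so that tabs and spaces are treated alike. pandas then splits each line into its whitespace-separated fields, and all you need to do is pick off the first item and join the rest back together. Note that this relies on every line containing the same number of words, as in this sample.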
>>> df3 = pd.read_csv('temp2.csv', engine='python', header=None, sep=r'\s+')
>>> vals3 = df3.loc[:,:].values
>>> vals3
array([[-1, 'this', 'is', 'comment1', 'blah', 'blah', 'blah', '(it', 'is',
'a', 'big', 'paragraph)'],
[-1, 'this', 'is', 'comment2', 'blah', 'blah', 'blah', '(it', 'is',
'a', 'big', 'paragraph)'],
[-1, 'this', 'is', 'comment3', 'blah', 'blah', 'blah', '(it', 'is',
'a', 'big', 'paragraph)']], dtype=object)
>>> first = [val[0] for val in vals3]
>>> first
[-1, -1, -1]
>>> second = [' '.join(val[1:]) for val in vals3]
>>> second
['this is comment1 blah blah blah (it is a big paragraph)', 'this is comment2 blah blah blah (it is a big paragraph)', 'this is comment3 blah blah blah (it is a big paragraph)']
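If the lines do not all contain the same number of words, splitting each line only once at the first run of whitespace may be more robust. The following is a minimal sketch in plain Python rather than pandas; split_label is a hypothetical helper name, and the sample lines are copied from the question with a tab assumed after -1.

```python
def split_label(line):
    """Split a line into (label, rest) at the first run of whitespace."""
    # split(None, 1) splits on any whitespace (space or tab), at most once,
    # so the remainder of the line is kept intact as a single string.
    label, rest = line.strip().split(None, 1)
    return label, rest

# Sample lines standing in for the two files (assumed layout).
data_lines = ['+ Naoki Abe', '- Myriam Abramson', '+ David W. Aha']
dat_lines = ['-1\tthis is comment1 blah blah blah (it is a big paragraph)']

cls, names = zip(*(split_label(line) for line in data_lines))
rates, comm = zip(*(split_label(line) for line in dat_lines))
```

The labels come back as strings here, so something like int(rates[0]) would be needed if numeric values are required.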
A final remark: I question your use of pandas rather than the csv module.
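For reference, here is a minimal sketch of what the csv-module route might look like for the tab-delimited file. It assumes exactly one tab between the label and the comment, and it uses an in-memory io.StringIO sample in place of the real file so that the snippet is self-contained (Python 3).

```python
import csv
import io

# In-memory stand-in for the .dat file (assumed layout: label, tab, comment).
sample = io.StringIO(
    '-1\tthis is comment1 blah blah blah (it is a big paragraph)\n'
    '-1\tthis is comment2 blah blah blah (it is a big paragraph)\n'
)

rates = []
comm = []
# QUOTE_NONE keeps any quote characters inside a comment as literal text.
for row in csv.reader(sample, delimiter='\t', quoting=csv.QUOTE_NONE):
    rates.append(int(row[0]))   # first field is the numeric label
    comm.append(row[1])         # second field is the whole comment
```

With a real file, the StringIO object would be replaced by a file object opened with open(), and the rest should work the same way.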