Importing a txt file with two mixed columns

Time: 2017-03-12 23:49:41

Tags: numpy sentiment-analysis

I want to import a txt file that looks like this:

0 @switchfoot http://twitpic.com/2y1zl - Awww  that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D
0 is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!
0 @Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds
4 my whole body feels itchy and like its on fire 
4 @nationwideclass no  it's not behaving at all. i'm mad. why am i here? because I can't see you all over there. 
0 @Kwesidei not the whole crew 

The desired result is a numpy.array with two columns: sentiment ('0' or '4') and tw (a string). But the call below keeps giving me an error. Can anyone help?

Train_tw=np.genfromtxt("classified_tweets0.txt",dtype=(int,str),names=['sentiment','tw'])

1 answer:

Answer 0: (score: 0)

The error from your expression is

ValueError: mismatch in size of old and new data-descriptor

If I use dtype=None, I get

ValueError: Some errors were detected !
    Line #2 (got 22 columns instead of 20)
    Line #3 (got 19 columns instead of 20)
    Line #4 (got 11 columns instead of 20)
    Line #5 (got 22 columns instead of 20)
    Line #6 (got 6 columns instead of 20)

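With the default whitespace delimiter, genfromtxt splits each line into 20, 22, etc. fields. The spaces inside the tweet text are delimiters just like the first one, so every line ends up with a different number of columns.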
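To see where those column counts come from, here is a small sketch that splits the six sample lines from the question on whitespace, roughly what genfromtxt does when no delimiter is given. The name sample_lines is just for illustration; the counts should line up with the ones in the error message.

```python
# The six sample lines from the question, copied as-is.
sample_lines = [
    "0 @switchfoot http://twitpic.com/2y1zl - Awww  that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D",
    "0 is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!",
    "0 @Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds",
    "4 my whole body feels itchy and like its on fire ",
    "4 @nationwideclass no  it's not behaving at all. i'm mad. why am i here? because I can't see you all over there. ",
    "0 @Kwesidei not the whole crew ",
]

# str.split() with no argument splits on any run of whitespace,
# so every word of the tweet becomes its own field.
field_counts = [len(line.split()) for line in sample_lines]
print(field_counts)
```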

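One option is to edit the file and replace the first space with some unique delimiter. Another option is to use the fixed field-width form of delimiter, passing a list of column widths instead of a separator character. After some experimentation, this load looks reasonable (this is Py3, so I'm using a Unicode string dtype).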

In [32]: np.genfromtxt("stack42754603.txt",dtype='int,U100',delimiter=[2,100],names=['sentiment','tw'])
Out[32]: 
array([ (0, "@switchfoot http://twitpic.com/2y1zl - Awww  that's a bummer.  You shoulda got David Carr of Third D"),
       (0, "is upset that he can't update his Facebook by texting it... and might cry as a result  School today "),
       (0, '@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds\n'),
       (4, 'my whole body feels itchy and like its on fire\n'),
       (4, "@nationwideclass no  it's not behaving at all. i'm mad. why am i here? because I can't see you all o"),
       (0, '@Kwesidei not the whole crew')], 
      dtype=[('sentiment', '<i4'), ('tw', '<U100')])
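The fixed-width load above cuts each tweet at 100 characters and leaves a trailing newline on the shorter rows. As an alternative, here is a minimal sketch that does not use genfromtxt at all: it splits each line on the first run of whitespace only and builds the structured array by hand. The function name load_tweets and the in-memory sample are hypothetical, introduced only for this illustration.

```python
import io

import numpy as np


def load_tweets(lines):
    """Build a structured array from lines of the form '<label> <text>'.

    `lines` is any iterable of strings, e.g. an open file handle.
    """
    records = []
    for line in lines:
        # Split on the first run of whitespace only; the rest is the tweet.
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        text = parts[1].strip() if len(parts) > 1 else ""
        records.append((int(parts[0]), text))

    # Size the Unicode field to the longest tweet so nothing is truncated.
    width = max([len(text) for _, text in records] + [1])
    dtype = [("sentiment", "i4"), ("tw", "U%d" % width)]
    return np.array(records, dtype=dtype)


# Small in-memory sample standing in for the real file.
sample = io.StringIO(
    "0 @Kwesidei not the whole crew \n"
    "4 my whole body feels itchy and like its on fire \n"
)
train_tw = load_tweets(sample)
print(train_tw["sentiment"])
print(train_tw["tw"])
```

For the real data, the same function could be called as load_tweets(open("classified_tweets0.txt", encoding="utf-8")), assuming the file is UTF-8 encoded.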