I want to import a txt file that looks like this:
0 @switchfoot http://twitpic.com/2y1zl - Awww that's a bummer. You shoulda got David Carr of Third Day to do it. ;D
0 is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
0 @Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
4 my whole body feels itchy and like its on fire
4 @nationwideclass no it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.
0 @Kwesidei not the whole crew
The desired result is a numpy.array with two columns: sentiment ('0' or '4') and tw (a string). But it keeps giving me an error. Can anyone help?
Train_tw=np.genfromtxt("classified_tweets0.txt",dtype=(int,str),names=['sentiment','tw'])
Answer 0 (score: 0)
The error from your expression is:
ValueError: mismatch in size of old and new data-descriptor
If I use dtype=None instead, I get:
ValueError: Some errors were detected !
Line #2 (got 22 columns instead of 20)
Line #3 (got 19 columns instead of 20)
Line #4 (got 11 columns instead of 20)
Line #5 (got 22 columns instead of 20)
Line #6 (got 6 columns instead of 20)
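genfromtxt is working with the default 'white space' delimiter, so it splits each line into 20, 22, etc. fields. The spaces inside the tweet text count as delimiters, just like the first space after the label.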
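To see where those column counts come from, here is a small sketch that splits the sample lines from the question on whitespace and counts the resulting fields. It only approximates what genfromtxt does with its default delimiter, but the counts line up with the error message above.

```python
# Sample lines copied from the question's file.
lines = [
    "0 @switchfoot http://twitpic.com/2y1zl - Awww that's a bummer. You shoulda got David Carr of Third Day to do it. ;D",
    "0 is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!",
    "0 @Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds",
    "4 my whole body feels itchy and like its on fire",
    "4 @nationwideclass no it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.",
    "0 @Kwesidei not the whole crew",
]

# str.split() with no argument splits on any run of whitespace,
# so every word of the tweet becomes its own field.
counts = [len(line.split()) for line in lines]
print(counts)  # [20, 22, 19, 11, 22, 6]
```

The first line sets the expected width (20), and every line with a different word count is reported as an error.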
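One option is to edit the file and replace the first space with some unique delimiter. Another option is to use the field-width version of the delimiter, where delimiter is a list of integer column widths. After some experimentation, this load looks reasonable (this is Py3, so I'm using a Unicode string dtype):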
In [32]: np.genfromtxt("stack42754603.txt",dtype='int,U100',delimiter=[2,100],names=['sentiment','tw'])
Out[32]:
array([ (0, "@switchfoot http://twitpic.com/2y1zl - Awww that's a bummer. You shoulda got David Carr of Third D"),
(0, "is upset that he can't update his Facebook by texting it... and might cry as a result School today "),
(0, '@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds\n'),
(4, 'my whole body feels itchy and like its on fire\n'),
(4, "@nationwideclass no it's not behaving at all. i'm mad. why am i here? because I can't see you all o"),
(0, '@Kwesidei not the whole crew')],
dtype=[('sentiment', '<i4'), ('tw', '<U100')])
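Note that the fixed-width load above cuts tweets off at 100 characters and keeps the trailing newline on shorter lines. As an alternative in the spirit of the first option, here is a minimal sketch that skips genfromtxt and splits each line on its first whitespace only. The helper name load_tweets is made up for illustration, and it assumes every line starts with an integer label.

```python
import numpy as np

def load_tweets(lines):
    """Build a structured array from 'label text' lines (hypothetical helper)."""
    records = []
    for line in lines:
        line = line.strip()          # drop the trailing newline
        if not line:
            continue                 # skip blank lines
        # maxsplit=1 splits on the first run of whitespace only,
        # so the rest of the tweet stays in one piece.
        parts = line.split(maxsplit=1)
        text = parts[1] if len(parts) > 1 else ""
        records.append((int(parts[0]), text))
    # Size the string field to the longest tweet so nothing is truncated.
    width = max((len(text) for _, text in records), default=1) or 1
    dtype = [("sentiment", "i4"), ("tw", f"U{width}")]
    return np.array(records, dtype=dtype)

sample = [
    "4 my whole body feels itchy and like its on fire\n",
    "0 @Kwesidei not the whole crew\n",
]
tweets = load_tweets(sample)
print(tweets["sentiment"])
print(tweets["tw"])
```

A file object is iterable line by line, so something like `with open("classified_tweets0.txt", encoding="utf-8") as f: Train_tw = load_tweets(f)` should work on the real file; the encoding here is a guess.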