Problem with pandas

Time: 2015-03-27 16:06:45

Tags: python csv pandas

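Sorry for the vague title, but I don't know what the problem actually is... I want to load a CSV file, split it into two arrays, and run a function on each array. It works for the first array, but the second one causes a problem even though everything else is the same. I'm really stuck. The code is as follows: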

from wordutility import wordutility
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import cross_validation
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
import pandas as pd
import numpy as np

data = pd.read_csv('sts_gold_tweet.csv', header=None, delimiter=';',
               quotechar='"')

# test = pd.read_csv('output.csv', header=None,
#                   delimiter=';', quotechar='"')

split_ratio = 0.9
train = data[:round(len(data)*split_ratio)]
test = data[round(len(data)*split_ratio):]

y = data[1]

print("Cleaning and parsing tweets data...\n")

traindata = []

for i in range(0, len(train[0])):
     traindata.append(" ".join(wordutility.tweet_to_wordlist
                          (train[0][i], False)))

testdata = []

for i in range(0, len(test[0])):
    testdata.append(" ".join(wordutility.tweet_to_wordlist(test[0][i], False)))

The program runs fine up to the last line. The error is:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python3.4/site-packages/pandas/core/series.py", line 509, in __getitem__
    result = self.index.get_value(self, key)
  File "/usr/lib/python3.4/site-packages/pandas/core/index.py", line   1417, in get_value
    return self._engine.get_value(s, k)
  File "pandas/index.pyx", line 100, in pandas.index.IndexEngine.get_value (pandas/index.c:3097)
  File "pandas/index.pyx", line 108, in pandas.index.IndexEngine.get_value (pandas/index.c:2826)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3692)
  File "pandas/hashtable.pyx", line 381, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:7201)
  File "pandas/hashtable.pyx", line 387, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:7139)
KeyError: 0

(The traceback says line 2 because I tried the code in the Python shell. So line 2 refers to the last line of the code above.)

Hope someone can help me :). Thanks

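Edit

Okay, it looks like the split didn't work the way I thought it would. I do get the two arrays I want, but somehow the rows are still numbered as if they were in one file. So the array train goes from 0 to 1830 and the array test goes from 1831 to 2034... so the range is wrong... How would I split the csv file "correctly"?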

Edit 2

>>> print(train[0:5])
                                               0         1
0  the angel is going to miss the athlete this we...  negative 
1  It looks as though Shaq is getting traded to C...  negative
2     @clarianne APRIL 9TH ISN'T COMING SOON ENOUGH   negative
3  drinking a McDonalds coffee and not understand...  negative
4  So dissapointed Taylor Swift doesnt have a Twi...  negativ

>>> print(test[0:5])
                                                  0         1
1831  Why is my PSP always dead when I want to use it?   negative
1832  @hillaryrachel oh i know how you feel. i took ...  negative
1833  @daveknox awesome-  corporate housing took awa...  negative
1834  @lakersnation Is this a joke?  I can't find them   negative
1835                              XBox Live still down   negative

So you can see that the array "test" starts at row 1831. I thought it would start at 0... For now I have solved my problem by editing the range in the for loop:

for i in range(len(train[0]), len(data)):

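So my original problem is fixed; I'm just curious and eager to learn to write better code. Is this a good way to do it, or should I split the csv file differently?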

1 Answer:

Answer 0 (score: 1)

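When you do test[0], you are not getting the first index of test; rather, you are getting the column of test whose "name" is 0. When you split a pandas DataFrame into two parts, the original column names are preserved. This means that the test DataFrame has no column 0, because that column is in the first DataFrame.

Let me give an example. Say you have the following DataFrame: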

       0   1   2   3   4   5   6   7   8   9
Ind1   0   1   2   3   4   5   6   7   8   9
Ind2  10  11  12  13  14  15  16  17  18  19

After splitting, you end up with these DataFrames:

       0   1   2   3   4
Ind1   0   1   2   3   4
Ind2  10  11  12  13  14

       5   6   7   8   9
Ind1   5   6   7   8   9
Ind2  15  16  17  18  19

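Note that the columns of the second DataFrame start at 5, not 0, because those are the column names from before the split. So when you try to get column 0, it doesn't exist. That is where your error comes from.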
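The example above can be reproduced in a few lines. This is only a sketch: the variable names (df, left, right) are made up for illustration, and the split is done with iloc so that it cuts by position.

```python
import numpy as np
import pandas as pd

# Rebuild the 2 x 10 example frame: columns are labelled 0..9 by default.
df = pd.DataFrame(np.arange(20).reshape(2, 10), index=["Ind1", "Ind2"])

# Split by position into the first five and the last five columns.
left = df.iloc[:, :5]
right = df.iloc[:, 5:]

# The second half keeps its original labels; they do not restart at 0.
print(list(right.columns))

# Looking up the label 0 on the second half therefore fails.
try:
    right[0]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True
print("right[0] raised KeyError:", label_lookup_failed)

# A positional lookup still works: this is the column labelled 5.
first_by_position = right.iloc[:, 0]
print(first_by_position.tolist())
```

The same thing happens with row labels when a DataFrame is sliced by rows.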

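The simplest solution is to index by position rather than by label. So, instead of test[0], use something like test.iloc[0]. That gives you the value based on its positional index.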
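For the code in the question specifically, the split is made along rows, so as far as I can tell from the traceback it is the row labels (1831 onward) that are preserved, and test[0][i] fails on the label 0. Below is a rough sketch of three ways around that. The small DataFrame is a made-up stand-in for sts_gold_tweet.csv, and the 0.6 split ratio is only there to keep the example tiny.

```python
import pandas as pd

# Hypothetical stand-in for the tweet file: column 0 is text, column 1 the label.
data = pd.DataFrame({0: ["a b", "c d", "e f", "g h", "i j"],
                     1: ["negative"] * 5})

split = round(len(data) * 0.6)
train = data[:split]
test = data[split:]

# The row labels of test continue where train stopped.
print(list(test.index))

# Option 1: keep the labels and look rows up by position with .iloc.
by_position = [test[0].iloc[i] for i in range(len(test))]

# Option 2: renumber the rows from 0 so label lookups work again.
test_reset = test.reset_index(drop=True)
by_label = [test_reset[0][i] for i in range(len(test_reset))]

# Option 3: skip the index entirely and iterate over the column's values.
direct = list(test[0])

print(by_position, by_label, direct)
```

Option 3 is probably the least error-prone for a loop like the one in the question, since it never touches the index; reset_index(drop=True) is handy if later code expects labels starting at 0.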