Python: getting a TypeError when calling a function

Asked: 2016-04-17 03:50:15

Tags: python pandas dataframe nltk data-analysis

I have a text file that I convert into a dataframe using the following command:

 df = pd.read_csv("C:\\Users\\Sriram\\Desktop\\New folder (4)\\aclImdb\\test\\result.txt", sep = '\t', 
             names=['reviews','polarity'])

Here the reviews column contains all the movie reviews, and the polarity column says whether each review is positive or negative.

I have the following function, to which I need to pass the reviews column from the dataframe (nearly 1000 reviews):

def find_features(document):
    words = word_tokenize(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features
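
On a single review string the function behaves as expected; a quick check (word_features here is just a stand-in vocabulary list for illustration, the real one is built elsewhere in my script):

from nltk.tokenize import word_tokenize   # used inside find_features

word_features = ['poor', 'excuse', 'great']   # stand-in vocabulary
print(find_features("This is a poor excuse for a movie"))
# -> {'poor': True, 'excuse': True, 'great': False}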

I am creating the training dataset with the following call:

trainsets = [find_features(df.reviews), df.polarity]

So, by doing this, every word in my reviews column would be split by the tokenize call inside find_features, and each word would be paired with the review's polarity (positive or negative).

For example:

        reviews                           polarity
  This is a poor excuse for a movie        negative

For the case above, after calling the find_features function, whenever the condition inside the function is satisfied I would get output like:

  poor    -  negative
  excuse  -  negative

and so on...

When I try to call this function, I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-79-76f9090c0532> in <module>()
     30     return features
     31 
---> 32 featuresets = [find_features(df.reviews), df.polarity]
     33 #featuresets = [(find_features(rev), category) for ((rev, category)) in reviews]
     34 '''

<ipython-input-79-76f9090c0532> in find_features(document)
     24 
     25 def find_features(document):
---> 26     words = word_tokenize(document)
     27     features = {}
     28     for w in word_features:

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py in word_tokenize(text, language)
    102     :param language: the model name in the Punkt corpus
    103     """
--> 104     return [token for sent in sent_tokenize(text, language)
    105             for token in _treebank_word_tokenize(sent)]
    106 

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py in sent_tokenize(text, language)
     87     """
     88     tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
---> 89     return tokenizer.tokenize(text)
     90 
     91 # Standard word tokenizer.

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in tokenize(self, text, realign_boundaries)
   1224         Given a text, returns a list of the sentences in that text.
   1225         """
-> 1226         return list(self.sentences_from_text(text, realign_boundaries))
   1227 
   1228     def debug_decisions(self, text):

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in sentences_from_text(self, text, realign_boundaries)
   1272         follows the period.
   1273         """
-> 1274         return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
   1275 
   1276     def _slices_from_text(self, text):

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in span_tokenize(self, text, realign_boundaries)
   1263         if realign_boundaries:
   1264             slices = self._realign_boundaries(text, slices)
-> 1265         return [(sl.start, sl.stop) for sl in slices]
   1266 
   1267     def sentences_from_text(self, text, realign_boundaries=True):

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in <listcomp>(.0)
   1263         if realign_boundaries:
   1264             slices = self._realign_boundaries(text, slices)
-> 1265         return [(sl.start, sl.stop) for sl in slices]
   1266 
   1267     def sentences_from_text(self, text, realign_boundaries=True):

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _realign_boundaries(self, text, slices)
   1302         """
   1303         realign = 0
-> 1304         for sl1, sl2 in _pair_iter(slices):
   1305             sl1 = slice(sl1.start + realign, sl1.stop)
   1306             if not sl2:

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _pair_iter(it)
    308     """
    309     it = iter(it)
--> 310     prev = next(it)
    311     for el in it:
    312         yield (prev, el)

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _slices_from_text(self, text)
   1276     def _slices_from_text(self, text):
   1277         last_break = 0
-> 1278         for match in self._lang_vars.period_context_re().finditer(text):
   1279             context = match.group() + match.group('after_tok')
   1280             if self.text_contains_sentbreak(context):

TypeError: expected string or bytes-like object

How can I call the function directly on a dataframe column that holds many row values (in my case, the reviews column)?

1 Answer:

Answer 0 (score: 0)

Going by the expected output you mentioned (poor - negative, excuse - negative, and so on): the TypeError happens because word_tokenize expects a single string, but find_features(df.reviews) hands it the entire reviews column, i.e. a pandas Series.
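
A minimal reproduction of the error (a sketch; it assumes nltk with its punkt tokenizer data is installed):

import pandas as pd
from nltk.tokenize import word_tokenize

s = pd.Series(['a short review', 'another review'])
word_tokenize(s)   # raises TypeError: expected string or bytes-like object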

Applying the function row-wise avoids this; I would suggest:

trainsets = df.apply(lambda row: ([(kw, row.polarity) for kw in find_features(row.reviews)]), axis=1)

Adding a sample code snippet for reference:

import pandas as pd
from io import StringIO   # Python 3; on Python 2 use `from StringIO import StringIO`

print('pandas-version: ', pd.__version__)
data_str = """
col1,col2
'leopard lion tiger','non-veg'
'buffalo antelope elephant','veg'
'dog cat crow','all'
"""
data_str = StringIO(data_str)
# a dataframe with 2 columns; quotechar="'" strips the single quotes around the values
df = pd.read_csv(data_str, quotechar="'")

# a dummy function that takes a col1 value from each row,
# splits it into multiple values and returns a list
def my_fn(row_val):
    return row_val.split(' ')

# calling a row-wise apply (vector operation) on the dataframe
train_set = df.apply(lambda row: ([(kw, row.col2) for kw in my_fn(row.col1)]), axis=1)
print(train_set)
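
If the goal is instead the NLTK-classifier format of (feature_dict, label) pairs, which the commented-out line 33 in your traceback hints at, a variant sketch using the question's column names:

# sketch: one (feature_dict, polarity) pair per review, reusing the
# question's find_features on the original reviews/polarity dataframe
featuresets = [(find_features(rev), pol) for rev, pol in zip(df.reviews, df.polarity)]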

@SriramChandramouli, hope I understood your requirement correctly.