Bag-of-words approach: splitting messages into individual words

Time: 2019-01-07 19:41:26

Tags: python textblob

I am trying to split each message into individual words, that is, to tokenize the messages.

def split_into_tokens(message):
    message = unicode(message, 'utf8')  # convert bytes into proper unicode
    return TextBlob(message).words

messages.message.head().apply(split_into_tokens)

This raises NameError: name 'unicode' is not defined

  <ipython-input-16-98e123c365b4> in <module>()
----> 1 messages.title.head().apply(split_into_tokens)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   3192             else:
   3193                 values = self.astype(object).values
-> 3194                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   3195 
   3196         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()

<ipython-input-14-281c1d080655> in split_into_tokens(title)
      1 def split_into_tokens(title):
----> 2     title = unicode(title, utf8)  # convert bytes into proper unicode
      3     return TextBlob(title).words

NameError: name 'unicode' is not defined

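It ends by reporting that unicode is not defined. I tried changing the Python version, but I still get the same error. Do I need to replace unicode with str in the Python plugin directory?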

1 Answer:

Answer 0: (score: 0)

I assume you are using Python 3. Python 3 has no unicode built-in, because its str type is already Unicode, which is why the call raises NameError. So just try removing the line message = unicode(message, 'utf8'): your message variable is probably already a unicode string. If it is not, it is probably a bytes object, in which case the correct way to convert it to a unicode string under Python 3 is message.decode('utf8'). For more information, see https://docs.python.org/3/howto/unicode.html
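
The advice above can be sketched as a small helper. This is only an illustration under stated assumptions: to_text is a hypothetical name, not part of TextBlob or pandas, and the default encoding is assumed to be UTF-8 as in the question. It decodes bytes and passes str values through unchanged, so it works in either case described in the answer.

```python
def to_text(message, encoding="utf8"):
    """Return message as a Python 3 str.

    Hypothetical helper for illustration: bytes are decoded with the
    given encoding (assumed UTF-8, as in the question); str values are
    returned unchanged because Python 3 str is already Unicode.
    """
    if isinstance(message, bytes):
        return message.decode(encoding)
    return message


print(to_text(b"hello world"))   # bytes input is decoded to str
print(to_text("already text"))   # str input passes through untouched
```

With TextBlob installed, the body of split_into_tokens from the question could then return TextBlob(to_text(message)).words, without any call to unicode.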