TypeError: a bytes-like object is required, not 'str' for pd.read_csv

Asked: 2019-04-22 02:43:52

Tags: python pandas numpy youtube-api

I am trying to run the code from this site: https://datanice.wordpress.com/2015/09/09/sentiment-analysis-for-youtube-channels-with-nltk/

The code that raises the error is:

import nltk
from nltk.probability import *
from nltk.corpus import stopwords
import pandas as pd

all = pd.read_csv("comments.csv")

stop_eng = stopwords.words('english')
customstopwords =[]

tokens = []
sentences = []
tokenizedSentences =[]
for txt in all.text:
    sentences.append(txt.lower())
    tokenized = [t.lower().encode('utf-8').strip(":,.!?") for t in txt.split()]
    tokens.extend(tokenized)
    tokenizedSentences.append(tokenized)

hashtags = [w for w in tokens if w.startswith('#')]
ghashtags = [w for w in tokens if w.startswith('+')]
mentions = [w for w in tokens if w.startswith('@')]
links = [w for w in tokens if w.startswith('http') or w.startswith('www')]
filtered_tokens = [w for w in tokens if not w in stop_eng and not w in customstopwords and w.isalpha() and not len(w)<3 and not w in hashtags and not w in ghashtags and not w in links and not w in mentions]

fd = FreqDist(filtered_tokens)

This gives me the following error:

tokenized = [t.lower().encode('utf-8').strip(":,.!?") for t in txt.split()]
TypeError: a bytes-like object is required, not 'str'

I am producing the csv with the following code:

commentDataCsv = pd.DataFrame.from_dict(callFunction).to_csv("comments4.csv", encoding='utf-8')

I replaced every pd.read_json("comments.csv") call in the original code with read_csv.

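1 answer:

Answer 0: (score: 0)

In Python 3 the default string type (str) is Unicode. encode converts it to a bytestring (bytes). To apply strip to a bytestring, you have to pass an argument of the matching type, that is, bytes rather than str: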

In [378]: u'one'.encode('utf-8')                                                     
Out[378]: b'one'
In [379]: 'one'.encode('utf-8').strip(':')                                           
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-379-98728e474af8> in <module>
----> 1 'one'.encode('utf-8').strip(':')

TypeError: a bytes-like object is required, not 'str'

In [381]: 'one:'.encode('utf-8').strip(b':')                                         
Out[381]: b'one'

If you don't encode first, you can keep using ordinary (Unicode) str arguments:

In [382]: 'one:'.strip(':')                                                          
Out[382]: 'one'

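I suggest going this route; otherwise every string literal in the rest of your code (the startswith('#') checks, the stopword comparisons, and so on) would need the b prefix as well.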
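
As a rough sketch of what that looks like in the question's loop, here is the tokenizing step with the encode call dropped. The comments list is made-up sample data standing in for the text column of comments.csv (an assumption, so the snippet runs without pandas or the csv), and the snake_case variable names are mine; the rest follows the question's code.

```python
# Hypothetical sample data standing in for all.text from comments.csv.
comments = [
    "Great video!",
    "Check http://example.com now",
    "#python rocks, @bob",
]

tokens = []
tokenized_sentences = []
for txt in comments:
    # Stay in str: without .encode('utf-8'), strip() accepts a str argument.
    tokenized = [t.lower().strip(":,.!?") for t in txt.split()]
    tokens.extend(tokenized)
    tokenized_sentences.append(tokenized)

# These checks also take plain str arguments, so no b'...' literals are needed.
hashtags = [w for w in tokens if w.startswith('#')]
mentions = [w for w in tokens if w.startswith('@')]
links = [w for w in tokens if w.startswith(('http', 'www'))]
```

Since the tokens stay as str, comparisons against stopwords.words('english') (which are str too) should keep working without further changes.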