Question

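I imported a dataset (.csv) with pandas. The first column contains the tweets; I renamed it and, as usual, converted it to a NumPy array with .values. I then started preprocessing with NLTK, which has worked almost every time except on this dataset. It raises TypeError: expected string or bytes-like object, and I can't figure out why. The text contains some odd things, but it is far from the worst I have seen. Can anyone help?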

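import re

import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

data = pd.read_csv("facebook.csv")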
text = data["Anonymized Message"].values

X = []
for i in range(0, len(text)):
    tweet = re.sub("[^a-zA-Z]", " ", text[i])
    tweet = tweet.lower()
    tweet = tweet.split()
    ps = PorterStemmer()
    tweet = [ps.stem(word) for word in tweet if not word in set(stopwords.words('english'))]
    tweet = ' '.join(tweet)
    X.append(tweet)

It gives me this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-52-a08c1779c787> in <module>()
      1 text_train = []
      2 for i in range(0, len(text)):
----> 3     tweet = re.sub("[^a-zA-Z]", " ", text[i])
      4     tweet = tweet.lower()
      5     tweet = tweet.split()

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/re.py in sub(pattern, repl, string, count, flags)
    189     a callable, it's passed the match object and must return
    190     a replacement string to be used."""
--> 191     return _compile(pattern, flags).sub(repl, string, count)
    192 
    193 def subn(pattern, repl, string, count=0, flags=0):

TypeError: expected string or bytes-like object

Here is the dataset: http://wwbp.org/downloads/public_data/dataset-fb-valence-arousal-anon.csv
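
One thing worth checking, sketched here under an assumption that has not been verified against this dataset: re.sub raises exactly this TypeError whenever it receives something that is not a str or bytes object, and pandas reads empty CSV cells as float NaN. If the message column has any blank rows, text[i] would be a float for those rows. The snippet below uses only the standard library; the list `sample` and the helpers `find_non_strings` and `clean` are hypothetical stand-ins for illustration, not part of the original code.

```python
import re


def find_non_strings(values):
    """Return (index, value) pairs for entries that are not str."""
    return [(i, v) for i, v in enumerate(values) if not isinstance(v, str)]


def clean(value):
    """Keep letters only and lowercase; treat non-string entries as empty text."""
    if not isinstance(value, str):
        return ""
    return re.sub("[^a-zA-Z]", " ", value).lower()


# Hypothetical stand-in for data["Anonymized Message"].values,
# with a float NaN where a CSV cell was empty (assumption).
sample = ["Hello World!", float("nan"), "2 good 2 be true"]

# Locate the entries that would make re.sub fail.
bad = find_non_strings(sample)

# Clean every entry without raising on the non-string ones.
cleaned = [clean(v) for v in sample]
```

If `find_non_strings(text)` returns anything for the real array, those rows are the likely culprits. In pandas itself, `data["Anonymized Message"].isna().sum()` counts missing cells, and `data.dropna(subset=["Anonymized Message"])` drops them before calling .values.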

NLTK gives error: expected string or bytes-like object

0 answers: