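I imported a dataset (.csv) with pandas. The first column holds the tweets; I renamed it and, as usual, converted it to a NumPy array with .values. Then I started preprocessing with NLTK, which has worked almost every time except on this dataset. It gives me the error TypeError: expected string or bytes-like object, and I can't figure out why. The text contains some odd things, but it is far from the worst I have seen. Can anyone help?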
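import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

data = pd.read_csv("facebook.csv")
text = data["Anonymized Message"].values
X = []
for i in range(0, len(text)):
    # Replace every non-letter with a space (this is the line that raises the TypeError)
    tweet = re.sub("[^a-zA-Z]", " ", text[i])
    tweet = tweet.lower()
    tweet = tweet.split()
    ps = PorterStemmer()
    # Drop English stopwords and stem the remaining words
    tweet = [ps.stem(word) for word in tweet if not word in set(stopwords.words('english'))]
    tweet = ' '.join(tweet)
    X.append(tweet)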
This gives me the following error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-52-a08c1779c787> in <module>()
      1 text_train = []
      2 for i in range(0, len(text)):
----> 3     tweet = re.sub("[^a-zA-Z]", " ", text[i])
      4     tweet = tweet.lower()
      5     tweet = tweet.split()

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/re.py in sub(pattern, repl, string, count, flags)
    189     a callable, it's passed the match object and must return
    190     a replacement string to be used."""
--> 191     return _compile(pattern, flags).sub(repl, string, count)
    192
    193 def subn(pattern, repl, string, count=0, flags=0):

TypeError: expected string or bytes-like object
Here is the dataset: http://wwbp.org/downloads/public_data/dataset-fb-valence-arousal-anon.csv
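A minimal sketch for narrowing this down, under the assumption (not verified against this dataset) that some entries in the column are not strings. pandas stores an empty CSV cell as a float NaN, and re.sub rejects anything that is not a string or bytes-like object. The helper find_non_strings and the sample list are hypothetical; the list stands in for data["Anonymized Message"].values.

```python
import re

def find_non_strings(values):
    """Return (index, value, type name) for every entry that is not a str."""
    return [
        (i, v, type(v).__name__)
        for i, v in enumerate(values)
        if not isinstance(v, str)
    ]

# Hypothetical sample standing in for data["Anonymized Message"].values.
# float("nan") mimics what pandas stores for an empty CSV cell.
sample = ["Great day today!", float("nan"), "so tired :("]

# Step 1: list the offending rows and their types.
bad = find_non_strings(sample)
print(bad)

# Step 2: confirm that re.sub fails on such an entry with a TypeError.
try:
    re.sub("[^a-zA-Z]", " ", sample[1])
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# Step 3: one possible workaround is to skip non-string entries
# before applying the same cleaning steps as in the loop above.
cleaned = [
    re.sub("[^a-zA-Z]", " ", s).lower().split()
    for s in sample
    if isinstance(s, str)
]
print(cleaned)
```

If the check does turn up NaN entries in the real column, dropping them with data["Anonymized Message"].dropna() or converting with .astype(str) before taking .values are two possible ways to handle it; which one fits depends on whether the empty rows need to stay aligned with the labels.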