I am trying to do some simple processing of some tweet data, with the goal of counting the most frequent words in the dataset.
However, I keep getting the following error at line 45:
IndexError                                Traceback (most recent call last)
<ipython-input-346-f03e745247f4> in <module>()
     43 for line in f:
     44     parts = re.split("^\d+\s", line)
---> 45     tweet = re.split("\s(Status)", parts[1])[0]
     46     tweet = tweet.replace("\\n"," ")
     47     terms_all = [term for term in process_tweet(tweet)]

IndexError: list index out of range
I have added the full code below for review; can anyone advise?
import codecs
import re
from collections import Counter
from nltk.corpus import stopwords
word_counter = Counter()
def punctuation_symbols():
    return [".", "\"", "$", "%", "&", ";", ":", "-", "&", "?"]
def is_rt_marker(word):
    if word == "b\"rt" or word == "b'rt" or word == "rt":
        return True
    return False
def strip_quotes(word):
    if word.endswith("\""):
        word = word[0:-1]
    if word.startswith("\""):
        word = word[1:]
    return word
def process_tweet(tweet):
    keep = []
    for word in tweet.split(" "):
        word = word.lower()
        word = strip_quotes(word)
        if len(word) == 0:
            continue
        if word.startswith("https"):
            continue
        if word in stopwords.words('english'):
            continue
        if word in punctuation_symbols():
            continue
        if is_rt_marker(word):
            continue
        keep.append(word)
    return keep
with codecs.open("C:\\Users\\XXXXX\\Desktop\\USA_TWEETS-out.csv", "r", encoding="utf-8") as f:
    n = 0
    for line in f:
        parts = re.split("^\d+\s", line)
        tweet = re.split("\s(Status)", parts[1])[0]
        tweet = tweet.replace("\\n", " ")
        terms_all = [term for term in process_tweet(tweet)]
        word_counter.update(terms_all)
        n += 1
        if n == 50:
            break

print(word_counter.most_common(10))
Answer 0 (score: -1)
parts = re.split("^\d+\s", line)
tweet = re.split("\s(Status)", parts[1])[0]
These are probably the problematic lines. You are assuming that parts has actually been split and contains more than one element. But the split pattern may not be found in line, in which case parts is simply equal to [line], a one-element list, and parts[1] then crashes with an IndexError.
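To illustrate (a minimal sketch using made-up sample lines, not the actual dataset): when the ^\d+\s pattern matches, re.split returns an empty string followed by the remainder of the line, but on a line without a leading number it returns the whole line as a single element:

```python
import re

# A line in the expected format: a leading row number, then the tweet text
parts = re.split(r"^\d+\s", "17 some tweet text Status details")
print(parts)   # ['', 'some tweet text Status details'] -> parts[1] is safe

# A malformed line with no leading number: the pattern never matches,
# so nothing is split off and the list has only one element
parts = re.split(r"^\d+\s", "a header or otherwise malformed line")
print(parts)   # ['a header or otherwise malformed line'] -> parts[1] raises IndexError
```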
Add a check before the second line, and print the value of line to get a better idea of what is actually in your file.
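One way to add that check (a sketch only; extract_tweet is a hypothetical helper name, and the sample lines are made up). It skips lines that do not match the expected "rownumber tweet ... Status" shape instead of crashing:

```python
import re

def extract_tweet(line):
    """Return the tweet text from a 'rownumber tweet ... Status' line, or None."""
    parts = re.split(r"^\d+\s", line)
    if len(parts) < 2:
        return None  # pattern did not match: malformed or unexpected line
    pieces = re.split(r"\s(Status)", parts[1])
    return pieces[0].replace("\\n", " ")

for line in ['12 b"rt hello world Status details', "no leading number here"]:
    tweet = extract_tweet(line)
    if tweet is None:
        print("skipping malformed line:", repr(line))
    else:
        print("tweet:", tweet)
```

Printing the offending line before skipping it (as above) also tells you whether the bad rows are headers, blank lines, or tweets in a format you did not anticipate.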