I am trying to do some simple processing of some tweet data, with the goal of counting the most frequent words in the dataset.
However, I keep getting the following error at line 45:
IndexError                                Traceback (most recent call last)
<ipython-input-346-f03e745247f4> in <module>()
     43 for line in f:
     44     parts = re.split("^\d+\s", line)
---> 45     tweet = re.split("\s(Status)", parts[1])[0]
     46     tweet = tweet.replace("\\n"," ")
     47     terms_all = [term for term in process_tweet(tweet)]

IndexError: list index out of range
I have added the full code below for review; can anyone advise?
import codecs
import re
from collections import Counter
from nltk.corpus import stopwords
word_counter = Counter()
def punctuation_symbols():
    return [".", "\"", "$", "%", "&", ";", ":", "-", "&", "?"]
def is_rt_marker(word):
    if word == "b\"rt" or word == "b'rt" or word == "rt":
        return True
    return False
def strip_quotes(word):
    if word.endswith("\""):
        word = word[0:-1]
    if word.startswith("\""):
        word = word[1:]
    return word
def process_tweet(tweet):
    keep = []
    for word in tweet.split(" "):
        word = word.lower()
        word = strip_quotes(word)
        if len(word) == 0:
            continue
        if word.startswith("https"):
            continue
        if word in stopwords.words('english'):
            continue
        if word in punctuation_symbols():
            continue
        if is_rt_marker(word):
            continue
        keep.append(word)
    return keep
with codecs.open("C:\\Users\\XXXXX\\Desktop\\USA_TWEETS-out.csv", "r", encoding="utf-8") as f:
    n = 0
    for line in f:
        parts = re.split("^\d+\s", line)
        tweet = re.split("\s(Status)", parts[1])[0]
        tweet = tweet.replace("\\n", " ")
        terms_all = [term for term in process_tweet(tweet)]
        word_counter.update(terms_all)
        n += 1
        if n == 50:
            break

print(word_counter.most_common(10))
Answer 0 (score: -1)
parts = re.split("^\d+\s", line)
tweet = re.split("\s(Status)", parts[1])[0]
These are probably the problematic lines. You are assuming that parts has actually been split and contains more than one element. But the split pattern may not be found in line, in which case parts is simply equal to [line], a one-element list, and parts[1] then crashes with an IndexError.
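To illustrate (a minimal sketch using made-up sample lines, not the actual dataset): when the ^\d+\s pattern matches, re.split returns an empty string followed by the remainder of the line, but on a line without a leading number it returns the whole line as a single element:

```python
import re

# A line in the expected format: a leading row number, then the tweet text
parts = re.split(r"^\d+\s", "17 some tweet text Status details")
print(parts)   # ['', 'some tweet text Status details'] -> parts[1] is safe

# A malformed line with no leading number: the pattern never matches,
# so nothing is split off and the list has only one element
parts = re.split(r"^\d+\s", "a header or otherwise malformed line")
print(parts)   # ['a header or otherwise malformed line'] -> parts[1] raises IndexError
```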
Add a check before the second line, and print the value of line to get a better idea of what is actually in your file.
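One way to add that check (a sketch only; extract_tweet is a hypothetical helper name, and the sample lines are made up). It skips lines that do not match the expected "rownumber tweet ... Status" shape instead of crashing:

```python
import re

def extract_tweet(line):
    """Return the tweet text from a 'rownumber tweet ... Status' line, or None."""
    parts = re.split(r"^\d+\s", line)
    if len(parts) < 2:
        return None  # pattern did not match: malformed or unexpected line
    pieces = re.split(r"\s(Status)", parts[1])
    return pieces[0].replace("\\n", " ")

for line in ['12 b"rt hello world Status details', "no leading number here"]:
    tweet = extract_tweet(line)
    if tweet is None:
        print("skipping malformed line:", repr(line))
    else:
        print("tweet:", tweet)
```

Printing the offending line before skipping it (as above) also tells you whether the bad rows are headers, blank lines, or tweets in a format you did not anticipate.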