Question

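I want to remove the words tagged with the part-of-speech tags VBD and VBN from my CSV file. However, when I run the following code, I get the error "IndexError: list index out of range":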

for word in POS_tag_text_clean:
    if word[1] !='VBD' and word[1] !='VBN':
        words.append(word[0])

My CSV file contains 10 comments from 10 people, and the column is named Comment.

Here is my full code:

df_Comment = pd.read_csv("myfile.csv")

def clean(text):
    stop = set(stopwords.words('english'))
    exclude = set(string.punctuation)
    lemma = WordNetLemmatizer()
    tagged = nltk.pos_tag(text)

    text = text.rstrip()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    stop_free = " ".join([i for i in text.lower().split() if((i not in stop) and (not i.isdigit()))])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

text_clean = []
for text in df_Comment['Comment']:
    text_clean.append(clean(text).split())
print(text_clean) 

POS_tag_text_clean = [nltk.pos_tag(t) for t in text_clean]
print(POS_tag_text_clean)


words=[]
for word in POS_tag_text_clean:
    if word[1] !='VBD' and word[1] !='VBN':
       words.append(word[0])

How can I fix this error?

Answer 1

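Without sample input and the corresponding output it is a bit hard to understand your problem, but this may be what is happening:

Assuming text is a string, text_clean will be a list of lists of strings, where each string represents one word. After part-of-speech tagging, POS_tag_text_clean will be a list of lists of tuples, each tuple holding a word and its tag.

If I am right, your last loop iterates over the items of the data frame (one per comment), not over words as the variable name suggests. Inside that loop, word[1] is therefore the second (word, tag) tuple of a comment, not a tag. If an item contains only one word or none (which is quite possible, since you filter out a lot in clean()), the call to word[1] fails with an error like the one you report.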
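
To make the structure concrete, here is a minimal sketch. The tags are written by hand as a stand-in for the output of nltk.pos_tag, and sample_tagged and its contents are made up for illustration, not taken from your CSV:

```python
# Hypothetical stand-in for POS_tag_text_clean: one inner list per comment,
# each holding (word, tag) tuples. Tags are hand-written, not from NLTK.
sample_tagged = [
    [("food", "NN"), ("arrived", "VBD"), ("cold", "JJ")],
    [("great", "JJ")],  # a comment reduced to a single token by cleaning
]

first_item = sample_tagged[0]
print(first_item[1])           # the second (word, tag) tuple, not a tag
print(first_item[1] != "VBD")  # True: a tuple never equals a string

try:
    sample_tagged[1][1]        # the single-token comment has no index 1
except IndexError as exc:
    error_message = str(exc)
    print("IndexError:", error_message)
```

So even when no exception is raised, the comparison in the original loop is always true and nothing gets filtered.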

Instead, try the following code:

words = []
for item in POS_tag_text_clean:      # one item per comment
    words_in_item = []
    for word in item:                # each word is a (token, tag) tuple
        if word[1] != 'VBD' and word[1] != 'VBN':
            words_in_item.append(word[0])
    words.append(words_in_item)
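
If you prefer something more compact, the same filtering can be sketched with a nested list comprehension. The names drop_tags, EXCLUDED_TAGS and sample_tagged are my own, and the sample data is again hand-written rather than produced by nltk.pos_tag:

```python
EXCLUDED_TAGS = {"VBD", "VBN"}  # the tags the question wants to drop


def drop_tags(tagged_items, excluded=EXCLUDED_TAGS):
    """Return, for each item, the words whose tag is not in `excluded`."""
    return [
        [word for word, tag in item if tag not in excluded]
        for item in tagged_items
    ]


# Hypothetical tagged comments, including a one-word and an empty one.
sample_tagged = [
    [("food", "NN"), ("arrived", "VBD"), ("cold", "JJ")],
    [("great", "JJ")],
    [],
]

filtered = drop_tags(sample_tagged)
print(filtered)  # [['food', 'cold'], ['great'], []]
```

Unpacking each tuple as word, tag avoids indexing altogether, so short or empty comments cannot raise an IndexError here.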
