Question

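I am trying to extract only a chosen list of words from a pandas column that contains paragraphs and, where those words are present, create a new column holding just those words (the list of indicators). When I apply my custom function, I keep getting random batches of letters. Here is my attempt at the function, which does not work: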

indicators = "|".join(("banana tree", "climate change", "warming", "dinosaurs"))

def indication_find(x):
    for words in x:
         if words in indicators:
            return words
         else:
            pass

df["indicators"] = df["text"].apply(indication_find)

The input will be a few sentences written by students, and the output should be only the words that match my list.

Answer 1

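You need to make a few changes to your code. indicators should be a list of strings. What you have is one big string, so "words in indicators" is a substring test, and because "for words in x" iterates over the paragraph string, it yields single characters rather than words. So do this: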

indicators = ["banana tree", "climate change", "warming", "dinosaurs"]
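To see where the stray letters come from, here is a small sketch that reproduces the behaviour of the original function on one made-up sentence. The sample sentence and the variable names (joined, sample, first_hit) are assumptions for illustration only; they are not from the question.

```python
# The question's original definition: one big string, not a list.
joined = "|".join(("banana tree", "climate change", "warming", "dinosaurs"))

# Hypothetical sample paragraph (not from the question).
sample = "warming is real"

# Iterating over a string yields one character at a time.
chars = list(sample)

# `ch in joined` is a substring test, so the first character that occurs
# anywhere in the joined string is what the original function returns.
first_hit = next(ch for ch in sample if ch in joined)

print(chars[:4])
print(first_hit)
```

With a list of strings, "in" becomes a membership test against whole entries instead of a substring search.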

In the custom function, x is a string containing the whole paragraph, so you need to split it on whitespace to get a list of words.

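def indication_find(x):
    list_of_words = x.split()  # split the paragraph on whitespace
    out_data = []  # initialize an empty list
    for word in list_of_words:
        # strip surrounding punctuation before comparing
        cleaned = word.strip('.,!?')
        if cleaned in indicators:
            out_data.append(cleaned)
    # note: word-by-word matching cannot find two-word indicators such as "banana tree"
    return str(out_data)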

df["indicators"] = df["text"].apply(indication_find)
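Because splitting on spaces only ever compares single words, multi-word indicators such as "banana tree" and "climate change" will not match. One possible alternative, sketched here as an assumption rather than as part of the original answer, is to search each paragraph with a regular expression built from the list. The names pattern and find_indicators are hypothetical, and the case-insensitive matching is a design choice, not something the question asked for.

```python
import re

indicators = ["banana tree", "climate change", "warming", "dinosaurs"]

# Build one alternation pattern; re.escape guards against special characters,
# and \b keeps matches on word boundaries.
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in indicators) + r")\b",
    flags=re.IGNORECASE,  # assumption: matching should ignore case
)

def find_indicators(text):
    """Return every indicator phrase found in text, in order of appearance."""
    return pattern.findall(text)

# Hypothetical sample sentences (not from the question).
print(find_indicators("The banana tree suffers from climate change."))
print(find_indicators("Nothing relevant here."))
```

It can be applied the same way as the function above, for example df["indicators"] = df["text"].apply(find_indicators), which stores an actual list in each row rather than its string representation.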

Create a pandas column containing a list of words from an instance's text feature
