Question

Hello, I am trying to convert each entry of "Chat" into tokens. It is a column of length 1000 in my pandas DataFrame.

text=df["Chat"]
words=text.split()
tokens=word_tokenize(text)
tokens=[i.lower() for i in words]
table=str.maketrans("","",string.punctuation)
stripped=[i.translate(table) for i in tokens]
words=[words for words in stripped if words.isalpha()]
stop_words = set(stopwords.words('english'))
words=[w for w in words if not w in stop_words]
print(words)

I get the following error message: "AttributeError: 'Series' object has no attribute 'split'".

But when I pick out a single element with iloc, it works fine.

text=df["Chat"].iloc[0]
words=text.split()
tokens=word_tokenize(text)
tokens=[i.lower() for i in words]
table=str.maketrans("","",string.punctuation)
stripped=[i.translate(table) for i in tokens]
words=[words for words in stripped if words.isalpha()]
stop_words = set(stopwords.words('english'))
words=[w for w in words if not w in stop_words]
print(words)

This works fine and the output is what I want, i.e. a list of tokens. I want to convert all of the chat entries into tokens.

Answer 1

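Your data frame is called df, and it is a DataFrame object.

When you write df["Chat"], you are indexing into it and getting back the pandas Series "Chat".

You then call the Python method .split() on it, but a pandas Series has no such attribute, so you get an AttributeError.

.split() is a method of Python strings.

When you write df["Chat"].iloc[0], you take the DataFrame, index into the pandas Series Chat, and then index to its first value. That value is a plain string, so .split() works.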
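To make the difference concrete, here is a minimal sketch. The two-row frame is made-up sample data standing in for the real one; only the column name "Chat" comes from the question.

```python
import pandas as pd

# Hypothetical sample data; the real frame has 1000 rows.
df = pd.DataFrame({"Chat": ["Hello there, how are you?", "I am fine!"]})

column = df["Chat"]         # the whole column: a pandas Series
first = df["Chat"].iloc[0]  # one cell: a plain Python str

print(type(column).__name__)     # Series
print(type(first).__name__)      # str
print(hasattr(column, "split"))  # False, hence the AttributeError
print(hasattr(first, "split"))   # True
```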

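Option 1:

If you want to apply a function to every cell of a pandas Series, you can use .apply(), optionally with a lambda.

Here is the documentation for .apply(): https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html

So you should be able to do df["Chat"].apply(str.split), or pass your own function that does the whole tokenizing step for one string.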
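As a rough sketch of Option 1, the per-string steps from the question can be wrapped in one function and passed to .apply(). The function name tokenize, the sample data, and the tiny STOP_WORDS set are assumptions made for illustration. The question uses nltk's stopwords.words('english'), which could be swapped in here.

```python
import string

import pandas as pd

# Hypothetical, deliberately tiny stop-word set (assumption for this sketch);
# the question uses set(stopwords.words('english')) from nltk instead.
STOP_WORDS = {"i", "am", "are", "you", "how", "the", "a"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def tokenize(text):
    """Lowercase, strip punctuation, keep alphabetic non-stop words."""
    words = text.lower().split()
    stripped = [w.translate(PUNCT_TABLE) for w in words]
    return [w for w in stripped if w.isalpha() and w not in STOP_WORDS]


# Made-up sample data standing in for the real 1000-row column.
df = pd.DataFrame({"Chat": ["Hello there, how are you?", "I am fine!"]})

# .apply() calls tokenize once per cell and collects the results in a new Series.
df["Tokens"] = df["Chat"].apply(tokenize)
print(df["Tokens"].tolist())  # [['hello', 'there'], ['fine']]
```

Each cell of the new "Tokens" column holds a list of tokens for the corresponding chat entry.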

Option 2:

pandas also provides the .str accessor, which lets you call string methods on every element of a Series. So you could try df["Chat"].str.split()
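A small sketch of Option 2, again on made-up sample data. It chains .str methods to lowercase, drop punctuation with a regular expression (a choice made for this example, not taken from the question), and split on whitespace. Stop-word removal is left out here because it has no direct .str equivalent.

```python
import pandas as pd

# Made-up sample data standing in for the real column.
df = pd.DataFrame({"Chat": ["Hello there, how are you?", "I am fine!"]})

tokens = (
    df["Chat"]
    .str.lower()                              # lowercase every string
    .str.replace(r"[^\w\s]", "", regex=True)  # drop punctuation characters
    .str.split()                              # split each string on whitespace
)
print(tokens.tolist())
# [['hello', 'there', 'how', 'are', 'you'], ['i', 'am', 'fine']]
```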

How to tokenize a Python pandas Series of strings
