Question

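I used NLTK to pos_tag sentences in a pandas DataFrame from an old Yelp competition. This returns a list of tuples (word, POS). I'd like to count the number of each part of speech for each instance. How would I, say, create a function to count the number of verbs in each review? I know how to apply a function to a feature (column), so that's not the problem. What I can't work out is how to count things inside tuples inside lists inside a pandas column.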

The head is here, as a tsv: https://pastebin.com/FnnBq9rf

Answer 1

For example, for a DataFrame df, you can use this code to save the noun count of the "reviews" column into a new column, "noun_count".

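from nltk import pos_tag, word_tokenize

def NounCount(x):
    # Tokenize and tag the raw review text, then count tags starting with 'NN'
    nounCount = sum(1 for word, pos in pos_tag(word_tokenize(x)) if pos.startswith('NN'))
    return nounCount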

df["noun_count"] = df["reviews"].apply(NounCount)

df.to_csv('./dataset.csv')

Answer 2

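There are many ways to do this. One very straightforward way is to map the list (or pandas Series) of tuples to an indicator of whether each word is a verb, and then count the 1's you have.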

Assuming you have something like this (correct me if not, since you didn't provide an example):

import pandas as pd

a = pd.Series([("run", "verb"), ("apple", "noun"), ("play", "verb")])

You can do the following to map over the Series and sum up the counts:

a.map(lambda x: 1 if x[1]== "verb" else 0).sum()

This returns 2.

I grabbed a sentence from the link you shared:

import nltk

text = nltk.word_tokenize("My wife took me here on my birthday for breakfast and it was excellent.")
tag = nltk.pos_tag(text)
a = pd.Series(tag)
# "VBD" is the Penn Treebank tag for past-tense verbs
a.map(lambda x: 1 if x[1]== "VBD" else 0).sum()
# this returns 2
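
Note that an exact comparison with "VBD" only counts past-tense verbs. In the Penn Treebank tag set that nltk.pos_tag uses by default for English, all verb tags start with "VB" (VB, VBD, VBG, VBN, VBP, VBZ), so a prefix match is one way to count every verb. A minimal sketch: count_pos is a hypothetical helper name, and the tags below are written by hand for illustration rather than taken from real tagger output.

```python
def count_pos(tagged, prefix):
    """Count (word, tag) pairs whose tag starts with the given prefix."""
    return sum(1 for _word, tag in tagged if tag.startswith(prefix))

# Hand-written tags for illustration only (assumed, not actual nltk.pos_tag output).
tagged = [("She", "PRP"), ("likes", "VBZ"), ("running", "VBG"),
          ("and", "CC"), ("ran", "VBD"), ("yesterday", "NN")]

# Exact match: only past-tense verbs are counted.
exact_vbd = sum(1 for _word, tag in tagged if tag == "VBD")

# Prefix match: every tag beginning with "VB" is counted.
all_verbs = count_pos(tagged, "VB")

print(exact_vbd, all_verbs)
```

If the tagged lists live in a column such as df["pos_tag"] (the column name used in Answer 3), something like df["pos_tag"].apply(lambda t: count_pos(t, "VB")) should give a per-review verb count.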

Answer 3

Thanks for the help, @zhangyulin. After two days I've learned some really important things (as a novice programmer!). Here's the solution!

def NounCounter(x):
    # x is a list of (word, POS) tuples; collect the words tagged as nouns
    nouns = []
    for (word, pos) in x:
        if pos.startswith("NN"):
            nouns.append(word)
    return nouns

df["nouns"] = df["pos_tag"].apply(NounCounter)
df["noun_count"] = df["nouns"].str.len()
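
If you want counts for every tag at once rather than just nouns, collections.Counter can tally the tags in a single pass. A rough sketch, with pos_counts as a made-up helper name and hand-written tags standing in for real pos_tag output:

```python
from collections import Counter

def pos_counts(tagged):
    """Return a Counter mapping each POS tag to how often it occurs."""
    return Counter(tag for _word, tag in tagged)

# Hand-written tags for illustration only (assumed, not actual tagger output).
tagged = [("wife", "NN"), ("took", "VBD"), ("breakfast", "NN"),
          ("was", "VBD"), ("excellent", "JJ")]

counts = pos_counts(tagged)

# A Counter returns 0 for tags that never occurred, so lookups never raise KeyError.
print(counts["NN"], counts["VBD"], counts["JJ"], counts["RB"])
```

Applied to the column from above, df["pos_tag"].apply(pos_counts) should give one Counter per review, from which individual tag counts can be read off as needed.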

Create a function to count the number of POS tags in a pandas instance
