Question

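I am having trouble iterating over the rows of a pandas DataFrame. For each row (which contains a string), I need to determine the following:

the count of each punctuation mark in the string;
the number of uppercase letters.

To address the first point, I tried the following on a single string, to see whether the approach would also work on a DataFrame: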

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

t = "Have a non-programming question?"
t_low = t.lower()
stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(t_low)

# Keep only the tokens that are not stop words
m = [w for w in word_tokens if w not in stop_words]

Then, after tokenizing, I count the tokens:

import string
from collections import Counter


c = Counter(word_tokens)  

for x in string.punctuation:
    print(x, c[x])

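For the second point, I applied the following to the sentence:

sum(1 for c in t if c.isupper())

However, this only works on a single string. I have a pandas DataFrame that looks like this: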

Text
"Have a non-programming question?"
More helpful LINK!
Show SOME CODE... and so on...

I would like to know how to apply the code above to the DataFrame to get the same information. Any help would be great. Thanks.

Answer 1

You can do this by applying lambda functions to the DataFrame:

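import string

def Capitals(strng):
    # Number of uppercase characters in the string
    return sum(1 for c in strng if c.isupper())

def Punctuation(strng):
    # Total number of punctuation characters in the string
    return sum(1 for c in strng if c in string.punctuation)

# Apply each function row by row to the column of strings
df['Caps'] = df['name'].apply(lambda x: Capitals(x))
df['Punc'] = df['name'].apply(lambda x: Punctuation(x))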

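Caps is a new column holding the number of uppercase letters, and Punc is a new column holding the number of punctuation characters. name is the column containing the strings being tested.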
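Note that this gives one total per row, whereas the question also asks for the count of each punctuation mark. Here is a minimal sketch of one way to get that, assuming the column is called Text as in the question. The sample rows, the helper punctuation_counts, and the names PuncCounts and per_mark are illustrative only and not part of the original answer:

```python
import string
from collections import Counter

import pandas as pd

# Sample rows modelled on the question; the column name "Text" is assumed.
df = pd.DataFrame({
    "Text": [
        '"Have a non-programming question?"',
        "More helpful LINK!",
        "Show SOME CODE... and so on...",
    ]
})


def punctuation_counts(text):
    """Count each punctuation character that occurs in text."""
    return Counter(ch for ch in text if ch in string.punctuation)


# One Counter per row, mapping punctuation mark -> number of occurrences.
df["PuncCounts"] = df["Text"].apply(punctuation_counts)

# Expand the Counters into one column per punctuation mark seen anywhere;
# marks that do not occur in a given row are filled with 0.
per_mark = (
    pd.DataFrame([dict(c) for c in df["PuncCounts"]], index=df.index)
    .fillna(0)
    .astype(int)
)

# Uppercase letters per row, computed the same way as in the answer above.
df["Caps"] = df["Text"].apply(lambda s: sum(1 for ch in s if ch.isupper()))

print(per_mark)
print(df[["Text", "Caps"]])
```

Counting characters directly avoids tokenizing at all, so nltk is not needed for either of the two counts.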
