Question

I am trying to count the number of verbal contractions in some speeches I have collected. One particular speech looks like this:

speech = ("I've changed the path of the economy, and I've increased jobs in our own "
          "home state. We're headed in the right direction - you've all been a great help.")

So, in this case, I would like to count four (4) contractions. I have a list of contractions; here are the first few entries:

contractions = {"ain't": "am not; are not; is not; has not; have not",
"aren't": "are not; am not",
"can't": "cannot",...}

My code looks like this, to start:

count = 0
for word in speech:
    if word in contractions:
        count = count + 1
print(count)

However, I'm not getting the right result, because the code iterates over every letter instead of over whole words.

Answer 1

Use str.split() to split your string on whitespace:

for word in speech.split():

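This splits on arbitrary whitespace; that means spaces, tabs, newlines and a few more exotic whitespace characters, and any number of them in a row.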
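As a quick illustration of that behaviour (the sample string here is made up for the demo, not taken from the question):

```python
# Mixed whitespace: a double space, a tab, and a newline between the words.
sample = "We're  headed\tin the\nright direction"

# With no argument, split() treats any run of whitespace as one separator.
words = sample.split()
print(words)  # ["We're", 'headed', 'in', 'the', 'right', 'direction']
```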

You may need to lowercase your words using str.lower() (otherwise Ain't won't be found), and strip punctuation:

from string import punctuation

count = 0
for word in speech.lower().split():
    word = word.strip(punctuation)
    if word in contractions:
        count += 1

I use the str.strip() method here; it removes everything found in the string.punctuation string from the start and end of a word.
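Put together, a minimal end-to-end sketch on the speech from the question could look like this. The three dictionary entries are hypothetical stand-ins, since the question only shows the first few keys of its dict:

```python
from string import punctuation

speech = ("I've changed the path of the economy, and I've increased jobs in our own "
          "home state. We're headed in the right direction - you've all been a great help.")

# Hypothetical entries: the real dict from the question is elided there.
contractions = {
    "i've": "I have",
    "we're": "we are",
    "you've": "you have",
}

count = 0
for word in speech.lower().split():
    # strip() only trims the ends, so the apostrophe inside "i've" survives.
    word = word.strip(punctuation)
    if word in contractions:
        count += 1

print(count)  # 4
```

One caveat: because the apostrophe is itself part of string.punctuation, a contraction that starts with one (such as 'tis) would lose it and no longer match its key.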

Answer 2

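You are iterating over a string, so the items are characters. To get the words from a string, you can use a naive method like str.split(), which does this for you: you then iterate over a list of strings, namely the words split on the argument of str.split() (default: split on whitespace). There is even re.split(), which is more powerful, but I don't think you need regular expressions to split the text.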
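For comparison, a small sketch of what re.split() could look like here. The sample text and the delimiter set (whitespace, commas, semicolons, periods) are my own assumptions for the demo:

```python
import re

text = "I've changed the economy, and we're done; really."

# Split on runs of whitespace, commas, semicolons, or periods.
parts = re.split(r"[\s,;.]+", text)

# The trailing period leaves an empty string at the end, so filter it out.
words = [part for part in parts if part]
print(words)
```

Note that the apostrophe is deliberately not in the delimiter set, so contractions stay in one piece.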

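The least you have to do is lowercase your string with str.lower(), or put all possible spellings (including capitalized ones) in the dictionary. I strongly recommend the first alternative; the second is not really practicable. Removing punctuation is also a must. But all of this is still naive. If you need a more sophisticated approach, you have to split the text with a word tokenizer. NLTK is a good starting point for that, see the nltk tokenizer. But I strongly feel that this is not your main problem and won't stop you from solving your question. :)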

speech = """I've changed the path of the economy, and I've increased jobs in our own home state. We're headed in the right direction - you've all been a great help."""
# Maybe this dict makes more sense (list items as values). But for your question it doesn't matter.
contractions = {"ain't": ["am not", "are not", "is not", "has not", "have not"], "aren't": ["are not", "am not"], "i've": ["i have", ]} # ...

# With re you can define advanced regexes, but maybe
# `from string import punctuation` (the suggestion from Martijn Pieters' answer)
# is still enough for you.
import re

def abbreviation_counter(input_text, abbreviation_dict):   
    count = 0
    # What you want is a list of words. str.split() does this job for you.
    # Passing " " splits on single spaces only; calling split() with no
    # argument splits on any run of whitespace instead. If you really need
    # better methods (see answer text above), you have to use a word
    # tokenizer tool or write your own.
    for word in input_text.split(" "):
        # Also clean the word (remove ',', ';', ...) afterwards. The advantage of
        # using re over `from string import punctuation` is that you have more
        # control over what you want to remove: you can easily add or remove
        # any punctuation mark. That can be very handy. It can also be
        # overkill. If the latter is the case, just stick to Martijn Pieters'
        # solution.
        if re.sub(',|;', '', word).lower() in abbreviation_dict:
            count += 1

    return count

print(abbreviation_counter(speech, contractions))
# Output: 2 (yeah, it worked - I've included I've in your list :)

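Answering at the same time as Martijn Pieters is a little frustrating ;) but I hope I've still added some value for you. That's why I edited my answer to give you some hints for future work.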

Answer 3

A for loop in Python iterates over all elements of an iterable. In the case of a string, the elements are its characters.
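A tiny demonstration of that, using just the start of the speech (the variable names are only for this demo):

```python
speech_start = "I've changed"

# Looping over a string yields one character at a time, never whole words.
letters = [item for item in speech_start]
print(letters)
# ['I', "'", 'v', 'e', ' ', 'c', 'h', 'a', 'n', 'g', 'e', 'd']
```

A single character can never equal a key like "can't", which is why the original count does not come out right.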

You need to split the string into a list (or tuple) of strings containing the words. You can use .split(delimiter) for this.

Your problem is very common, so Python has a shortcut: speech.split() splits on any number of spaces/tabs/newlines, so you only get your words in the list.

So your code should look like this:

count = 0
for word in speech.split():
    if word in contractions:
        count = count + 1
print(count)

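speech.split(" ") works too, but it splits only on spaces, not on tabs or newlines, and if there are double spaces you will get empty elements in the resulting list.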
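To make the difference concrete, here is a short comparison on a made-up sample string:

```python
sample = "right  direction\tnow"  # a double space, then a tab

on_spaces = sample.split(" ")   # splits at every single space only
on_whitespace = sample.split()  # splits on any run of whitespace

print(on_spaces)      # ['right', '', 'direction\tnow']
print(on_whitespace)  # ['right', 'direction', 'now']
```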
