I'm trying to count the number of contractions in some speeches I've collected. One particular speech looks like this:
speech = ("I've changed the path of the economy, and I've increased jobs in our own "
          "home state. We're headed in the right direction - you've all been a great help.")
So in this case I'd like to count four (4) contractions. I have a dictionary of contractions; here are the first few entries:
contractions = {"ain't": "am not; are not; is not; has not; have not",
"aren't": "are not; am not",
"can't": "cannot",...}
My code started out like this:
count = 0
for word in speech:
    if word in contractions:
        count = count + 1
print(count)
However, the loop iterates over every character rather than over whole words, so I don't get the count I expect.
Answer 0 (score: 5)
Use str.split() to split the string on whitespace:
for word in speech.split():
This splits on arbitrary whitespace: spaces, tabs, newlines and a few more exotic whitespace characters, in any number and combination.
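As a quick illustration of that behaviour, here is a minimal sketch. The sample string is made up for the demonstration and is not taken from the question.

```python
# Hypothetical sample text mixing a double space, a tab and a newline.
sample = "We're  headed\tin the\nright direction"

# With no argument, split() treats any run of whitespace as one separator.
words = sample.split()
print(words)
```

Each word comes out as its own list item, regardless of which whitespace characters separated it from its neighbours.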
You probably also want to lowercase your words with str.lower() (otherwise Ain't won't be found) and strip punctuation:
from string import punctuation

count = 0
for word in speech.lower().split():
    word = word.strip(punctuation)
    if word in contractions:
        count += 1
I use the str.strip() method here; it removes every character found in the string.punctuation string from the start and end of the word.
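To see why stripping is safe for contractions, consider this small sketch. The token list is hand-picked for illustration; only the first four tokens correspond to pieces of the speech in the question.

```python
from string import punctuation

# strip() only trims characters at the two ends of the string, so the
# apostrophe inside a contraction is left alone.
tokens = ["economy,", "state.", "-", "I've", 'help."']
cleaned = [token.strip(punctuation) for token in tokens]
print(cleaned)
```

Note that a lone dash becomes an empty string. That is harmless here, because an empty string is not a key in the contractions dictionary.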
Answer 1 (score: 1)
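You are iterating over a string, so the items you get are characters. To get the words out of a string, you can use a naive method such as str.split(), which does this for you: you then iterate over a list of strings, namely the words split on the argument passed to str.split() (by default it splits on whitespace). There is also re.split(), which is more powerful, but I don't think you need a regular expression to split this text.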
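At the very least you should lowercase your string with str.lower(), or else put every possible variant (including capitalised ones) into the dictionary. I strongly recommend the first option; the second is not really practical. Removing punctuation is also a must. Even so, this is still a naive approach. If you need something more sophisticated, you will have to split the text with a word tokenizer. NLTK is a good starting point; see the nltk tokenizer. But I strongly suspect tokenization is not your main problem and won't stop you from solving it. :)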
speech = """I've changed the path of the economy, and I've increased jobs in our own home state. We're headed in the right direction - you've all been a great help."""

# Maybe this dict makes more sense (lists as values), but for your question it doesn't matter.
contractions = {"ain't": ["am not", "are not", "is not", "has not", "have not"], "aren't": ["are not", "am not"], "i've": ["i have"]}  # ...

# With re you can define advanced patterns, but maybe
# `from string import punctuation` (the suggestion in Martijn Pieters' answer)
# is still enough for you.
import re

def abbreviation_counter(input_text, abbreviation_dict):
    count = 0
    # What you want is a list of words; str.split() does this job for you.
    # Passing " " splits on single spaces only; omit the argument to split on
    # any whitespace. If you really need better methods (see the answer text
    # above), use a word tokenizer tool or write your own.
    for word in input_text.split(" "):
        # Also clean each word (remove ',', ';', ...). The advantage of re over
        # `from string import punctuation` is finer control over what gets
        # removed: you can easily add or drop any punctuation mark. That can be
        # handy, but it can also be overkill. If so, just stick to
        # Martijn Pieters' solution.
        if re.sub(',|;', '', word).lower() in abbreviation_dict:
            count += 1
    return count

print(abbreviation_counter(speech, contractions))
# 2  (yeah, it worked; I've added "i've" to your dictionary :)
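It's a little frustrating to post an answer at the same time as Martijn Pieters ;) but I hope I have still added some value for you. That's why I edited my answer to give you a few hints for future work.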
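If a regular expression is wanted after all, one possible variation on the approach in this answer is to extract the words directly with re.findall() instead of splitting and then cleaning. This is only a sketch: the function name count_contractions and the three dictionary entries are assumptions made for the example, not part of the original question or answer, and the pattern only handles the plain ASCII apostrophe.

```python
import re

speech = ("I've changed the path of the economy, and I've increased jobs "
          "in our own home state. We're headed in the right direction - "
          "you've all been a great help.")

# Illustrative entries only; a real dictionary would be much longer.
contractions = {"i've": ["i have"], "we're": ["we are"], "you've": ["you have"]}

def count_contractions(text, known):
    """Count the words in `text` that are keys of `known` (case-insensitive)."""
    # Runs of letters, optionally joined by internal apostrophes ("i've").
    # Surrounding punctuation is never part of a match.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return sum(1 for word in words if word in known)

print(count_contractions(speech, contractions))  # prints 4
```

The trade-off is the one described above: the pattern gives precise control over what counts as a word, at the cost of being harder to read than str.split() plus str.strip().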
Answer 2 (score: 0)
A for loop in Python iterates over all the elements of an iterable. In the case of a string, the elements are characters.
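A short sketch makes this visible; the variable names are chosen only for the demonstration.

```python
word = "I've"

# Looping over a string yields one character at a time.
chars = [ch for ch in word]
print(chars)  # ['I', "'", 'v', 'e']
```

None of these single characters is a key in the contractions dictionary, which is why the original loop never counts anything.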
You need to split the string into a list (or tuple) of strings containing the words. You can use .split(delimiter) for that.
Your problem is very common, so Python has a shortcut: speech.split() splits on any number of spaces, tabs or newlines, so the list contains only your words.
So your code should look like this:
count = 0
for word in speech.split():
    if word in contractions:
        count = count + 1
print(count)
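speech.split(" ") also works, but it splits only on spaces, not on tabs or newlines, and if the text contains double spaces you get empty strings in the resulting list.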
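The difference between the two calls can be checked with a small made-up string (not from the question) that contains a double space and a tab:

```python
# Hypothetical sample: two spaces after "headed", a tab after "in".
text = "headed  in\tthe right"

by_space = text.split(" ")   # splits on every single space character
by_whitespace = text.split() # splits on any run of whitespace

print(by_space)       # ['headed', '', 'in\tthe', 'right']
print(by_whitespace)  # ['headed', 'in', 'the', 'right']
```

With split(" "), the double space produces an empty string and the tab stays inside one item, so neither "in" nor "the" would be matched on its own.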