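The paragraph is meant to contain extra spaces and random punctuation, which I remove by calling .replace in a for loop. I then turn the paragraph into a list with .split() to get ['the', 'title', 'etc']. Next I wrote a function that counts how often a given word occurs, but I do not want to count the same word more than once, so I wrote another function that builds a list of unique words. What I still need is a for loop that prints each word and how many times it occurs, with output like this: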
The word The appears 2 times in the paragraph.
The word titled appears 1 times in the paragraph.
The word track appears 1 times in the paragraph.
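I am also having a hard time understanding what a for loop essentially does. I read that we should use for loops only for counting and while loops for everything else, but a while loop can also be used for counting.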
paragraph = """ The titled track “Heart Attack” does not interpret the
feelings of being in love in a serious way,
but with Chuu’s own adorable emoticon like ways. The music video has
references to historical and fictional
figures such as the artist Rene Magritte!!.... """
# Strip the punctuation; split() below already handles any run of whitespace.
for r in ((",", ""), ("!", ""), (".", "")):
    paragraph = paragraph.replace(*r)
paragraph_list = paragraph.split()

def count_words(word, word_list):
    word_count = 0
    for i in range(len(word_list)):
        if word_list[i] == word:
            word_count += 1
    return word_count

def unique(word):
    result = []
    for f in word:
        if f not in result:
            result.append(f)
    return result

unique_list = unique(paragraph_list)
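For the missing piece, a minimal sketch of the printing loop could look like the following. It assumes a cleaned word list like paragraph_list above; the names words, counts and report_lines are hypothetical, and a dictionary stands in for the two helper functions so that each word is counted in a single pass. A for loop simply visits each item of a sequence in turn, whereas a while loop doing the same job would need a manually managed index.

```python
# Sketch only: `words` stands in for the cleaned paragraph_list above.
words = "The titled track The music video".split()

# Count every word in one pass; dict keys keep first-seen order (Python 3.7+).
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# Build one output line per unique word.
report_lines = []
for word, n in counts.items():
    report_lines.append(
        "The word {} appears {} times in the paragraph.".format(word, n)
    )

print("\n".join(report_lines))
```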
Answer 0 (score: 3)
It is better to use re, together with get and a default value:
paragraph = """ The titled track “Heart Attack” does not interpret the
feelings of being in love in a serious way,
but with Chuu’s own adorable emoticon like ways. The music video has
references to historical and fictional
figures such as the artist Rene Magritte!!.... c c c c c c c ccc"""
import re

word_count = {}
# Split on spaces, commas, curly quotes, !, ?, . and newlines.
for w in re.split(r' |,|“|”|!|\?|\.|\n', paragraph.lower()):
    word_count[w] = word_count.get(w, 0) + 1
del word_count['']  # drop the empty string produced by adjacent separators

for k, v in word_count.items():
    print("The word {} appears {} time(s) in the paragraph".format(k, v))
Output:
The word the appears 4 time(s) in the paragraph
The word titled appears 1 time(s) in the paragraph
The word track appears 1 time(s) in the paragraph
...
How to handle Chuu’s is debatable. I decided not to split on ’, but that can be added later if needed.
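Update:
The following line splits paragraph.lower() with a regular expression. The advantage is that you can describe several separators at once:
re.split(r' |,|“|”|!|\?|\.|\n', paragraph.lower())
About this line:
word_count[w] = word_count.get(w, 0) + 1
word_count is a dictionary. The advantage of using get is that you can define a default value for the case where w is not yet in the dictionary. The line basically updates the count for the word w.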
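To make the role of the default value concrete, here is a small sketch comparing three equivalent ways of counting. The variable names (words, explicit, with_get, with_counter) are illustrative only, and collections.Counter is a standard-library alternative that the answer itself does not use.

```python
from collections import Counter

words = ["the", "track", "the"]

# Explicit version: check membership before incrementing.
explicit = {}
for w in words:
    if w in explicit:
        explicit[w] += 1
    else:
        explicit[w] = 1

# Same result with get(): a missing key is treated as 0.
with_get = {}
for w in words:
    with_get[w] = with_get.get(w, 0) + 1

# collections.Counter does the same bookkeeping internally.
with_counter = Counter(words)

print(explicit, with_get, dict(with_counter))
```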
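Answer 1 (score: 0)
Beware: the example text is simple, but punctuation rules can be complex, or may not be followed correctly. What if the text contains two adjacent spaces (incorrect, yes, but frequent)? What if the writer is more used to French and puts spaces before and after a colon or semicolon?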
I think the 's construct needs special handling. What about: """John has a bicycle. Mary says that her one is nicer that John's."""
IMHO the word John occurs twice here, while your algorithm will see 1 John and 1 Johns.
In addition, since Unicode text is now common on web pages, you should be prepared to find higher code point equivalents of spaces and punctuation:
“ U+201C LEFT DOUBLE QUOTATION MARK
” U+201D RIGHT DOUBLE QUOTATION MARK
’ U+2019 RIGHT SINGLE QUOTATION MARK
‘ U+2018 LEFT SINGLE QUOTATION MARK
U+00A0 NO-BREAK SPACE
Also, according to this older question, the best way to remove punctuation is translate. The linked question uses Python 2 syntax, but in Python 3 you can do:
import re
import string

paragraph = paragraph.strip()   # remove initial and terminal white spaces
paragraph = paragraph.translate(str.maketrans('“”’‘\xa0', '""\'\' '))  # fix high code punctuations
paragraph = re.sub(r"(\w)'s\b", r"\1", paragraph)   # remove 's but keep the word
paragraph = paragraph.translate(str.maketrans('', '', string.punctuation))  # remove punctuations
words = paragraph.split()
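As a rough check of the John example, the same steps can be wrapped in a function and run on that sentence. This is only a sketch: tokenize is a hypothetical helper name, the possessive is removed with re.sub, and collections.Counter is used for the tally as a convenience.

```python
import re
import string
from collections import Counter

def tokenize(text):
    """Hypothetical helper: normalise quotes, strip 's and punctuation, split."""
    text = text.strip()
    # Map typographic quotes and the no-break space to ASCII equivalents.
    text = text.translate(str.maketrans('“”’‘\xa0', '""\'\' '))
    # Drop a trailing possessive 's but keep the word itself.
    text = re.sub(r"(\w)'s\b", r"\1", text)
    # Delete all remaining ASCII punctuation.
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.split()

sample = "John has a bicycle. Mary says that her one is nicer that John’s."
counts = Counter(tokenize(sample))
print(counts["John"])
```

With this handling, both occurrences are counted under John and no separate Johns entry appears.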
Answer 2 (score: -1)
Please try the following:
paragraph = """ The titled track “Heart Attack” does not interpret the
feelings of being in love in a serious way,
but with Chuu’s own adorable emoticon like ways. The music video has
references to historical and fictional
figures such as the artist Rene Magritte!!.... c c c c c c c ccc"""
characterToRemove = (",", "!", ".", "?", '“', '”')
# Remove each unwanted character from the whole paragraph.
for i in characterToRemove:
    paragraph = paragraph.replace(i, "")

paragraph = paragraph.split()
uniqueWords = set(paragraph)

# Start every unique word at zero, then count the occurrences.
dictionartWords = {}
for i in uniqueWords:
    dictionartWords[i] = 0
for i in paragraph:
    if i in dictionartWords:
        dictionartWords[i] += 1
The resulting dictionary has the unique words as keys and, as values, the number of times each unique word occurs in the paragraph:
print(dictionartWords)
{'The': 2, 'like': 1, 'serious': 1, 'titled': 1, 'Rene': 1, 'a': 1, 'artist': 1, 'video': 1, 'c': 7, 'with': 1, 'track': 1, 'to': 1, 'fictional': 1, 'feelings': 1, 'ccc': 1, 'but': 1, 'not': 1, 'has': 1, 'interpret': 1, 'way': 1, 'as': 1, 'of': 1, 'emoticon': 1, 'Heart': 1, 'in': 2, 'adorable': 1, 'love': 1, 'references': 1, 'being': 1, 'Magritte': 1, 'Chuu’s': 1, 'historical': 1, 'such': 1, 'and': 1, 'does': 1, 'music': 1, 'the': 2, 'figures': 1, 'Attack': 1, 'own': 1, 'ways': 1}