Question

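The paragraph is meant to contain spaces and random punctuation, which I removed by calling .replace in a for loop. I then turned the paragraph into a list with .split() to get ['the', 'title', 'etc']. Next I wrote a function that counts how often a given word occurs, but I did not want it to count the same word over and over, so I wrote another function that builds a list of unique words. What I still need is a for loop that prints each word together with how many times it occurs, with output like this: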

The word The appears 2 times in the paragraph.
The word titled appears 1 times in the paragraph.
The word track appears 1 times in the paragraph.

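I also have trouble understanding what a for loop essentially does. I read that we should use for loops only for counting and while loops for everything else, yet a while loop can also be used for counting.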

paragraph = """  The titled track “Heart Attack” does not interpret the
feelings of being in love in a serious way,
but with Chuu’s own adorable emoticon like ways. The music video has
references to historical and fictional
figures such as the artist Rene Magritte!!....  """


for r in ((",", ""), ("!", ""), (".", ""), ("  ", "")):
    paragraph = paragraph.replace(*r)

paragraph_list = paragraph.split()


def count_words(word, word_list):

    word_count = 0
    for i in range(len(word_list)):
        if word_list[i] == word:
            word_count += 1
    return word_count

def unique(word):
    result = []
    for f in word:
        if f not in result:
            result.append(f)
    return result
unique_list = unique(paragraph_list)

Answer 1

It is best to use re, together with get and a default value:

paragraph = """  The titled track “Heart Attack” does not interpret the
feelings of being in love in a serious way,
but with Chuu’s own adorable emoticon like ways. The music video has
references to historical and fictional
figures such as the artist Rene Magritte!!....  c c c c c c c ccc"""

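import re

word_count = {}
# Split on spaces, commas, quotes, and sentence punctuation in one pass.
for w in re.split(r' |,|“|”|!|\?|\.|\n', paragraph.lower()):
    # get returns 0 when w has not been seen yet.
    word_count[w] = word_count.get(w, 0) + 1
# Adjacent delimiters produce empty strings; drop their count.
word_count.pop('', None)

for k, v in word_count.items():
    print("The word {} appears {} time(s) in the paragraph".format(k, v))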

Output:

The word the appears 4 time(s) in the paragraph
The word titled appears 1 time(s) in the paragraph
The word track appears 1 time(s) in the paragraph
...

How to handle Chuu’s is debatable. I decided not to split on ’, but you can add it later if needed.

Update:

The following line splits paragraph.lower() using a regular expression. The advantage is that you can specify several delimiters at once:

re.split(r' |,|“|”|!|\?|\.|\n', paragraph.lower())

About this line:

word_count[w] = word_count.get(w, 0) + 1

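word_count is a dictionary. The advantage of using get is that you can supply a default value for the case where w is not yet in the dictionary. The line essentially updates the count for the word w.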
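
To make the behaviour of get concrete, here is a minimal sketch on a tiny hand-made word list. The list `words` is an assumption made up for illustration and is not taken from the paragraph above. The sketch also shows `collections.Counter` from the standard library, which does the same bookkeeping.

```python
from collections import Counter

# Hypothetical sample data, used only for illustration.
words = ["the", "track", "the", "video"]

# Manual counting: get returns 0 for a word that is not in the dict yet.
word_count = {}
for w in words:
    word_count[w] = word_count.get(w, 0) + 1

# Counter performs the same counting in a single call.
counter = Counter(words)

for word, count in word_count.items():
    print("The word {} appears {} time(s) in the paragraph".format(word, count))
```

Both approaches should give the same counts; the manual loop is mainly useful for seeing what happens at each step.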

Answer 2

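Beware: the example text is simple, but punctuation rules can be complex, or may not be followed correctly. What if the text contains two adjacent spaces (yes, that is incorrect, but it happens often)? What if the writer is more used to French and puts spaces before and after a colon or semicolon?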

I think the 's construct needs special handling. What about: """John has a bicycle. Mary says that her one is nicer than John's.""" IMHO the word John occurs twice here, whereas your algorithm will see 1 John and 1 Johns.

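Furthermore, since Unicode text is now common on web pages, you should be prepared to find higher code points that are equivalent to spaces and punctuation: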

“ U+201C LEFT DOUBLE QUOTATION MARK
” U+201D RIGHT DOUBLE QUOTATION MARK
’ U+2019 RIGHT SINGLE QUOTATION MARK
‘ U+2018 LEFT SINGLE QUOTATION MARK
  U+00A0 NO-BREAK SPACE
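
If you are unsure which of these characters a text actually contains, the standard unicodedata module can name them. A minimal sketch, where the string `sample` is an assumption made up for illustration:

```python
import unicodedata

# Hypothetical sample containing typographic quotes and a no-break space.
sample = "“Heart\u00a0Attack” Chuu’s"

# Collect every non-ASCII character with its code point and official name.
non_ascii = {}
for ch in sample:
    if ord(ch) > 127:
        non_ascii[ch] = "U+{:04X} {}".format(ord(ch), unicodedata.name(ch))

for description in non_ascii.values():
    print(description)
```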

Also, according to this older question, the best way to remove punctuation is translate. The linked question uses Python 2 syntax, but in Python 3 you can do the following:

import re
import string

paragraph = paragraph.strip()                   # remove initial and terminal white spaces
paragraph = paragraph.translate(str.maketrans('“”’‘\xa0', '""\'\' '))  # fix high code punctuations
paragraph = re.sub(r"(?<=\w)'s\b", "", paragraph)  # remove 's after a word
paragraph = paragraph.translate(str.maketrans('', '', string.punctuation))  # remove punctuations
words = paragraph.split()
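
As a rough check of the steps above, here is a sketch that wraps them in a function and applies them to the John example. The function name `clean_words` is a hypothetical helper introduced for illustration, and the regular expression is just one possible way to drop a trailing 's.

```python
import re
import string

def clean_words(text):
    """Return the words of text with punctuation and possessive 's removed."""
    text = text.strip()
    # Map typographic quotes and the no-break space to ASCII equivalents.
    text = text.translate(str.maketrans('“”’‘\xa0', '""\'\' '))
    # Drop a possessive 's that directly follows a word character.
    text = re.sub(r"(?<=\w)'s\b", "", text)
    # Delete every remaining ASCII punctuation character.
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.split()

words = clean_words("John has a bicycle. Mary says that her one is nicer than John’s.")
print(words.count("John"))
```

Note that this treatment also shortens contractions such as it's to it, which may or may not be what you want.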

Answer 3

Try the following:

paragraph = """  The titled track “Heart Attack” does not interpret the 
feelings of being in love in a serious way, 
but with Chuu’s own adorable emoticon like ways. The music video has 
references to historical and fictional 
figures such as the artist Rene Magritte!!....  c c c c c c c ccc"""

characterToRemove = (",","!",".","?",'“','”')
# Strip each unwanted character from the paragraph.
for i in characterToRemove:
    paragraph = paragraph.replace(i,"")

paragraph=paragraph.split()
uniqueWords=set(paragraph)
# Start every unique word at a count of zero.
dictionartWords={}
for i in uniqueWords:
    dictionartWords[i]=0

# Increment the count once per occurrence.
for i in paragraph:
    dictionartWords[i]+=1

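As a result, you get a dictionary whose keys are the unique words and whose values are the number of times each unique word appears in the paragraph: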

print(dictionartWords)

{'The': 2, 'like': 1, 'serious': 1, 'titled': 1, 'Rene': 1, 'a': 1, 'artist': 1, 'video': 1, 'c': 7, 'with': 1, 'track': 1, 'to': 1, 'fictional': 1, 'feelings': 1, 'ccc': 1, 'but': 1, 'not': 1, 'has': 1, 'interpret': 1, 'way': 1, 'as': 1, 'of': 1, 'emoticon': 1, 'Heart': 1, 'in': 2, 'adorable': 1, 'love': 1, 'references': 1, 'being': 1, 'Magritte': 1, 'Chuu’s': 1, 'historical': 1, 'such': 1, 'and': 1, 'does': 1, 'music': 1, 'the': 2, 'figures': 1, 'Attack': 1, 'own': 1, 'ways': 1}
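
To print each word in the format asked for in the question, you can walk over the dictionary with a for loop. A minimal sketch, assuming a small hand-made dictionary `counts` that stands in for dictionartWords; sorting the keys is an extra choice made here so that the output order is predictable.

```python
# Hypothetical stand-in for the dictionary built above.
counts = {"The": 2, "titled": 1, "track": 1}

# Build one output line per unique word.
lines = []
for word in sorted(counts):
    lines.append("The word {} appears {} times in the paragraph.".format(word, counts[word]))

for line in lines:
    print(line)
```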

How to print the frequency of each unique word in a string using a for loop in Python
