Python: I want to count the occurrences of words in a string

Time: 2016-09-22 13:24:32

Tags: python count

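I managed to do that, but the case I am struggling with is that I have to treat spelling variants such as 'colour' and 'color' as the same word and return the count accordingly. To do this, I wrote a dictionary of common words whose spelling differs between American and British English, but I am fairly sure this is not the right approach.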

ukus={'COLOUR':'COLOR','CHEQUE':'CHECK',
'PROGRAMME':'PROGRAM','GREY':'GRAY',
'JEWELLERY':'JEWELRY','ALUMINIUM':'ALUMINUM',
'THEATER':'THEATRE','LICENSE':'LICENCE','ARMOUR':'ARMOR',
'ARTEFACT':'ARTIFACT','CENTRE':'CENTER',
'CYPHER':'CIPHER','DISC':'DISK','FIBRE':'FIBER',
'FULFILL':'FULFIL','METRE':'METER',
'SAVOURY':'SAVORY','TONNE':'TON','TYRE':'TIRE',
'COLOR':'COLOUR','CHECK':'CHEQUE',
'PROGRAM':'PROGRAMME','GRAY':'GREY',
'JEWELRY':'JEWELLERY','ALUMINUM':'ALUMINIUM',
'THEATRE':'THEATER','LICENCE':'LICENSE','ARMOR':'ARMOUR',
'ARTIFACT':'ARTEFACT','CENTER':'CENTRE',
'CIPHER':'CYPHER','DISK':'DISC','FIBER':'FIBRE',
'FULFIL':'FULFILL','METER':'METRE','SAVORY':'SAVOURY',
'TON':'TONNE','TIRE':'TYRE'}

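This is the dictionary I wrote to check the values. As you can see, this hurts performance. PyEnchant does not work with 64-bit Python. Could someone please help me? Thanks in advance.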

2 answers:

Answer 0: (score: 0)

Step 1: Create a temporary string and replace every word that appears as a value in your dict with its corresponding key:

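>>> # upper-case the text to match the dict keys, and pad it with spaces so the first and last words can match too
>>> temp_string = " {} ".format(my_string.upper())
>>> for k, v in ukus.items():
...     temp_string = temp_string.replace(" {} ".format(v), " {} ".format(k))  # <-- surround with spaces " " to replace whole words only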
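
The space-padded `replace` above still misses words that sit next to punctuation, and the second of two adjacent repeats. One possible alternative is a single regular-expression pass with word boundaries. This is only a sketch: `UK_TO_US`, `_PATTERN` and `normalize_spelling` are hypothetical names, and the mapping is a small one-way subset of the dict from the question.

```python
import re

# Hypothetical one-way mapping (UK -> US); a small subset for illustration.
UK_TO_US = {"COLOUR": "COLOR", "TYRE": "TIRE", "GREY": "GRAY"}

# One alternation of all UK spellings, anchored on word boundaries so that
# "COLOURS" or "TYRES" are left alone.
_PATTERN = re.compile(r"\b(?:" + "|".join(map(re.escape, UK_TO_US)) + r")\b")


def normalize_spelling(text):
    """Upper-case `text` and rewrite every UK spelling to its US form."""
    return _PATTERN.sub(lambda m: UK_TO_US[m.group(0)], text.upper())


print(normalize_spelling("colour colour, tyre."))  # COLOR COLOR, TIRE.
```

Because the mapping runs in one direction only, the result does not depend on the order in which the dict is iterated.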

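Step 2: Now, to find the words in the string, first split it into a list of words, then use collections.Counter() to count the occurrences of each element in the list. Here is the sample code: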

>>> from collections import Counter
>>> my_string = 'Hello World! Hello again. I am saying Hello one more time'
>>> count_dict = Counter(my_string.split())
# Value of count_dict:
# Counter({'Hello': 3, 'saying': 1, 'again.': 1, 'I': 1, 'am': 1, 'one': 1, 'World!': 1, 'time': 1, 'more': 1})
>>> count_dict['Hello']
3
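
Note that a plain `split()` keeps punctuation attached to the words ('again.' and 'World!' in the output above) and is case-sensitive. If that matters for your counts, one option is to extract the words with a regular expression first. A minimal sketch, where `count_words` is a hypothetical helper and the pattern is an assumption about what should count as a word:

```python
import re
from collections import Counter


def count_words(text):
    """Count words case-insensitively, ignoring surrounding punctuation."""
    # [a-z']+ keeps apostrophes so that "don't" stays a single token
    # (an assumption; adjust the pattern for digits or other alphabets).
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


counts = count_words('Hello World! Hello again. I am saying Hello one more time')
print(counts['hello'])  # 3
print(counts['again'])  # 1
```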

Step 3: Now, since you want counts for both "colour" and "color" in the dict, iterate over the dict again to add those values, and add missing values as 0:

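for k, v in ukus.items():
    # assumes count_dict was built from temp_string, where only one spelling of each pair remains,
    # so the larger of the two counts is the combined total; a Counter returns 0 for missing keys
    count_dict[k] = count_dict[v] = max(count_dict[k], count_dict[v])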
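
If you would rather skip the string replacement and combine the counts directly, the two spellings of each pair can be summed from the raw counts. The sketch below is one way to do that with a two-way dict like the one in the question without counting a pair twice; `merge_variant_counts` is a hypothetical helper, not something from the original post.

```python
from collections import Counter


def merge_variant_counts(counts, variants):
    """Return a Counter in which both spellings of each pair hold their combined total."""
    merged = Counter(counts)
    seen = set()
    for a, b in variants.items():
        pair = frozenset((a, b))
        if pair in seen:  # skip the reversed entry of a two-way dict
            continue
        seen.add(pair)
        # Read from the untouched input so earlier updates cannot leak in.
        total = counts.get(a, 0) + counts.get(b, 0)
        merged[a] = merged[b] = total
    return merged


raw = Counter("THE COLOUR AND THE COLOR".split())
variants = {'COLOUR': 'COLOR', 'COLOR': 'COLOUR', 'TYRE': 'TIRE', 'TIRE': 'TYRE'}
merged = merge_variant_counts(raw, variants)
print(merged['COLOR'], merged['COLOUR'], merged['TIRE'])  # 2 2 0
```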

Answer 1: (score: 0)

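OK, I think I have gathered enough information from your comments to offer this solution. The function below lets you choose between UK and US replacements (it defaults to US, but you can of course flip that) and lets you apply some light hygiene to the string.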

import re

# UK -> US spellings
# (the THEATER/THEATRE, LICENSE/LICENCE and FULFILL/FULFIL pairs run the other way in both dicts)
ukus={'COLOUR':'COLOR','CHEQUE':'CHECK',
'PROGRAMME':'PROGRAM','GREY':'GRAY',
'JEWELLERY':'JEWELRY','ALUMINIUM':'ALUMINUM',
'THEATER':'THEATRE','LICENSE':'LICENCE','ARMOUR':'ARMOR',
'ARTEFACT':'ARTIFACT','CENTRE':'CENTER',
'CYPHER':'CIPHER','DISC':'DISK','FIBRE':'FIBER',
'FULFILL':'FULFIL','METRE':'METER',
'SAVOURY':'SAVORY','TONNE':'TON','TYRE':'TIRE'}
# US -> UK spellings
usuk={'COLOR':'COLOUR','CHECK':'CHEQUE',
'PROGRAM':'PROGRAMME','GRAY':'GREY',
'JEWELRY':'JEWELLERY','ALUMINUM':'ALUMINIUM',
'THEATRE':'THEATER','LICENCE':'LICENSE','ARMOR':'ARMOUR',
'ARTIFACT':'ARTEFACT','CENTER':'CENTRE',
'CIPHER':'CYPHER','DISK':'DISC','FIBER':'FIBRE',
'FULFIL':'FULFILL','METER':'METRE','SAVORY':'SAVOURY',
'TON':'TONNE','TIRE':'TYRE'}

def str_wd_count(my_string, uk=False, hygiene=True):
    us = not uk
    # if the uk flag is True, normalize to UK spellings, else normalize to US spellings
    print("Using the " + ("UK" if uk else "US") + " dictionary for default words")

    # optional hygiene: replace non-alphanumeric characters with spaces for pure word counting
    if hygiene:
        my_string = re.sub(r'[^ \d\w]', ' ', my_string)
        my_string = re.sub(r' +', ' ', my_string)

    # build the list of all words in the text, each mapped to the chosen spelling;
    # split() without arguments also drops the empty strings that stray spaces would produce
    ttl_wds = [ukus.get(w, w) if us else usuk.get(w, w) for w in my_string.upper().split()]
    wd_counts = {}
    for wd in ttl_wds:
        wd_counts[wd] = wd_counts.get(wd, 0) + 1

    return wd_counts

As a usage example, consider the string

str1 = 'The colour of the dog is not the same as the color of the tire, or is it tyre, I can never tell which one will fulfill'

# Resulting sorted dict.items() With Default Settings
'[(THE,5),(TIRE,2),(COLOR,2),(OF,2),(IS,2),(FULFIL,1),(NEVER,1),(DOG,1),(SAME,1),(IT,1),(WILL,1),(I,1),(AS,1),(CAN,1),(WHICH,1),(TELL,1),(NOT,1),(ONE,1),(OR,1)]'

# Resulting sorted dict.items() With hygiene=False
'[(THE,5),(COLOR,2),(OF,2),(IS,2),(FULFIL,1),(NEVER,1),(DOG,1),(SAME,1),(TIRE,,1),(WILL,1),(I,1),(AS,1),(CAN,1),(WHICH,1),(TELL,1),(NOT,1),(ONE,1),(OR,1),(IT,1),(TYRE,,1)]'

# Resulting sorted dict.items() With UK Swap, hygiene=True
'[(THE,5),(OF,2),(IS,2),(TYRE,2),(COLOUR,2),(WHICH,1),(I,1),(NEVER,1),(DOG,1),(SAME,1),(OR,1),(WILL,1),(AS,1),(CAN,1),(TELL,1),(NOT,1),(FULFILL,1),(ONE,1),(IT,1)]'

# Resulting sorted dict.items() With UK Swap, hygiene=False
'[(THE,5),(OF,2),(IS,2),(COLOUR,2),(ONE,1),(I,1),(NEVER,1),(DOG,1),(SAME,1),(TIRE,,1),(WILL,1),(AS,1),(CAN,1),(WHICH,1),(TELL,1),(NOT,1),(FULFILL,1),(TYRE,,1),(IT,1),(OR,1)]'
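
The listings above are the dict's items sorted by count. A minimal sketch of how such a listing could be produced, assuming you want the highest counts first and ties in alphabetical order (the tie-breaking rule is an assumption, and `wd_counts` here is a hand-written sample rather than real output of the function):

```python
# Hand-written sample counts, standing in for the dict returned by str_wd_count.
wd_counts = {"THE": 5, "TIRE": 2, "COLOR": 2, "DOG": 1, "OF": 2}

# Sort by descending count, then alphabetically by word.
ranked = sorted(wd_counts.items(), key=lambda item: (-item[1], item[0]))
print(ranked)
# [('THE', 5), ('COLOR', 2), ('OF', 2), ('TIRE', 2), ('DOG', 1)]
```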

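You can use the resulting word-count dictionary in any way you like, and if you also need the modified version of the original string, it is easy to change the function to return that as well.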
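
For instance, a variant that returns the normalized string alongside the counts might look roughly like this. It is a sketch, not the function above: `count_and_normalize` and `UK_TO_US` are hypothetical names, the mapping is a two-entry subset, and hygiene is always applied.

```python
import re
from collections import Counter

# Hypothetical subset of the UK -> US mapping, for illustration only.
UK_TO_US = {"COLOUR": "COLOR", "TYRE": "TIRE"}


def count_and_normalize(text, mapping=UK_TO_US):
    """Return (word counts, normalized text) for `text`."""
    # Replace anything that is neither a word character nor whitespace with a space.
    cleaned = re.sub(r"[^\w\s]", " ", text).upper()
    words = [mapping.get(w, w) for w in cleaned.split()]
    return Counter(words), " ".join(words)


counts, normalized = count_and_normalize("The colour of the tyre, the color of the tire")
print(normalized)       # THE COLOR OF THE TIRE THE COLOR OF THE TIRE
print(counts["COLOR"])  # 2
```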