My sentence mixes Chinese, Korean, and English words. I used Python's len()
function, but it gives me the wrong answer. For example, with the string
a = '여보세요,我是Jason. Nice to meet you☺❤'
the correct word count (excluding punctuation) is 13, but len(a) = 32.
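A quick check suggests that len() is counting every Unicode code point (spaces, punctuation, and the emoji included) rather than words:

a = '여보세요,我是Jason. Nice to meet you☺❤'
print(len(a))                   # 32 code points in total
print(len('여보세요'))           # 4: each Hangul syllable is one code point
print(len('Nice to meet you'))  # 16: Latin letters and spaces each count as one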
How can I count the number of words correctly?
Thanks very much.
Answer 0 (score: 2)
You can take a look here. I removed the Chinese punctuation and counted the number of emoji.
import re
import emoji

IDEOGRAPHIC_SPACE = 0x3000


def is_asian(char):
    """Is the character Asian?"""
    return ord(char) > IDEOGRAPHIC_SPACE


def filter_jchars(c):
    """Filters Asian characters to spaces"""
    if is_asian(c):
        return ' '
    return c


def nonj_len(word):
    u"""Returns number of non-Asian words in {word}
    - 日本語AアジアンB -> 2
    - hello -> 1
    @param word: A word, possibly containing Asian characters
    """
    # Here are the steps:
    # 日spam本eggs
    # -> [' ', 's', 'p', 'a', 'm', ' ', 'e', 'g', 'g', 's']
    # -> ' spam eggs'
    # -> ['spam', 'eggs']
    # The length of which is 2!
    chars = [filter_jchars(c) for c in word]
    return len(''.join(chars).split())


def emoji_count(text):
    # Count the characters that appear in the emoji package's lookup table
    return len([i for i in text if i in emoji.UNICODE_EMOJI])


def get_wordcount(text):
    """Get the word/character count for text
    @param text: The text of the segment
    """
    characters = len(text)
    chars_no_spaces = sum([not x.isspace() for x in text])
    asian_chars = sum([is_asian(x) for x in text])
    non_asian_words = nonj_len(text)
    emoji_chars = emoji_count(text)
    words = non_asian_words + asian_chars + emoji_chars
    return dict(characters=characters,
                chars_no_spaces=chars_no_spaces,
                asian_chars=asian_chars,
                non_asian_words=non_asian_words,
                emoji_chars=emoji_chars,
                words=words)


def dict2obj(dictionary):
    """Transform a dictionary into an object"""
    class Obj(object):
        def __init__(self, dictionary):
            self.__dict__.update(dictionary)
    return Obj(dictionary)


def get_wordcount_obj(text):
    """Get the wordcount as an object rather than a dictionary"""
    return dict2obj(get_wordcount(text))


if __name__ == '__main__':
    a = '여보세요,我是Jason. Nice to meet you☺❤'
    # Strip the ASCII and Chinese punctuation before counting
    a = re.sub(r'[\.\!\/_,$%^*(+\"\']+|[+——!,。?、~@#¥%……&*():;《)《》“”()»〔〕-]+', "", a)
    b = get_wordcount_obj(a)
    print(b.words)
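With the sample string, b.words is intended to be 13: five whitespace-separated non-Asian tokens (Jason, Nice, to, meet, you, with the two emoji sticking to the last token), six Asian characters (여보세요 plus 我是), and two emoji. Note that emoji.UNICODE_EMOJI comes from older releases of the emoji package; newer releases replace it with emoji.EMOJI_DATA and emoji.is_emoji(), so the emoji count depends on the installed version.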
Answer 1 (score: 0)
Python's len, when applied to a string, gives you the number of characters in that string, not the number of words.
If you want to know the number of words in a string, you need to decide on a mechanism for defining what counts as a word. For plain English you could, for example, split on whitespace. For a mixed-language string containing Unicode characters you need to define custom rules, separating the cases where every single character is a word from the cases where words are delimited by spaces. In your example that means counting the English words separately from the Chinese characters, the Korean characters, and the emoji.
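A minimal sketch of such rules, using only the standard re module; the Unicode ranges below are assumptions chosen to cover the sample string (CJK ideographs, Hangul syllables, Latin letters, and the miscellaneous-symbols/dingbats block), not a general-purpose tokenizer:

import re

text = '여보세요,我是Jason. Nice to meet you☺❤'

cjk = re.findall(r'[\u4e00-\u9fff]', text)       # each Chinese ideograph is one word
hangul = re.findall(r'[\uac00-\ud7af]', text)    # each Korean syllable is one word
latin_words = re.findall(r'[A-Za-z]+', text)     # each run of Latin letters is one word
symbols = re.findall(r'[\u2600-\u27bf]', text)   # covers ☺ (U+263A) and ❤ (U+2764)

print(len(cjk) + len(hangul) + len(latin_words) + len(symbols))  # 13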