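I'm working on a personal project that counts how many times names are mentioned in a text. I know I can do this with collections.Counter(), but I'm not sure how to account for aliases efficiently.
For example, say one of the names I want to count is "Tim", but I also want to count any nicknames he goes by, such as "Timmy" and "Timster".
Given strings like "Oh Tim is going to the party?", "Yeah, my boy Timmy, wouldn't miss it, he loves to party!", and "Whoa, the Timster himself is going? Count me in!", I'd like all of them counted under a single variable such as "Tim". I know I could simply count each one separately and then add the counts together, but I feel there is a better way.
That is, I'd like my code to look more like this: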
names = {
    'Tim': {'Tim', 'Timmy', 'Timster'},
    # ... other names here.
}
# Add any occurrence of a Tim alias to Tim, and occurrences of other aliases to their main name.
as opposed to total_tim = Counter(tim) + Counter(timmy) + Counter(timster), etc., for each name. Does anyone know how I would do this?
Answer 0 (score: 2)
from collections import Counter
TEXT = '''
Blah Tim blah blah Timmy blah Timster blah Tim
Blah Bill blah blah William blah Billy blah Bill Bill
'''
words = TEXT.split()
# Base names and their aliases.
ALIASES = dict(
Tim = {'Tim', 'Timmy', 'Timster'},
Bill = {'Bill', 'William', 'Billy'},
)
# Given any name, find its base name.
BASE_NAMES = {a : nm for nm, aliases in ALIASES.items() for a in aliases}
# All names.
ALL_NAMES = set(nm for aliases in ALIASES.values() for nm in aliases)
# Count up all names.
detailed_tallies = Counter(w for w in words if w in ALL_NAMES)
# Then build the summary counts from those details.
summary_tallies = Counter()
for nm, n in detailed_tallies.items():
    summary_tallies[BASE_NAMES[nm]] += n
print(detailed_tallies)
print(summary_tallies)
# Counter({'Bill': 3, 'Tim': 2, 'Timmy': 1, 'Timster': 1, 'William': 1, 'Billy': 1})
# Counter({'Bill': 5, 'Tim': 4})
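One thing to watch for: TEXT.split() leaves punctuation attached to tokens, so "Timmy," in the question's sample sentences would not match the alias "Timmy". Below is a minimal sketch of the same alias-to-base-name idea, assuming a simple `\w+` tokenizer is good enough for the text at hand. The helper name `count_names` is hypothetical and not part of the answer above.

```python
import re
from collections import Counter

ALIASES = {
    'Tim': {'Tim', 'Timmy', 'Timster'},
    'Bill': {'Bill', 'William', 'Billy'},
}

# Reverse lookup: alias -> base name.
BASE_NAMES = {alias: base for base, aliases in ALIASES.items() for alias in aliases}

def count_names(text):
    """Tally base names, crediting every alias to its base name (hypothetical helper)."""
    # \w+ drops surrounding punctuation, so "Timmy," is tokenized as "Timmy".
    tokens = re.findall(r"\w+", text)
    return Counter(BASE_NAMES[t] for t in tokens if t in BASE_NAMES)

sentences = [
    "Oh Tim is going to the party?",
    "Yeah, my boy Timmy, wouldn't miss it, he loves to party!",
    "Whoa, the Timster himself is going? Count me in!",
]
print(count_names(" ".join(sentences)))
# Counter({'Tim': 3})
```

Counting straight into the base name skips the intermediate per-alias tally; keep the two-step version if the per-alias breakdown is also needed.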
Answer 1 (score: 1)
A regular expression can help solve this.
import re
your_dict = {"Tim":["Tim","Timmy","Timster"]}
s = "Oh Tim is going to the party? Yeah, my boy Timmy, wouldn't miss it, he loves to party! Whoa, the Timster himself is going? Count me in!"
for each in your_dict:
    print(each, "count = ", len(re.findall("|".join(sorted(your_dict[each], reverse=True)), s)))
If you want to ignore case, pass re.IGNORECASE as the flags argument to re.findall.
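To make the flag usage concrete, here is a sketch under a few assumptions that go beyond the answer: aliases are escaped with `re.escape`, matched as whole words via `\b`, and matched case-insensitively by default. The function name `count_aliases` and the modified sample string are hypothetical.

```python
import re

your_dict = {"Tim": ["Tim", "Timmy", "Timster"]}
s = ("Oh tim is going to the party? Yeah, my boy TIMMY, wouldn't miss it! "
     "Whoa, the Timster himself is going? No time to waste!")

def count_aliases(aliases, text, flags=re.IGNORECASE):
    # Longest aliases first, so the alternation tries "Timster" before "Tim".
    alternatives = "|".join(re.escape(a) for a in sorted(aliases, key=len, reverse=True))
    # \b ... \b restricts matches to whole words, so "time" is not counted.
    pattern = r"\b(?:" + alternatives + r")\b"
    return len(re.findall(pattern, text, flags=flags))

for base, aliases in your_dict.items():
    print(base, "count =", count_aliases(aliases, s))
# Tim count = 3
```

Without the word boundaries, the plain alternation would also count the "tim" inside "time" once case is ignored.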
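Answer 2 (score: 0)
Here is a very simple solution using regular expressions.
The advantage of this solution is that you don't need to name the variants explicitly. As long as you know how the variants of the person's name begin, you should be fine.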
from collections import Counter
import re
TEXT = '''
Blah Tim blah blah Timmy blah Timster blah Tim
Blah Bill blah blah William blah Billy blah Bill Bill
'''
# Raw strings keep the backslash escapes intact for the regex engine.
tim_search = r'(Tim([a-z]*)?(?=\ ?))'
bill_search = r'((B|W)ill([a-z]*)?(?=\ ?))'
def name_counter(regex_string):
    # findall returns one tuple of groups per match; the first group is the full name.
    return Counter([i for i, *j in re.findall(regex_string, TEXT)])
name_counter(tim_search)
# Counter({'Tim': 2, 'Timmy': 1, 'Timster': 1})
name_counter(bill_search)
# Counter({'Bill': 3, 'William': 1, 'Billy': 1})
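The counts above are still per variant, whereas the question asks for one total per person. A minimal sketch of folding them together, assuming one prefix pattern per base name; the `PATTERNS` dictionary and `summary_counts` function are hypothetical names introduced for illustration:

```python
import re
from collections import Counter

TEXT = '''
Blah Tim blah blah Timmy blah Timster blah Tim
Blah Bill blah blah William blah Billy blah Bill Bill
'''

# Hypothetical prefix patterns, one per base name.
PATTERNS = {
    'Tim': r'\bTim[a-z]*\b',
    'Bill': r'\b[BW]ill[a-z]*\b',
}

def summary_counts(text):
    # Every match of a base name's pattern is credited to that base name.
    return Counter({base: len(re.findall(p, text)) for base, p in PATTERNS.items()})

print(summary_counts(TEXT))
# Counter({'Bill': 5, 'Tim': 4})
```

Note that prefix patterns like these also match unrelated words such as "Time" or "Willow", so an explicit alias list is the safer choice when the text may contain them.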