I currently have some Python code that I would like to port to C++, because it is slower than I want it to be. The problem is that I use a dictionary in it whose keys are tuples made up of an object and a string (e.g. (obj, "word")). How exactly do I write something like that in C++? Or maybe my algorithm is terrible and there is some way to make it faster without resorting to C++?
The whole algorithm is below for clarity. The dictionary "post_score" is the problem.
def get_best_match_best(search_text, posts):
    """
    Find the best matches between a search query "search_text" and any of the
    strings in "posts".

    @param search_text: Query to find an appropriate match with in posts.
    @type search_text: string
    @param posts: List of candidates to match with target text.
    @type posts: [cl_post.Post]
    @return: Best matches of the candidates found in posts. The posts are ordered
             according to their rank. First post in list has best match and so on.
    @returntype: [cl_post.Post]
    """
    from math import log

    search_words = separate_words(search_text)
    total_number_of_hits = {}
    post_score = {}
    post_size = {}
    for search_word in search_words:
        total_number_of_hits[search_word] = 0.0
        for post in posts:
            post_score[(post, search_word)] = 0.0
            post_words = separate_words(post.text)
            post_size[post] = len(post_words)
            for post_word in post_words:
                possible_match = abs(len(post_word) - len(search_word)) <= 2
                if possible_match:
                    score = calculate_score(search_word, post_word)
                    post_score[(post, search_word)] += score
                    if score >= 1.0:
                        total_number_of_hits[search_word] += 1.0

    log_of_number_of_posts = log(len(posts))
    matches = []
    for post in posts:
        rank = 0.0
        for search_word in search_words:
            rank += post_score[(post, search_word)] * \
                    (log_of_number_of_posts - log(1.0 + total_number_of_hits[search_word]))
        matches.append((rank / post_size[post], post))
    matches.sort(reverse=True)
    return [post[1] for post in matches]
Answer 0 (score: 3)
map<pair<..., string>, ...>
if you really want to use C++.
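A minimal sketch of what that can look like, assuming a placeholder Post type (your real cl_post.Post class would take its place) and keying on a pointer to the post:

#include <map>
#include <string>
#include <utility>

// Placeholder for the real cl_post.Post class.
struct Post {
    std::string text;
};

int main() {
    Post post{"some post text"};

    // Counterpart of the Python post_score dict keyed by an (object, word) tuple.
    // std::pair already provides operator<, so it works as a std::map key as-is.
    std::map<std::pair<const Post*, std::string>, double> post_score;

    post_score[{&post, "word"}] += 1.0;  // like post_score[(post, "word")] += 1.0
    return 0;
}

If you want the closer analogue of a Python dict, std::unordered_map also works, but then you have to supply a hash function for the pair key yourself; std::map needs no extra code.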
Answer 1 (score: 2)
At the moment, you call separate_words(post.text) once for every search_word in search_words. You should call separate_words only once for every post in posts.
That is, instead of:
for search_word in search_words:
    for post in posts:
        # do heavy work
you should instead do:
for post in posts:
    # do the heavy work
    for search_word in search_words:
        ...
If, as I suspect, separate_words does a lot of string manipulation, don't forget that string operations are relatively expensive in Python because strings are immutable.
Another improvement you can make is that you don't have to compare every word in search_words with every word in post_words. If you sort both search_words and post_words by word length, you can use a sliding window technique. Basically, since a search_word can only match a post_word when their lengths differ by at most 2, you only need to check the window of post words whose lengths are within two of the search word's length, which cuts down the number of words to check, e.g.:
import collections
import itertools

search_words = sorted(search_words, key=len)

# Group the post's words by length so only words of similar length are compared.
g_post_words = collections.defaultdict(list)  # this can probably use a list of lists
for post_word in post_words:
    g_post_words[len(post_word)].append(post_word)

for search_word in search_words:
    l = len(search_word)
    # candidates = itertools.chain.from_iterable(g_post_words.get(m, []) for m in range(l - 2, l + 3))
    candidates = itertools.chain(g_post_words.get(l - 2, []),
                                 g_post_words.get(l - 1, []),
                                 g_post_words.get(l,     []),
                                 g_post_words.get(l + 1, []),
                                 g_post_words.get(l + 2, []))
    for post_word in candidates:
        score = calculate_score(search_word, post_word)
        # ... and the rest ...
(This code probably won't work as-is; it is just to illustrate the idea.)