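I have two lists of strings.
One contains the names of organisations from around the world (mostly universities). The names are not only in English, but they always use the Latin alphabet.
The other list mostly contains full addresses, in which strings (organisations) from the first list may appear.
An example: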
addresses = [
"Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium",
"Machine Learning and Computational Biology Research Group, Max Planck Institutes Tübingen, Tübingen, Germany 72076",
"Department of Computer Science and Engineering, University of Washington, Seattle, USA 98185",
"Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany 53754",
"Computer Science Department, University of California, Santa Barbara, USA 93106",
"Fraunhofer IAIS, Sankt Augustin, Germany",
"Department of Computer Science, Cornell University, Ithaca, NY",
"University of Wisconsin-Madison"
]
organisations = [
"Catholic University of Leuven",
"Fraunhofer IAIS",
"Cornell University of Ithaca",
"Tübingener Max Plank Institut"
]
As you can see, the desired mapping would be:
"Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium",
--> Catholic University of Leuven
"Machine Learning and Computational Biology Research Group, Max Planck Institutes Tübingen, Tübingen, Germany 72076",
--> Tübingener Max Plank Institut
"Department of Computer Science and Engineering, University of Washington, Seattle, USA 98185",
--> --
"Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany 53754",
--> Fraunhofer IAIS
"Computer Science Department, University of California, Santa Barbara, USA 93106",
--> --
"Fraunhofer IAIS, Sankt Augustin, Germany",
--> Fraunhofer IAIS
"Department of Computer Science, Cornell University, Ithaca, NY",
--> Cornell University of Ithaca
"University of Wisconsin-Madison",
--> --
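My idea was to use some kind of "distance algorithm" to compute the similarity of the strings. I can't simply check if organisation in address to find the organisation in an address, because the name may be written slightly differently in different places. So my first guess was to use the difflib module, in particular the difflib.get_close_matches() function, to pick the closest string from the organisations list for each address. But I'm not confident that the results will be accurate enough, and I don't know how high I should set the cutoff ratio that serves as the similarity measure.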
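For reference, the difflib idea could be sketched roughly as follows. This is only an illustration: the helper name match_organisation, the choice to split each address on commas, and the default cutoff of 0.6 are assumptions made for the example, not something that has been tuned.

```python
import difflib


def match_organisation(address, organisations, cutoff=0.6):
    """Return the organisation that best matches any comma-separated
    part of the address, or None if nothing reaches the cutoff."""
    best_name, best_ratio = None, 0.0
    for part in address.split(","):
        part = part.strip()
        # get_close_matches returns at most n candidates with ratio >= cutoff
        hits = difflib.get_close_matches(part, organisations, n=1, cutoff=cutoff)
        if not hits:
            continue
        ratio = difflib.SequenceMatcher(None, part, hits[0]).ratio()
        if ratio > best_ratio:
            best_name, best_ratio = hits[0], ratio
    return best_name


organisations = [
    "Catholic University of Leuven",
    "Fraunhofer IAIS",
    "Cornell University of Ithaca",
    "Tübingener Max Plank Institut",
]
address = "Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany 53754"
print(match_organisation(address, organisations))
```

Comparing each comma-separated part instead of the whole address keeps a long address from diluting the ratio. Translated names such as "Katholieke Universiteit Leuven" may still fall below the cutoff, so the threshold would have to be chosen by trying it on real data.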
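Before spending too much time experimenting with the difflib module, I wanted to ask more experienced people here whether this is the right approach, or whether there is a more suitable tool or method for my problem. Thanks!
PS: I don't need an optimal solution.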
Answer 0 (score: 2)
Use the following as your string distance function (instead of plain Levenshtein distance):
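def strdist(s1, s2):
    # levenshtein(a, b) is assumed to be an edit-distance function defined elsewhere
    # Compare only words longer than 3 characters, skipping short filler words
    words1 = set(w for w in s1.split() if len(w) > 3)
    words2 = set(w for w in s2.split() if len(w) > 3)
    if not words2:
        return 0  # nothing to compare against, so no shared words
    # For each word in s1, the edit distance to its closest word in s2
    scores = [min(levenshtein(w1, w2) for w2 in words2) for w1 in words1]
    # A word within 3 edits of some word in s2 counts as shared
    n_shared_words = len([s for s in scores if s <= 3])
    return -n_shared_words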
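The function above relies on a levenshtein helper that the answer does not define. A minimal sketch of one possible implementation (the standard dynamic-programming edit distance, written here as an assumption about what was intended) together with the same word-matching logic could look like this:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def strdist(s1, s2):
    # Same logic as the answer's strdist: count the long words of s1
    # that have a near match (at most 3 edits) among the long words of s2.
    words1 = set(w for w in s1.split() if len(w) > 3)
    words2 = set(w for w in s2.split() if len(w) > 3)
    if not words2:
        return 0
    scores = [min(levenshtein(w1, w2) for w2 in words2) for w1 in words1]
    return -len([s for s in scores if s <= 3])


print(strdist("Fraunhofer IAIS", "Fraunhofer IAIS, Sankt Augustin, Germany"))  # -2
```

A more negative value means more shared words, so the smallest value marks the best match. Because the words are split on whitespace only, trailing commas stay attached ("IAIS,"), which the tolerance of 3 edits absorbs.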
Then use the Munkres assignment algorithm, since there seems to be a 1:1 mapping between organisations and addresses.
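To illustrate the assignment step, here is a pure-Python sketch of the Hungarian (Munkres) method for a cost matrix with at most as many rows as columns. The function name hungarian and the numbers in the sample matrix are made up for the example; in practice cost[i][j] would be something like strdist(organisations[i], addresses[j]).

```python
def hungarian(cost):
    """Minimum-cost assignment for an n x m matrix with n <= m.
    Returns a list where entry i is the column assigned to row i."""
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0] * (n + 1)    # row potentials
    v = [0] * (m + 1)    # column potentials
    p = [0] * (m + 1)    # p[j]: row matched to column j (1-based, 0 = free)
    way = [0] * (m + 1)  # previous column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path that was found
        while j0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    result = [None] * n
    for j in range(1, m + 1):
        if p[j]:
            result[p[j] - 1] = j - 1
    return result


# Hypothetical costs: rows are organisations, columns are addresses
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
print(hungarian(cost))  # [1, 0, 2]
```

If SciPy is available, scipy.optimize.linear_sum_assignment solves the same problem. Note that a strict 1:1 assignment gives each organisation to only one address, while in the question's sample "Fraunhofer IAIS" appears in two addresses, so a pair whose cost shows no shared words probably should be discarded after the assignment.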
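Answer 1 (score: 0)
You can use soundex or metaphone to translate the sentences into lists of phonetic codes, then compare the lists to find the most similar ones.
Python implementations of both are available.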
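As a rough sketch of the phonetic idea, the following uses a simplified Soundex written from scratch. It is not taken from any particular library; the helper names, the rule that drops non-ASCII letters such as "ü", and the rule that ignores words of 3 characters or fewer are assumptions made for the example.

```python
# Soundex digit for each consonant; vowels and h, w, y have no digit
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Simplified Soundex: first letter plus up to three digits."""
    word = "".join(ch for ch in word.lower() if ch.isalpha() and ch.isascii())
    if not word:
        return ""
    encoded = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (encoded + "000")[:4]


def phonetic_codes(text):
    """Set of Soundex codes for the longer words in a string."""
    codes = {soundex(w) for w in text.replace(",", " ").split() if len(w) > 3}
    codes.discard("")  # tokens without letters, such as postal codes
    return codes


def shared_codes(a, b):
    """Number of phonetic codes the two strings have in common."""
    return len(phonetic_codes(a) & phonetic_codes(b))


print(soundex("Planck"), soundex("Plank"))  # P452 P452
```

Soundex keeps the first letter as is, so "Catholic" and "Katholieke" still get different codes (C342 and K342). Spelling variants like "Plank" and "Planck" match, but translated names may not.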