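So I want to compare two files/dictionaries using a binary search implementation (yes, this is very obviously homework).

One file is American English:

    Amazon
    Americana
    Americanization
    Civilization

The other file is British English:

    Amazon
    Americana
    Americanisation
    Civilisation

The code below should be pretty straightforward: import the files, compare them, return the differences. However, near the bottom, where it says `if entry == found_difference:`, it feels as if the debugger skips right over it. Even though I can see that the two variables in memory are different, I only get the very last element returned at the end. Where am I going wrong?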

    # File importer
    def wordfile_to_list(filename):
        """Converts a list of words to a Python list"""
        wordlist = []
        with open(filename) as f:
            for line in f:
                wordlist.append(line.rstrip("\n"))
        return wordlist

    # Binary search algorithm
    def binary_search(sorted_list, element):
        """Search for element in list using binary search. Assumes sorted list"""
        matches = []
        index_start = 0
        index_end = len(sorted_list)
        while (index_end - index_start) > 0:
            index_current = (index_end - index_start) // 2 + index_start
            if element == sorted_list[index_current]:
                return True
            elif element < sorted_list[index_current]:
                index_end = index_current
            elif element > sorted_list[index_current]:
                index_start = index_current + 1
            return element

    # Check file differences using the binary search algorithm
    def wordfile_differences_binarysearch(file_1, file_2):
        """Finds the differences between two plaintext lists,
        using binary search algorithm, and returns them in a new list"""
        wordlist_1 = wordfile_to_list(file_1)
        wordlist_2 = wordfile_to_list(file_2)
        matches = []
        for entry in wordlist_1:
            found_difference = binary_search(sorted_list=wordlist_2, element=entry)
            if entry == found_difference:
                pass
        else:
            matches.append(found_difference)
        return matches

    # Check if it works
    differences = wordfile_differences_binarysearch(file_1="british-english", file_2="american-english")
    print(differences)

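Answer 0 (score: 4)

Your `else` statement is not part of the `if` suite. Your `if` statement does nothing (it uses `pass` when the test is true, and is skipped otherwise).

The `else` belongs to the `for` loop instead:

    for entry in wordlist_1:
        # ...
    else:
        matches.append(found_difference)
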
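`for` loops can have an `else` suite too; it is executed when the loop completes without a `break` statement. So when your `for` loop completes, the current value of `found_difference` is appended, which is whatever was last assigned to that name.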
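
To make the `for ... else` behaviour concrete, here is a small sketch. The function names `first_negative` and `last_assigned` are made up for illustration and are not part of the question's code. The second function mimics the bug: the `else` suite runs once, after the loop, and only sees the last value.

```python
def first_negative(numbers):
    """Return the first negative number, or None if there is none."""
    for n in numbers:
        if n < 0:
            found = n
            break
    else:
        # Runs only when the loop finished without hitting `break`.
        found = None
    return found


def last_assigned(words):
    """Mimic the bug: the else suite belongs to the for loop.

    Assumes `words` is non-empty; with an empty list `current` is never bound.
    """
    collected = []
    for word in words:
        current = word
    else:
        # Executed once, after the loop, so only the last value is appended.
        collected.append(current)
    return collected


print(first_negative([3, -1, 2]))       # -1
print(first_negative([1, 2]))           # None
print(last_assigned(["a", "b", "c"]))   # ['c']
```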

If the `else` suite was meant to be part of the `if` test, fix your indentation:

    for entry in wordlist_1:
        found_difference = binary_search(sorted_list=wordlist_2, element=entry)
        if entry == found_difference:
            pass
        else:
            matches.append(found_difference)

However, you shouldn't use a `pass` statement there; just invert the test:
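
    matches = []
    for entry in wordlist_1:
        found_difference = binary_search(sorted_list=wordlist_2, element=entry)
        if entry != found_difference:
            matches.append(found_difference)

Note that the variable name `matches` feels off here; you are appending words that are missing from the other list, not words that match. Perhaps `missing` is a better variable name here.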

Note that your `binary_search()` function always returns `element`, the word you searched for. That will always be equal to the element you passed in, so you can't use it to detect whether a word differs! You need to unindent that last `return` line and return `False` instead:

    def binary_search(sorted_list, element):
        """Search for element in list using binary search. Assumes sorted list"""
        matches = []
        index_start = 0
        index_end = len(sorted_list)
        while (index_end - index_start) > 0:
            index_current = (index_end - index_start) // 2 + index_start
            if element == sorted_list[index_current]:
                return True
            elif element < sorted_list[index_current]:
                index_end = index_current
            elif element > sorted_list[index_current]:
                index_start = index_current + 1
        return False

Now you can replace the loop in `wordfile_differences_binarysearch()` with a list comprehension:

    [entry for entry in wordlist_1 if not binary_search(wordlist_2, entry)]

Last but not least, you don't have to reinvent the binary search wheel; the standard library's `bisect` module already implements it.
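
The answer's own `bisect` example did not survive in this copy, so the following is only a sketch of how such a membership test might look, not the original code. It uses `bisect.bisect_left` from the standard library; the bounds check is an addition here, needed because `bisect_left` returns `len(sorted_list)` when the element is larger than every item.

```python
from bisect import bisect_left


def binary_search(sorted_list, element):
    """Return True if element is in sorted_list (must be sorted ascending)."""
    index = bisect_left(sorted_list, element)
    # Guard against index == len(sorted_list) before indexing.
    return index < len(sorted_list) and sorted_list[index] == element


# Sample data taken from the question.
american = ["Amazon", "Americana", "Americanization", "Civilization"]
british = ["Amazon", "Americana", "Americanisation", "Civilisation"]

# British words that are missing from the American list.
missing = [word for word in british if not binary_search(american, word)]
print(missing)  # ['Americanisation', 'Civilisation']
```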
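
Answer 1 (score: 1)

Binary search is used to improve the efficiency of the algorithm, reducing the complexity from `O(n**2)` to `O(n * log n)`.

Since the naive approach would be to check every word in `wordlist1` against every word in `wordlist2`, its complexity would be `O(n**2)`.

Using binary search gets you `O(n * log n)`, which is already much better.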

Using sets, you could get `O(n)` on average:

    american = """Amazon
    Americana
    Americanization
    Civilization"""

    british = """Amazon
    Americana
    Americanisation
    Civilisation"""

    american = {line.strip() for line in american.split("\n")}
    british = {line.strip() for line in british.split("\n")}

You could get the American words that are not in the British dictionary:

    print(american - british)
    # {'Civilization', 'Americanization'}

You could get the British words that are not in the American dictionary:

    print(british - american)
    # {'Civilisation', 'Americanisation'}

You could get the union of the last two sets, i.e. the words that appear in exactly one dictionary:

    print(american ^ british)
    # {'Americanisation', 'Civilisation', 'Americanization', 'Civilization'}

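
As a rough sketch of how the set approach could be applied to the question's word files: the helper names `words_from_lines` and `word_differences` are hypothetical, and the functions accept any iterable of lines (an open file object works) so they can be tried without files on disk. Results are sorted only to make the output order predictable.

```python
def words_from_lines(lines):
    """Build a set of words from an iterable of lines, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


def word_differences(lines_1, lines_2):
    """Return (words only in lines_1, words only in lines_2) as sorted lists."""
    words_1 = words_from_lines(lines_1)
    words_2 = words_from_lines(lines_2)
    return sorted(words_1 - words_2), sorted(words_2 - words_1)


# In-memory stand-ins for the two files from the question.
american_lines = ["Amazon\n", "Americana\n", "Americanization\n", "Civilization\n"]
british_lines = ["Amazon\n", "Americana\n", "Americanisation\n", "Civilisation\n"]

only_american, only_british = word_differences(american_lines, british_lines)
print(only_american)  # ['Americanization', 'Civilization']
print(only_british)   # ['Americanisation', 'Civilisation']
```

With real files, the same function could be called as `word_differences(f1, f2)` inside `with open("american-english") as f1, open("british-english") as f2:`.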
This approach is faster and more concise than any binary search implementation. Still, if you really want to use binary search, you can't go wrong with @MartijnPieters' answer.

Since you know that the two lists are sorted, you could simply iterate over both sorted lists in parallel and look for any differences:

    american = """Amazon
    Americana
    Americanism
    Americanization
    Civilization"""

    british = """Amazon
    Americana
    Americanisation
    Americanism
    Civilisation"""

    american = [line.strip() for line in american.split("\n")]
    british = [line.strip() for line in british.split("\n")]

    n1, n2 = len(american), len(british)
    i, j = 0, 0

    while True:
        try:
            w1 = american[i]
            w2 = british[j]
            if w1 == w2:
                i += 1
                j += 1
            elif w1 < w2:
                print('%s is in american dict only' % w1)
                i += 1
            else:
                print('%s is in british dict only' % w2)
                j += 1
        except IndexError:
            break

    # One list is exhausted; whatever remains in the other is unique to it.
    for w1 in american[i:]:
        print('%s is in american dict only' % w1)

    for w2 in british[j:]:
        print('%s is in british dict only' % w2)

It outputs:

    Americanisation is in british dict only
    Americanization is in american dict only
    Civilisation is in british dict only
    Civilization is in american dict only

It is also `O(n)`.
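
The same parallel walk can also be written as a function that returns the differences instead of printing them, with explicit bounds checks in place of the `IndexError` handling. This is only a sketch; the name `sorted_differences` is made up for illustration, and both inputs are assumed to be sorted in ascending order.

```python
def sorted_differences(first, second):
    """Walk two sorted lists in parallel.

    Returns (items only in first, items only in second).
    """
    only_first, only_second = [], []
    i = j = 0
    while i < len(first) and j < len(second):
        if first[i] == second[j]:
            # Present in both lists: advance both cursors.
            i += 1
            j += 1
        elif first[i] < second[j]:
            # first[i] cannot appear later in `second`, which is sorted.
            only_first.append(first[i])
            i += 1
        else:
            only_second.append(second[j])
            j += 1
    # One list is exhausted; the remainder of the other is unique to it.
    only_first.extend(first[i:])
    only_second.extend(second[j:])
    return only_first, only_second


american = ["Amazon", "Americana", "Americanism", "Americanization", "Civilization"]
british = ["Amazon", "Americana", "Americanisation", "Americanism", "Civilisation"]

only_american, only_british = sorted_differences(american, british)
print(only_american)  # ['Americanization', 'Civilization']
print(only_british)   # ['Americanisation', 'Civilisation']
```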