Somewhere in my loop, the result is not appended to the list. Why?

Question

So I want to compare two files/dictionaries using a binary search implementation (yes, this is very obviously homework).

One file is

american-english

Amazon
Americana
Americanization
Civilization

The other file is

british-english

Amazon
Americana
Americanisation
Civilisation

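The code below should be pretty straightforward: import the files, compare them, return the differences. However, near the bottom, where it says entry == found_difference, it looks as if the debugger skips right over it, even though I can see that the two variables in memory are different, and I only get the last element returned at the end. Where am I going wrong?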

# File importer
def wordfile_to_list(filename):
    """Converts a list of words to a Python list"""

    wordlist = []

    with open(filename) as f:
        for line in f:
            wordlist.append(line.rstrip("\n"))

    return wordlist

# Binary search algorithm
def binary_search(sorted_list, element):
    """Search for element in list using binary search. Assumes sorted list"""
    matches = []

    index_start = 0
    index_end = len(sorted_list)
    while (index_end - index_start) > 0:
        index_current = (index_end - index_start) // 2 + index_start
        if element == sorted_list[index_current]:
            return True
        elif element < sorted_list[index_current]:
            index_end = index_current
        elif element > sorted_list[index_current]:
            index_start = index_current + 1
        return element


# Check file differences using the binary search algorithm
def wordfile_differences_binarysearch(file_1, file_2):
    """Finds the differences between two plaintext lists,
    using binary search algorithm, and returns them in a new list"""

    wordlist_1 = wordfile_to_list(file_1)
    wordlist_2 = wordfile_to_list(file_2)

    matches = []

    for entry in wordlist_1:
        found_difference = binary_search(sorted_list=wordlist_2, element=entry)
        if entry == found_difference:
            pass
    else:
        matches.append(found_difference)

    return matches


# Check if it works
differences = wordfile_differences_binarysearch(file_1="british-english", file_2="american-english")
print(differences)

Answer 1

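Your else statement is not part of the if suite. Your if statement does nothing (it executes pass when the test is true and is skipped otherwise).

The else suite belongs to the for loop instead:

for entry in wordlist_1:
    # ...
else:
    matches.append(found_difference)

A for loop can have an else suite as well; it is executed when the loop completes without a break statement. So when your for loop completes, the current value of found_difference is appended, that is, whatever was last assigned to that name.

If the else suite was meant to be part of the if test, fix the indentation:

for entry in wordlist_1:
    found_difference = binary_search(sorted_list=wordlist_2, element=entry)
    if entry == found_difference:
        pass
    else:
        matches.append(found_difference)

However, you shouldn't use a pass statement there; just invert the test:

matches = []
for entry in wordlist_1:
    found_difference = binary_search(sorted_list=wordlist_2, element=entry)
    if entry != found_difference:
        matches.append(found_difference)

Note that the variable name matches feels off here; you are appending words that are missing from the other list, not words that match. Perhaps missing is a better variable name.

Note that your binary_search() function always returns element, the word you searched for. That is always equal to the element you passed in, so you can't use it to detect whether a word differs! You need to unindent that last return line and return False instead:

def binary_search(sorted_list, element):
    """Search for element in list using binary search. Assumes sorted list"""
    matches = []

    index_start = 0
    index_end = len(sorted_list)
    while (index_end - index_start) > 0:
        index_current = (index_end - index_start) // 2 + index_start
        if element == sorted_list[index_current]:
            return True
        elif element < sorted_list[index_current]:
            index_end = index_current
        elif element > sorted_list[index_current]:
            index_start = index_current + 1
    return False

Now you can replace the loop in wordfile_differences_binarysearch() with a list comprehension:

[entry for entry in wordlist_1 if not binary_search(wordlist_2, entry)]

Last but not least, you don't have to reinvent the binary search wheel; just use the bisect module.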
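As a rough illustration of the bisect suggestion, here is a minimal sketch of a membership test built on bisect.bisect_left from the standard library. It assumes the list is already sorted; the sample word lists are taken from the question, and the function simply reuses the name binary_search for convenience rather than reproducing any code from the original answer.

```python
from bisect import bisect_left


def binary_search(sorted_list, element):
    """Return True if element is present in sorted_list (assumed sorted)."""
    # bisect_left returns the leftmost position where element could be inserted
    index = bisect_left(sorted_list, element)
    # The element is present only if that position is in range and holds it
    return index < len(sorted_list) and sorted_list[index] == element


# Sample data from the question (already sorted)
american = ["Amazon", "Americana", "Americanization", "Civilization"]
british = ["Amazon", "Americana", "Americanisation", "Civilisation"]

# Words in the British list that are missing from the American list
missing = [entry for entry in british if not binary_search(american, entry)]
print(missing)  # ['Americanisation', 'Civilisation']
```

The bounds check matters: bisect_left returns len(sorted_list) when the element sorts after every entry, so indexing without it would raise IndexError.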

Answer 2

Using sets

Binary search is used to improve the efficiency of an algorithm, reducing the complexity of a single lookup from O(n) to O(log n).

The naive approach checks every word in wordlist1 against every word in wordlist2, so its complexity is O(n**2).

Using binary search brings that down to O(n * log n), which is already much better.

Using sets, you can get O(n):

american = """Amazon
Americana
Americanization
Civilization"""

british = """Amazon
Americana
Americanisation
Civilisation"""

american = {line.strip() for line in american.split("\n")}
british = {line.strip() for line in british.split("\n")}

You can get the American words that are not in the British dictionary:

print(american - british)
# {'Civilization', 'Americanization'}

You can get the British words that are not in the American dictionary:

print(british - american)
# {'Civilisation', 'Americanisation'}

You can get the union of the last two sets, i.e. the words that appear in exactly one dictionary:

print(american ^ british)
# {'Americanisation', 'Civilisation', 'Americanization', 'Civilization'}

This approach is faster and more concise than any binary search implementation. But if you really want to use binary search, you cannot go wrong with @MartijnPieters' answer, as usual.

Using two iterators

Since you know that the two lists are sorted, you can simply walk over both sorted lists in parallel and look for any differences:

american = """Amazon
Americana
Americanism
Americanization
Civilization"""

british = """Amazon
Americana
Americanisation
Americanism
Civilisation"""

american = [line.strip() for line in american.split("\n")]
british = [line.strip() for line in british.split("\n")]

n1, n2 = len(american), len(british)
i, j = 0, 0

while True:
    try:
        w1 = american[i]
        w2 = british[j]
        if w1 == w2:
            i += 1
            j += 1
        elif w1 < w2:
            print('%s is in american dict only' % w1)
            i += 1
        else:
            print('%s is in british dict only' % w2)
            j += 1
    except IndexError:
        break

for w1 in american[i:]:
    print('%s is in american dict only' % w1)

for w2 in british[j:]:
    print('%s is in british dict only' % w2)

Output:

Americanisation is in british dict only
Americanization is in american dict only
Civilisation is in british dict only
Civilization is in american dict only

It is also O(n).
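The parallel walk can also be wrapped in a function that returns the differences instead of printing them. The following is only a sketch: the name sorted_differences is hypothetical, and it assumes both inputs are sorted and contain no duplicates.

```python
def sorted_differences(first, second):
    """Return (only_in_first, only_in_second) for two sorted, duplicate-free lists."""
    only_first, only_second = [], []
    i = j = 0
    # Advance through both lists in step, always moving past the smaller word
    while i < len(first) and j < len(second):
        if first[i] == second[j]:
            i += 1
            j += 1
        elif first[i] < second[j]:
            only_first.append(first[i])
            i += 1
        else:
            only_second.append(second[j])
            j += 1
    # Whatever remains in either list has no counterpart in the other
    only_first.extend(first[i:])
    only_second.extend(second[j:])
    return only_first, only_second


american = ["Amazon", "Americana", "Americanism", "Americanization", "Civilization"]
british = ["Amazon", "Americana", "Americanisation", "Americanism", "Civilisation"]

only_american, only_british = sorted_differences(american, british)
print(only_american)  # ['Americanization', 'Civilization']
print(only_british)   # ['Americanisation', 'Civilisation']
```

Each word is visited once, so the walk stays O(n), and unlike the set version the results come back in sorted order.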
