Longest chain where the last word of one line is the first word of the next

Time: 2016-10-22 00:05:16

Tags: python optimization

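OK, so I'm trying to find the longest chain in a text file in which the last word of one line is the first word of the next (it works well for poetry). The Python script I have below works fine, but it still takes a very long time to run. I'm no coding expert and really know nothing about optimization. Am I running through more options than necessary? How can I reduce the time it takes to run on a longer text?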

#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
import sys

# Opening the source text
with open("/text.txt") as g:
    all_lines = g.readlines()

def last_word(particular_line):
    if particular_line != "\n": 
        particular_line = re.sub(ur'^\W*|\W*$', "",particular_line)
        if len(particular_line) > 1:
            return particular_line.rsplit(None, 1)[-1].lower()

def first_word(particular_line):
    if particular_line != "\n": 
        particular_line = re.sub(ur'^\W*|\W*$', "",particular_line) 
        if len(particular_line) > 1:
            return particular_line.split(None, 1)[0].lower()

def chain(start, lines, depth):
    remaining = list(lines) 
    del remaining[remaining.index(start)]
    possibles = [x for x in remaining if (len(x.split()) > 2) and (first_word(x) == last_word(start))]
    maxchain = []
    for c in possibles:
        l = chain(c, remaining, depth)
        sys.stdout.flush()
        sys.stdout.write(str(depth) + " of " + str(len(all_lines)) + "   \r")
        sys.stdout.flush()
        if len(l) > len(maxchain):
            maxchain = l
            depth = str(depth) + "." + str(len(maxchain))
    return [start] + maxchain

#Start
final_output = []

#Finding the longest chain

for i in range (0, len(all_lines)):
    x = chain(all_lines[i], all_lines, i)
    if len(x) > 2:  
        final_output.append(x)
final_output.sort(key = len)

#Output on screen
print "\n\n--------------------------------------------"

if len(final_output) > 1: 
    print final_output[-1]
else: 
    print "Nothing found"

3 Answers:

Answer 0: (score: 1)

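import itertools

def matching_lines(line_pair):
    # True when the last word of the first line equals the first word of the second;
    # blank lines have no words and never match
    first, second = line_pair[0].split(), line_pair[1].split()
    return bool(first and second) and first[-1].lower() == second[0].lower()

# pair every line with the line that follows it (all_lines comes from the question's script)
line_pairs = itertools.izip(all_lines, all_lines[1:])
grouped_pairs = itertools.groupby(line_pairs, matching_lines)
# a run of k matching pairs covers k + 1 lines; the leading 0 covers "no match at all"
print max([0] + [len(list(y)) + 1 for x, y in grouped_pairs if x])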

I'm not certain it will be faster, but I think it will be, since it only iterates once and mostly uses builtins.
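For readers on Python 3, here is a minimal sketch of the same single-pass `groupby` idea. It assumes the chain has to follow the file's line order (unlike the question's script, which may pick lines from anywhere in the file), and the function name `longest_adjacent_run` is made up for illustration:

```python
import itertools

def longest_adjacent_run(lines):
    """Return how many lines the longest run of consecutive linked lines spans."""
    # Lowercase words of each line; a blank line becomes an empty list.
    words = [line.lower().split() for line in lines]
    # One flag per adjacent pair: does the earlier line end with the
    # word the later line starts with?
    links = [
        bool(a) and bool(b) and a[-1] == b[0]
        for a, b in zip(words, words[1:])
    ]
    # groupby collapses consecutive equal flags; k linked pairs span k + 1 lines.
    runs = [
        len(list(group)) + 1
        for linked, group in itertools.groupby(links)
        if linked
    ]
    return max(runs, default=0)
```

Punctuation is not stripped here, so "mat." and "mat" count as different words; the question's `first_word` and `last_word` helpers could be swapped in if that matters.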

Answer 1: (score: 0)

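Yes, this code has a complexity of $O(n^2)$. That means that if your file has n lines, the number of iterations your code performs is 1 * (n-1) for the first line, then 1 * (n-2) for the second line, and so on, with n such terms. For large n, this is roughly proportional to $n^2$. Actually, there is a bug in this line of the code: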

del remaining[remaining.index(start)]

where you probably meant to run this:

del remaining[:remaining.index(start)]

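(note the ':' in the square brackets). The line as written increases the running time: you now have (n-1) + (n-1) + .. + (n-1) = n * (n-1), which is somewhat greater than (n-1) + (n-2) + (n-3) + ...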
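To make the difference between the two statements concrete, here is a small sketch on a made-up four-element list (the contents are purely illustrative):

```python
lines = ["a", "b", "c", "d"]

# Deleting by index removes only the element at that position.
one_removed = list(lines)
del one_removed[one_removed.index("c")]

# Deleting a slice removes every element before that position
# and keeps the element itself.
prefix_removed = list(lines)
del prefix_removed[:prefix_removed.index("c")]

print(one_removed)     # ['a', 'b', 'd']
print(prefix_removed)  # ['c', 'd']
```

Note that the slice form leaves the start line itself in the list.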
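You can optimize the code like this: start with maxchainlen = 0 and curchainlen = 0. Now iterate over the lines, each time comparing the first word of the current line with the last word of the previous line. If they match, increase curchainlen by 1. If they don't, check whether maxchainlen < curchainlen; if so, assign maxchainlen = curchainlen, and reset curchainlen to 0. Once you have finished iterating, run the maxchainlen check one more time. For example: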

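# `lines` is the list of lines; first_word/last_word are the question's helpers
lw = last_word(lines[0])
curchainlen = 0
maxchainlen = 0
for l in lines[1:]:
    # blank lines make the helpers return None, which must not count as a match
    if lw is not None and lw == first_word(l):
        curchainlen = curchainlen + 1
    else:
        maxchainlen = max(maxchainlen, curchainlen)
        curchainlen = 0
    # remember this line's last word for the next comparison
    lw = last_word(l)
maxchainlen = max(maxchainlen, curchainlen)
# number of matching links; the chain itself has one more line
print(maxchainlen)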
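If the lines of the chain are wanted rather than just its length, the same single pass can track where the current run started. The following is only a sketch under a few assumptions: the names `edge_words` and `longest_run` are hypothetical, words are taken to be runs of `\w` characters (so "don't" splits in two, which differs slightly from the question's helpers), and `lines` is a list:

```python
import re

def edge_words(line):
    """Return (first_word, last_word) in lowercase, or None if the line has no words."""
    words = re.findall(r"\w+", line.lower())
    return (words[0], words[-1]) if words else None

def longest_run(lines):
    """Return the longest run of consecutive lines linked last-word to first-word."""
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    previous_last = None
    for index, line in enumerate(lines):
        edges = edge_words(line)
        if edges is not None and edges[0] == previous_last:
            run_len += 1                      # this line extends the current run
        elif edges is not None:
            run_start, run_len = index, 1     # this line starts a new run
        else:
            run_len = 0                       # a blank line breaks any run
        if run_len > best_len:
            best_start, best_len = run_start, run_len
        previous_last = edges[1] if edges else None
    # A chain needs at least two linked lines.
    if best_len < 2:
        return []
    return lines[best_start:best_start + best_len]
```

Each line is examined once, so the running time grows linearly with the number of lines.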

Answer 2: (score: 0)

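I'd try splitting this job into two phases: first finding the chains, then comparing them. That will simplify the code a lot. Since the chains will be a small subset of all the lines in the file, finding them first and then sorting them will be quicker than trying to process the whole thing in one big pass.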

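The first part of the problem is a lot easier if you use the Python yield keyword, which is similar to return but doesn't end the function. This lets you loop over your content one line at a time and process it in small bites, without needing to hold the whole thing in memory at all times.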
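As a quick illustration of how `yield` differs from `return`, here is a tiny generator (the name `numbered_lines` and the sample input are made up for this sketch). The function hands back one value, pauses, and picks up where it left off when the next value is requested:

```python
def numbered_lines(lines):
    """Yield (line_number, stripped_line) pairs one at a time."""
    for number, line in enumerate(lines, start=1):
        # Execution pauses here and resumes on the next request for a value.
        yield number, line.strip()

generator = numbered_lines(["first line\n", "second line\n"])
print(next(generator))  # (1, 'first line')
print(next(generator))  # (2, 'second line')
```

A `for` loop over `numbered_lines(...)` would consume the values the same way, one at a time.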

Here's a basic way to go through a file one line at a time. It uses yield to hand out the chains as it finds them:

def get_chains(*lines):
    # these hold the last token and the
    # members of this chain
    previous = None
    accum = []

    # walk through the lines,
    # seeing if they can be added to the existing chain in `accum`
    for each_line in lines:
        # split the line into words, ignoring case & whitespace at the ends
        pieces = each_line.lower().split()
        if pieces and pieces[0] == previous:
            # match? add to accum
            accum.append(each_line)
        else:
            # no match? yield our chain
            # if it links at least two lines
            if len(accum) > 1:
                yield accum
            # this line may start the next chain (a blank line starts nothing)
            accum = [each_line] if pieces else []
        # update our idea of the last, and try the next line
        previous = pieces[-1] if pieces else None

    # at the end of the file we need to kick out anything
    # still in the accumulator
    if len(accum) > 1:
        yield accum

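When you feed this function a bunch of lines, it will yield chains as it finds them and then carry on. Whoever calls the function can capture the yielded chains and do things with them.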

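Once you've got the chains, it's easy to sort them by length and pick the longest one. Since Python has built-in list sorting, just collect a list of (length, lines) pairs and sort it. The longest chain will be the last item: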

def longest_chain(filename):

    with open(filename, 'rt') as file_handle:
        # if you loop over an open file, you'll get
        # back the lines in the file one at a time
        incoming_chains = get_chains(*file_handle)

        # collect the results into a list, keyed by lengths
        all_chains = [(len(chain), chain) for chain in incoming_chains]
        if all_chains:
            all_chains.sort()
            length, lines = all_chains[-1]
            # found the longest chain; each line still carries its own
            # trailing newline, so strip it before joining
            return "\n".join(line.rstrip("\n") for line in lines)
        else:
            # for some reason there are no chains of connected lines
            return ""
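Since only the single longest chain is needed, a full sort is not strictly required. As a small sketch, assuming the chains arrive as an iterable of lists (the name `pick_longest` is made up for illustration), the built-in `max` with `key=len` makes one pass over them:

```python
def pick_longest(chains):
    """Return the longest chain from an iterable of lists, or [] if there are none."""
    # max() walks the chains once; `default` covers the empty case (Python 3.4+).
    return max(chains, key=len, default=[])
```

This also works directly on a generator, so the chains never need to be collected into a list first.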