Question

I am running the following code and I get a 'killed' message from Python:

import random, string, sys

def rotations(t):
        ''' Return list of rotations of input string t '''
        tt = t * 2
        return [ tt[i:i+len(t)] for i in xrange(0, len(t)) ]
def bwtViaBwm(t):
        return ''.join(map(lambda x: x[-1], bwm(t)))
def bwm(t):
        return sorted(rotations(t))

def build_FM(fname):
        stream=readfile(fname)
        fc=[x[0] for x in bwtViaBwm(stream)]


def readfile(sd):
    s=""
    with open(sd,'r') as myfile:
        s =myfile.read()
    return s.rstrip('\n')

def writefile(sd,N):
        with open(sd, "wb") as sink:
            sink.write(''.join(random.choice(string.ascii_uppercase + string.digits) for _ in xrange(int(N))))
            sink.write('$')
        return
def main():
    fname= sys.argv[1]
    N =sys.argv[2]
    writefile(fname,N)
    build_FM(fname)
    return

if __name__=='__main__':
        main()

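The script takes a file name and a number as input. It creates a random stream of size N and then runs the BWT transform on that stream. When I pass N=500000 I get a 'killed' message, which seems like a small number for a memory error. My system runs Ubuntu 14.04 with 8 GB of RAM and Python 2.7.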

This is how I run the script:

python  fm.py new_file.csv 500000

After a few seconds I get this:

killed

Answer 1

The problem is your rotations function:

def rotations(t):
    ''' Return list of rotations of input string t '''
    tt = t * 2
    return [ tt[i:i+len(t)] for i in xrange(0, len(t)) ]

Look at what it does:

>>> rotations('x')
['x']
>>> rotations('xx')
['xx', 'xx']
>>> rotations('xxxxx')
['xxxxx', 'xxxxx', 'xxxxx', 'xxxxx', 'xxxxx']

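The size of this result grows quadratically with the input. A 500000-character file therefore produces 500000 rotations of 500000 characters each, or 500000^2 characters in total.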

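Computationally, there is unlikely to be a way to do what you are attempting with large inputs, because it materializes every rotation of a string that is 500k characters long. We know there is one output for each element of the input, and each output has the length of the original input. The minimum size is therefore n*n, or n^2. Unless you know you only need a limited number of the rotations (and can cull the rest ahead of time), you will always hit this problem.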

How to solve the problem

First we need to pin down what the problem is. Let's look at what the code is doing. Assume a simple starting set:

BACB

rotations() gives all possible rotations of that set:

>>> rotations('bacb')
['bacb', 'acbb', 'cbba', 'bbac']

You then sort this list.

>>> sorted(rotations('bacb'))
['acbb', 'bacb', 'bbac', 'cbba']

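You then take the last character of each rotation, producing bbca. In other words, for each element n of the input you are assigning a sort position such that the rotation n+1 ... n (wrapping around) ends up in alphabetical order.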
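To make the three steps concrete (rotate, sort, take the last character), here is a minimal sketch of the same naive transform. The name bwt_naive is my own label, not something from the question, and it still builds every rotation, so it is only meant for short inputs.

```python
def bwt_naive(t):
    """Naive Burrows-Wheeler transform: build all rotations, sort, take last chars."""
    doubled = t * 2
    # One full-length copy per rotation: this is the n * n memory cost.
    rotations = [doubled[i:i + len(t)] for i in range(len(t))]
    return "".join(rotation[-1] for rotation in sorted(rotations))


print(bwt_naive("bacb"))  # bbca
```

The sorted rotations of 'bacb' are acbb, bacb, bbac, cbba, and their last characters read bbca.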

To solve this, the algorithm would be:

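Create an empty list 'final_order', which will hold the 'sorted' indices of the input list.
For each element:
- Get the rotation that starts at that element plus one.
- Place the rotation into the 'final_order' list in an orderly way:
- Get the rotation for the first element of the 'final_order' list.
- Compare the two rotations.
- If the new rotation is less than the old rotation, insert it into the list at that point. Otherwise move on to the next rotation.
- If there are no more rotations to compare against, put the new rotation at the end.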

(There may be a faster way to sort, but I will use this one because it is easy to explain.)

The first thing we need is get_rotation(input, idx):

def get_rotation(input, idx):
    return input[idx + 1:] + input[:idx + 1]

Now the hard part (see the comments):

def strange_sort(input):
    sorted_indices = list()  # Initialize the list

    for idx in range(len(input)):  # For each element in the list
        new_rotation = get_rotation(input, idx)  # Get the rotation starting at that index
        found_location = False  # Need this to handle the sorting
        for sorted_idx in range(len(sorted_indices)):  # Iterate through all 'found' indices
            old_rotation = get_rotation(input, sorted_indices[sorted_idx])  # Get the rotation starting at the found/old rotation
            if new_rotation < old_rotation:  # Which comes first?
                # If this one, insert the new rotation's starting index before the index of the already sorted rotation
                sorted_indices.insert(sorted_idx, idx)
                found_location = True
                break
        if not found_location:  # If greater than everything, insert at end
            sorted_indices.insert(len(sorted_indices), idx)
    return "".join(map(lambda x: input[x], sorted_indices))  # Join and return result

Running this gives the expected result on a short input:

>>> print("Final result={}".format(strange_sort('bacb')))
Final result=bbca

Here is the full program with a test/timer:

import random, string, datetime

def get_rotation(input, idx):
    return input[idx + 1:] + input[:idx + 1]

def strange_sort(input):
    sorted_indices = list()

    for idx in range(len(input)):
        new_rotation = get_rotation(input, idx)
        found_location = False
        for sorted_idx in range(len(sorted_indices)):
            old_rotation = get_rotation(input, sorted_indices[sorted_idx])
            if new_rotation < old_rotation:
                sorted_indices.insert(sorted_idx, idx)
                found_location = True
                break
        if not found_location:
            sorted_indices.insert(len(sorted_indices), idx)
    return "".join(map(lambda x: input[x], sorted_indices))

n1 = 5
n2 = 50
n3 = 500
n4 = 5000
n5 = 50000
n6 = 500000

n = [n1, n2, n3, n4, n5, n6]

def test(lst):
    for l in range(len(lst)):
        input = ''.join(random.choice(string.ascii_uppercase+string.digits) for x in range(lst[l]))
        start = datetime.datetime.now()
        result = strange_sort(input)
        end = datetime.datetime.now()
        runtime = end - start
        print("n{} runtime={} head={} tail={}".format(l, runtime.seconds, result[:5], result[-5:]))

test(n)

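This tries to exploit the fact that we do not need to store everything, only the final sorted position of each index of the initial ordering. Sadly, the implementation above is clearly too slow, as we can see from running it: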

$ python2 strange_sort.py
n0 runtime=0 head=SJP29 tail=SJP29
n1 runtime=0 head=5KXB4 tail=59WAK
n2 runtime=0 head=JWO54 tail=7PH60
n3 runtime=4 head=Y2X2O tail=MFUGK
(Still running)

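OK, so we know that is terrible. Can we speed it up? We see from the Python Wiki entry on Big-O that a string slice takes O(M). For us that means O(N), because we are taking two slices that add up to the full length. This is a computational disaster, because we do it every time.

Instead of building the full rotation each time, iterate and compare. A single comparison of one index of one rotation with one index of another rotation should be O(1). In the worst case we have to do this O(N) times, but that is unlikely to happen on every comparison.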

We add an extra for loop and rewrite the comparison to look only at the next index:

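for offset in range(len(input)):
    new_char = new_rotation[offset]
    # get_rotation(input, i) starts at index i + 1, so the stored rotation must start there too
    old_char = input[(sorted_indices[sorted_idx] + 1 + offset) % len(input)]
    if new_char < old_char:
        sorted_indices.insert(sorted_idx, idx)
        found_location = True
        break
    elif new_char > old_char:
        break  # the new rotation sorts after this one; try the next sorted index
if found_location:
    break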

We now run it with our timer:

$ python2 strange_sort.py
n0 runtime=0 head=VA6KY tail=VA6KY
n1 runtime=0 head=YZ39U tail=63V0O
n2 runtime=0 head=JFYKP tail=8EB2S
n3 runtime=0 head=IR4J9 tail=VLR4Z
n4 runtime=28 head=EYKVG tail=7Q3NM
n5 runtime=4372 head=JX4KS tail=6GZ6K

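As we can see, this time we reached n4 in only 28 seconds. That does not bode well for n6, though. Alas, the computational complexity suggests we need a better sorting method than insertion sort, which is O(n^2) in the worst (and even the average) case. With an input of 500K, that is at least 250B (billion) computations, times the number of actual instructions the computer executes per computation.

What we have learned is that you do not actually need to write the rotations out. To solve this you would have to write a quick sort algorithm that takes as input not the actual values, but a function that can compute the value to a given precision.

Turning the whole thing on its head, I thought of trying to create an object that can search far enough to know how it sorts against another object, and then use the built-in Python sort.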

import random, string, datetime
from functools import total_ordering


@total_ordering
class Rotation(object):
    """Describes a rotation of an input based on getting the original and then offsetting it."""

    def __init__(self, original, idx):
        self.original = original
        self.idx = idx

    def getOffset(self, offset):
        return self.original[(self.idx + offset) % len(self.original)]

    def __eq__(self, other):
        print("checking equality")
        if self.idx == other.idx:
            return True
        for offset in range(len(self.original)):
            if self.getOffset(offset) != other.getOffset(offset):  # compare values, not identity
                print("this={} is not that={}".format(self.getOffset(offset), other.getOffset(
                        offset)))
                return False
        return True

    def __lt__(self, other):
        for offset in range(len(self.original)):
            if self.getOffset(offset) < other.getOffset(offset):
                return True
            elif self.getOffset(offset) > other.getOffset(offset):
                return False
        return False

    def __str__(self):
        return self.getOffset(-1)

    def __repr__(self):
        return "".join(map(lambda x: str(x), [self.getOffset(idx) for idx in range(len(
                self.original))]))


def improved_strange_sort(input):
    original = list(input)
    rotations = [Rotation(original, idx) for idx in range(len(original))]
    result = sorted(rotations)
    # print("original={} rotations={} result={}".format(original, rotations, result))
    return "".join(map(lambda x: str(x), result))


def test(input):
    start = datetime.datetime.now()
    result = improved_strange_sort(input)
    end = datetime.datetime.now()
    runtime = end - start
    print("input={} runtime={} head={} tail={}".format(input[:5], runtime.seconds, result[:5],
                                                       result[-5:]))


def timed_test(lst):
    for l in range(len(lst)):
        print("Test {} with length={}".format(l, lst[l]))
        test(''.join(random.choice(string.ascii_uppercase + string.digits) for x in range(lst[l])))


n1 = 5
n2 = 50
n3 = 500
n4 = 5000
n5 = 50000
n6 = 500000

n = [n1, n2, n3, n4, n5, n6]

test('bacb')
timed_test(n)

This seems to produce the correct result:

$ python2 strange_sort.py 
input=bacb runtime=0 head=bbca tail=bbca
Test 0 with length=5
input=FB2EH runtime=0 head=BF2HE tail=BF2HE
Test 1 with length=50
input=JT3ZP runtime=0 head=W8XQE tail=QRUC3
Test 2 with length=500
input=TL8L7 runtime=0 head=R4ZUG tail=M268H
Test 3 with length=5000
input=PYFED runtime=1 head=L5J0T tail=HBSMV
Test 4 with length=50000
input=C6TR8 runtime=254 head=74IIZ tail=U69JG
Test 5 with length=500000
(still running)
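If the comparison-object approach is still too slow, one option is to swap in a different technique altogether: prefix doubling, which sorts rotation start indices by their first 1, 2, 4, ... characters and never builds a rotation. What follows is a minimal sketch under that assumption, not the method used above. The names sort_rotations and bwt_prefix_doubling are my own, and I have not timed it on the 500,000-character input.

```python
def sort_rotations(t):
    """Return the start indices of the rotations of t in sorted order."""
    n = len(t)
    if n == 0:
        return []
    rank = [ord(c) for c in t]      # rank by the first character
    order = list(range(n))
    k = 1                           # number of characters the ranks currently cover
    while True:
        # Two ranks of width k describe the first 2 * k characters of a rotation.
        key = lambda i: (rank[i], rank[(i + k) % n])
        order.sort(key=key)
        new_rank = [0] * n
        for prev, cur in zip(order, order[1:]):
            new_rank[cur] = new_rank[prev] + (key(cur) != key(prev))
        rank = new_rank
        k *= 2
        # Stop once whole rotations were compared or all ranks are distinct.
        if k >= n or rank[order[-1]] == n - 1:
            return order


def bwt_prefix_doubling(t):
    # The rotation starting at i ends with the character just before i.
    return "".join(t[i - 1] for i in sort_rotations(t))


print(bwt_prefix_doubling("bacb"))  # bbca
```

Each round is one call to the built-in sort over integer pairs, and the number of rounds is logarithmic in the input length, so memory stays at a few lists of n integers instead of n strings of length n.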

Answer 2

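I did some experiments, and the problem lies in rotations(t).

The first issue is that you double the size of the input string, which starts at 500,000 characters and becomes 1,000,000. That is still affordable, though: we are talking about roughly 1.5 megabytes of memory.

But after that you create a list of 500,000 strings, each 500,000 characters long, which amounts to about 232 GB of memory that has to be held just so that the next computation step can happen.

That is obviously not possible, since none of us has that much memory, so your program gets killed.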
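The 232 GB figure can be checked with a couple of lines of arithmetic. This is a rough sketch that assumes one byte per character and ignores Python's per-object overhead, so the real footprint would be somewhat larger; rotation_list_bytes is just an illustrative name.

```python
def rotation_list_bytes(n):
    """Payload bytes for n rotations of n one-byte characters each."""
    return n * n


total = rotation_list_bytes(500000)
print(total)                       # 250000000000
print(round(total / 2.0 ** 30, 1)) # 232.8
```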

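You asked whether it is possible to optimize this code. I take that to mean: is it possible to employ less memory?

Let's say you are willing to trade computation time for lower memory consumption. Then you can write a version of the algorithm that does not need this much memory. For example: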

def bwtManual(t):
    tt = 2 * t
    res_str = ''
    old_min = None
    for j in xrange(0, len(t)):
        cur_min = None
        print("Round: " + str(j))
        for i in xrange(0, len(t)):
            # generate 1 string at a time
            tmp_str = tt[i:i+len(t)]
            # select an initial minimum string
            # > must not be smaller than previous minimum
            if cur_min is None:
                if old_min is not None:
                    if tmp_str > old_min:
                        cur_min = tmp_str
                    else:
                        continue
                else:
                    cur_min = tmp_str
                continue
            # skip strings that have been already selected
            if old_min is not None and tmp_str <= old_min:
                continue
            # select new minimum among remaining strings
            if (tmp_str < cur_min):
                cur_min = tmp_str
        # store character
        res_str += cur_min[-1]
        old_min = cur_min
    return res_str

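On small sizes there is no problem, it is just a bit slow.

On 500,000 characters? On my machine, which has average computing power, it would take 115 days.

Conclusion:

There is really no reason for the strings generated by rotations(t) to exist in their own right. They are only used so that we can call sort() and then extract the last character of each string.

Is it possible to do better than this? I think so.

The idea is to design our own sorting function that uses references to substrings of tt rather than copies of them. Each rotation then needs only a couple of pointers instead of a full copy of the original string.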
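In pure Python, a reference to a substring of tt can be approximated by its start index. Here is a minimal sketch of that idea, assuming functools.cmp_to_key is available (Python 2.7 and 3.x); bwt_by_index and compare are names I made up for illustration. It keeps memory linear, but each comparison can still walk up to len(t) characters, so I would not expect it to be fast on very large inputs.

```python
from functools import cmp_to_key


def bwt_by_index(t):
    """BWT that sorts rotation start indices instead of rotation copies."""
    n = len(t)
    tt = t * 2  # the only extra copy: rotation i is tt[i:i + n], never sliced out

    def compare(i, j):
        # Walk both rotations in place and stop at the first difference.
        for offset in range(n):
            a, b = tt[i + offset], tt[j + offset]
            if a != b:
                return -1 if a < b else 1
        return 0

    order = sorted(range(n), key=cmp_to_key(compare))
    # The rotation starting at i ends with the character just before i.
    return "".join(t[i - 1] for i in order)


print(bwt_by_index("bacb"))  # bbca
```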

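I looked for hints in Python and found the memoryview and buffer objects, which looked promising. However, these wrappers apparently do not implement the comparison operators natively, and instead require you to extract (a copy of) the string they point to. That would defeat the whole purpose of using them in your context, so they are probably not of much use. You can look them up and decide for yourself.

I think it would be easier to design a C++ module that sorts abstract nodes referencing substrings of the original string, and then returns the final string built with something like the map() code. You could then hook this module up to your Python code, or just write the rest in C++ as well.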
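As a small illustration of that limitation, here is what I would expect on Python 3, where memoryview still exists but buffer does not. This is a sketch of the behavior as I understand it, so check it against your own interpreter, in particular on Python 2.7.

```python
data = b"bacb"
view_a = memoryview(data)[1:]  # refers to b"acb" without copying
view_b = memoryview(data)[2:]  # refers to b"cb" without copying

print(view_a == view_b)        # equality works on views: False

try:
    view_a < view_b            # ordering is not defined for memoryview
except TypeError as exc:
    print("ordering not supported:", exc)

# Ordering needs real bytes objects, which means copying the data.
print(view_a.tobytes() < view_b.tobytes())  # True
```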

Python process gets killed
