Question

I'm trying to write what I would think of as an extremely simple piece of code in Rascal: testing if list A contains list B.

Starting out with some very basic code to create a list of strings:

public list[str] makeStringList(int Start, int End)
{
    return [ "some string with number <i>" | i <- [Start..End]];
}

public list[str] toTest = makeStringList(0, 200000);

My first try was 'inspired' by the sorting example in the tutor:

public void findClone(list[str] In, str S1, str S2, str S3, str S4, str S5, str S6)
{
    switch(In)
    {
        case [*str head, str i1, str i2, str i3, str i4, str i5, str i6, *str tail]:
        {
            if(S1 == i1 && S2 == i2 && S3 == i3 && S4 == i4 && S5 == i5 && S6 == i6)
            {
                println("found duplicate\n\t<i1>\n\t<i2>\n\t<i3>\n\t<i4>\n\t<i5>\n\t<i6>");
            }
            fail;
        }
        default:
            return;
    }
}

Not very pretty, but I expected it to work. Unfortunately, the code runs for about 30 seconds before crashing with an "out of memory" error.

I then tried a better looking alternative:

public void findClone2(list[str] In, list[str] whatWeSearchFor)
{
    for ([*str head, *str mid, *str end] := In)
        if (mid == whatWeSearchFor)
            println("gotcha");
}

with approximately the same result (it seems to run a little longer before running out of memory).

Finally, I tried a 'good old' C-style approach with a for-loop:

public void findClone3(list[str] In, list[str] whatWeSearchFor)
{
    cloneLength = size(whatWeSearchFor);
    inputLength = size(In);

    // Nothing to find if the input is shorter than the clone.
    if(inputLength < cloneLength) return;

    loopLength = inputLength - cloneLength + 1;

    for(int i <- [0..loopLength])
    {
        isAClone = true;
        for(int j <- [0..cloneLength])
        {
            if(In[i+j] != whatWeSearchFor[j])
                isAClone = false;
        }

        if(isAClone) println("Found clone <whatWeSearchFor> on lines <i> through <i+cloneLength-1>");
    }
}

To my surprise, this one works like a charm. No out of memory, and results in seconds.

I get that my first two attempts probably create a lot of temporary string objects that all have to be garbage collected, but I can't believe that the only solution that worked really is the best solution. Any pointers would be greatly appreciated.

My relevant eclipse.ini settings are:

-XX:MaxPermSize=512m
-Xms512m
-Xss64m
-Xmx1G

Answer 1

We'll need to look to see why this is happening. Note that, if you want to use pattern matching, this is maybe a better way to write it:

public void findClone(list[str] In, str S1, str S2, str S3, str S4, str S5, str S6)
{
    switch(In)
    {
        case [*str head, S1, S2, S3, S4, S5, S6, *str tail]:
        {
            println("found duplicate\n\t<S1>\n\t<S2>\n\t<S3>\n\t<S4>\n\t<S5>\n\t<S6>");
        }
        default:
            return;
    }
}

If you do this, you are taking advantage of Rascal's matcher to actually find the matching strings directly, versus your first example in which any string would match but then you needed to use a number of separate comparisons to see if the match represented the combination you were looking for. If I run this on 110145 through 110150 it takes a while but works, and it doesn't seem to grow beyond the heap space you allocated to it.

Also, is there a reason you are using fail? Is this to continue searching?

Answer 2

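It's an algorithmic issue, like Mark Hills said. In Rascal, some short code can still entail a lot of nested loops, almost implicitly. Basically, every * splice operator on a fresh variable that you use on the pattern side in a list generates one level of loop nesting, except for the last one, which is just the rest of the list.

In your code for findClone2 you first generate all combinations of sublists and then filter them using the if construct. That is a correct algorithm, but probably a slow one. This is your code: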

void findClone2(list[str] In, list[str] whatWeSearchFor)
{
    for ([*str head, *str mid, *str end] := In)
        if (mid == whatWeSearchFor)
            println("gotcha");
}

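You can see how it has a nested loop over In, because it has two effective * operators in the pattern. The code therefore runs in O(n^2), where n is the length of In. In other words, its running time is quadratic in the size of the In list. In is a big list, so this matters.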
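To make the hidden nesting concrete, here is a minimal sketch of the explicit loops that the two fresh * variables roughly correspond to. The function name countSplits is made up for illustration and is not part of Rascal's library, and the sketch assumes ranges that exclude their upper bound, as findClone3 in the question relies on. It only counts the candidate (head, mid, end) splits; it is not a description of how the matcher is implemented.

```rascal
import List;

// Hypothetical helper: counts how many ways a list can be split into
// three consecutive parts (head, mid, end), which is what a pattern
// with two fresh * variables has to consider.
int countSplits(list[str] In) {
    int n = size(In);
    int count = 0;
    for (int i <- [0..n + 1]) {       // i = length of head
        for (int j <- [i..n + 1]) {   // j = index where mid stops
            count += 1;               // one candidate (head, mid, end)
        }
    }
    return count;
}
```

For a list of n elements this returns (n+1)(n+2)/2, for example 10 for a three-element list. For a list of about 200,000 elements that is roughly 2×10^10 candidates, each of which findClone2 then compares against whatWeSearchFor.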

In the following new code we filter first, while generating answers, and use fewer lines of code:

public void findCloneLinear(list[str] In, list[str] whatWeSearchFor)
{
    for ([*str head, *whatWeSearchFor, *str end] := In)
        println("gotcha");
}

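The second * operator does not generate a new loop, because its variable is not fresh. It just "pastes" the given list values into the pattern. So now there is actually only one effective * that generates a loop, namely the first one, on head. This one makes the algorithm loop over the list. The second * tests whether the elements of whatWeSearchFor occur right after head (this is linear in the size of whatWeSearchFor), and the final * simply matches the rest of the list, allowing more elements to follow.

It's also nice to know where the clone is sometimes: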
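As a rough illustration of what that single remaining loop amounts to, the sketch below walks over every possible head length and compares one slice of In against whatWeSearchFor. The name findCloneOffsets is hypothetical, and the sketch assumes that list slicing (In[i..i+m]) is available in your Rascal version; it mirrors the cost argument above and makes no claim about the matcher's internals.

```rascal
import List;

// Hypothetical helper: returns every offset at which whatWeSearchFor
// occurs in In. One loop over the offsets, and one comparison per
// offset that is linear in the size of whatWeSearchFor.
list[int] findCloneOffsets(list[str] In, list[str] whatWeSearchFor) {
    int n = size(In);
    int m = size(whatWeSearchFor);
    if (m > n) {
        return [];   // the clone cannot fit in the input
    }
    return [ i | int i <- [0..n - m + 1], In[i..i + m] == whatWeSearchFor ];
}
```

For example, findCloneOffsets(["a", "b", "c", "b", "c"], ["b", "c"]) should yield [1, 3]. The total work is on the order of n times the length of whatWeSearchFor, which is the same shape as the C-style findClone3 from the question.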

public void findCloneLinear(list[str] In, list[str] whatWeSearchFor)
{
    for ([*head, *whatWeSearchFor, *_] := In)
        println("gotcha at <size(head)>");
}

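Rascal does not have an optimizing compiler (yet) that might internally transform your algorithms into equivalent optimized ones. So, as a Rascal programmer, you are still expected to know the effect of loops on the complexity of your algorithms, and to know that * is a very short notation for a loop.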

Why does this Rascal pattern matching code use so much memory and time?
