Question

I have a 10,000-line file containing a Perl variable. This variable defines applications and their dependencies. Here is what the file looks like:

'im-an-app' =>
{    
    do-this =>
    {    
        needs => [ 'ruby', 'jboss', 'jquery' ],
        process =>
        [    
            { say => 'hi' },
            { speak => 'loudly' },
            { read => 'qucikly' },
        ],
    },
},

'im-an-app2' =>
{
    do-this =>
    {
        needs => [ 'ruby' ], # there is a comment here
    },
},

'im-an-app3' =>
{
    needs =>
    {
        requires => [ 'ruby', 'jboss', 'jquery', 'sass' ],
        process =>
        [
            { say => 'hi' },
            { speak => 'loudly' },
            { read => 'quickly' },
        ],
    },
},

I have a list of applications that I want to remove from the file. The list is in a separate list.txt file, like this:

im-an-app1
im-an-app3
im-an-app16
im-an-app17
im-an-app29

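The applications all have different names (I'm using placeholders here). There are about 500 of them that I need to iterate over, match, and remove from my application file.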

I loaded the file in IRB, and when I read it, I get it in the following format:

example\n\t##\n\n\t'im-an-app' =>\n\t{\n\t\tdo-this =>\n\t\t{\n\t\t\tneeds => ['ham-and-cheese-sandwich'],\n\t\t},\n\t},\n\n\t'im-the-next-app' =>\n\t{\n\t\tneeds =>\n\t\t{\n\t\t\t# im a comment about this app\n\t\t\t# im another comment\n\t\t\tneeds => ['backlava', 'cand-corns', 'popscicles', 'yum-yum-bars', 'bomb-sauce', 'corndogs'],\n\t\t\tdo-this =>\n\t\t\t[\n\t\t\t\t{say => 'hi'},\n\t\t\t\t{say => 'bye'},\n\t\t\t\t{yell => 'i-love-gold'},\n\t\t\t],\n\t\t},\n\t},\n\n\t'im-the-third-app' =>\n\t{\n\t\tdothis =>\n\t\t{\n\t\t\tneeds => ['junk', 'jazz', 'json', 'jiffylube'],\n\t\t\tprocess =>\n\t\t\t[\n\t\t\t\t{say => 'hi'},\n\t\t\t\t{say => 'bye'},\n\t\t\t\t{say => 'goonies'},\n\t\t\t],\n\t\t},\n\t},\n\n\t'im-yet-anotherapp'

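I noticed that the only consistent delimiter is the \n\n\t that appears only before a new application definition. I want to read the file, search it, and remove every application on the list together with everything that follows its name, including the \n\n\t.

I'm using Ruby and IRB to do this, but I'm open to other approaches.

Thank you all!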

Answer 1

If you want Python, this might be a start (untested, so it may have bugs):

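import re

# Split the file into one entry per application on the "\n\n\t" separator.
with open('yourfilename', 'r') as f:
    data = f.read().split('\n\n\t')

# Keep only the entries that do not match the pattern.
# (A 'del entry' inside a for loop only unbinds the loop variable;
# it does not remove anything from the list.)
pattern = re.compile('yourpattern')
kept = [entry for entry in data if not pattern.search(entry)]

# Save the results to another file.
with open('outputfile', 'wt') as fout:
    fout.write('\n\n\t'.join(kept))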
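
The split-and-filter idea above can be taken one step further by matching on the application name instead of a placeholder pattern. The following is only a sketch: the function name `remove_apps` and its parameters are hypothetical, and it assumes that every entry begins with its quoted name followed by `=>`, as in the sample data.

```python
import re

def remove_apps(text, names, sep="\n\n\t"):
    """Return text without the entries whose quoted name is in names."""
    kept = []
    for chunk in text.split(sep):
        # Assumption: an entry starts with its quoted name, e.g. 'name' =>
        m = re.match(r"\s*'([^']+)'\s*=>", chunk)
        if m and m.group(1) in names:
            continue  # drop this entry together with its separator
        kept.append(chunk)
    return sep.join(kept)


# Small in-memory sample in the same shape as the string read in IRB.
sample = (
    "##\n\n\t'im-an-app1' =>\n\t{\n\t\tneeds => [ 'ruby' ],\n\t},"
    "\n\n\t'im-an-app2' =>\n\t{\n\t\tneeds => [ 'sass' ],\n\t},"
)
result = remove_apps(sample, {"im-an-app1"})
print(result)
```

Chunks that do not start with a quoted name, such as a header before the first entry, are kept unchanged.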

Answer 2

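(Edit: updated for the new sample data)

The awk command below loads the application list into an array, adding surrounding single quotes so the names match the application file. Then, for the app file, it changes the record separator to one or more blank lines (RS=""). For each record, it prints only those that are not in the list: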

$ awk -v ORS="\n\n" -v q="'" 'NR==FNR{a[q $1 q];next} !($1 in a)' app-list.txt RS="" apps.txt

Explanation

-v ORS="\n\n"

This sets the output record separator so that the extra blank line between app records is preserved when they are written out.

-v q="'"

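This is a convenient way to use a literal single quote in the text of a one-liner. Because the one-liner itself is surrounded by single quotes, doing so would otherwise be a pain.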

NR==FNR{a[q $1 q];next}

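While NR==FNR, we are reading the first file, the app list (go to http://backreference.org/2010/02/10/idiomatic-awk/ and search for "Two-file processing"). For each app in the list, we surround its name with single quotes and add it as a key to array a.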

!($1 in a)

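Once we get here, we know we are reading the app file (again, see the link above). In this file, each app block is treated as a single record (see RS="" below). $1 is the name of the app, in quotes. We check whether the name is in array a; if it is not, we perform the default action, which is simply to print the record.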

app-list.txt RS="" apps.txt

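These are the files to process. Awk allows you to change the record separator RS between files. For the app list, the default is fine, but for the apps themselves we set the record separator to the empty string. As the documentation says, "By a special dispensation, an empty string as the value of RS indicates that records are separated by one or more blank lines", which is very handy for this application.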
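
To make the record and field behavior described above more concrete, here is a rough Python imitation of what awk does with RS="" and $1. It is a sketch, not a reimplementation of awk: the helper names `paragraph_records` and `first_field` are hypothetical, and only the behavior relevant here (records separated by blank lines, whitespace-separated fields) is modeled.

```python
import re

def paragraph_records(text):
    """Split text into records separated by one or more blank lines."""
    # Leading and trailing newlines are ignored, as in awk's paragraph mode.
    return [r for r in re.split(r"\n\n+", text.strip("\n")) if r]

def first_field(record):
    """Return the first whitespace-delimited field, like awk's $1."""
    fields = record.split()
    return fields[0] if fields else ""


apps = "'im-an-app1' =>\n{\n},\n\n'im-an-app2' =>\n{\n},\n\n\n"

# The first field still carries its quotes, which is why the awk command
# wraps each listed name in single quotes (q $1 q) before storing it.
listed = {"'" + name + "'" for name in ["im-an-app1", "im-an-app3"]}

kept = [r for r in paragraph_records(apps) if first_field(r) not in listed]
print("\n\n".join(kept))
```

Joining the kept records with two newlines plays the role of ORS="\n\n" in the awk command.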

Demonstration:

$ cat app-list.txt
im-an-app1
im-an-app3
im-an-app16
im-an-app17
im-an-app29


$ cat apps.txt
'im-an-app1' =>
{
    do-this =>
    {
        needs => [ 'ruby', 'jboss', 'jquery' ],
        process =>
        [
            { say => 'hi' },
            { speak => 'loudly' },
            { read => 'qucikly' },
        ],
    },
},

'im-an-app2' =>
{
    do-this =>
    {
        needs => [ 'ruby' ], # there is a comment here
    },
},

'im-an-app3' =>
{
    needs =>
    {
        requires => [ 'ruby', 'jboss', 'jquery', 'sass' ],
        process =>
        [
            { say => 'hi' },
            { speak => 'loudly' },
            { read => 'quickly' },
        ],
    },
},


$ awk -v ORS="\n\n" -v q="'" 'NR==FNR{a[q $1 q];next} !($1 in a)' app-list.txt RS="" apps.txt

'im-an-app2' =>
{
    do-this =>
    {
        needs => [ 'ruby' ], # there is a comment here
    },
},

Answer 3

This can be done in Python as follows:

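import re

# Names of the applications to delete (placeholders).
remove = {'im-an-ap', 'im-an-ap-5', 'im-an-ap-10'}

def replace(re_app):
    # Drop the whole entry if its name is listed; otherwise keep it unchanged.
    if re_app.group(1) in remove:
        return ""
    return re_app.group(0)

# One entry: optional indentation, the quoted name, "=>", then everything up to
# and including the blank line that ends the entry (or the end of the file).
entry = re.compile(r"^[ \t]*'([^'\n]+)'[ \t]*=>.*?(?:\n\n|\Z)", re.S | re.M)

with open('input.txt') as f_input, open('output.txt', 'w') as f_output:
    f_output.write(entry.sub(replace, f_input.read()))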

This loads the file input.txt, removes all unwanted entries, and creates a new file named output.txt.
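
The set of names to remove is hard-coded above, while the question keeps about 500 names in list.txt. A minimal sketch for building that set from the file, assuming one name per line; the helper name `load_names` is hypothetical, and the demo uses a temporary file as a stand-in for the real list.txt:

```python
import tempfile
from pathlib import Path

def load_names(path):
    """Read one application name per line, skipping blank lines."""
    lines = Path(path).read_text().splitlines()
    return {line.strip() for line in lines if line.strip()}


# Demo with a temporary file standing in for list.txt.
with tempfile.TemporaryDirectory() as tmp:
    list_file = Path(tmp) / "list.txt"
    list_file.write_text("im-an-app1\nim-an-app3\n\nim-an-app16\n")
    remove = load_names(list_file)

print(sorted(remove))
```

With the real file, the call would be `remove = load_names('list.txt')`. A set keeps each membership test cheap, which helps with 500 names and a 10,000-line file.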

Parse and remove matching strings, including 3 escape characters
