Question

I am trying to extract field values from a text file formatted as follows:

{fieldvalue1} {fieldvalue2} {fieldvalue3}

However, a field value can itself contain subfields that are also delimited by curly braces, for example:

{abc} {xyz} {efg {123} {pqx}}

So for the example above, the desired output is:

* fieldvalue1 = abc
* fieldvalue2 = xyz
* fieldvalue3 = efg {123} {pqx}

I tried the following filter:

sed 's/^{//g;s/}$//g' | awk -F"} {"

However, this obviously fails to parse fieldvalue3 above correctly.
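
To make the failure concrete, here is a minimal Python sketch that mimics the filter above. The function name naive_split is hypothetical, and the sketch assumes the awk separator "} {" behaves as literal text. Splitting the sample line this way yields four pieces instead of three, because the "} {" inside the third field is also treated as a separator.

```python
import re


def naive_split(line):
    """Mimic the sed + awk filter: strip the outer braces, split on '} {'."""
    # Like sed 's/^{//;s/}$//': drop one leading "{" and one trailing "}".
    stripped = re.sub(r"\}$", "", re.sub(r"^\{", "", line))
    # Like awk -F"} {" (assumed literal): split on every "} {" occurrence,
    # including the one nested inside the third field.
    return stripped.split("} {")


if __name__ == "__main__":
    for piece in naive_split("{abc} {xyz} {efg {123} {pqx}}"):
        print(repr(piece))
```

The third field ends up cut into 'efg {123' and 'pqx}', so any approach has to track nesting rather than look for a fixed separator.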

Answer 1

You can brute-force it by counting characters:

import regex

def findall_over_file_with_caveats(pattern, file):
    # Caveats:
    # - doesn't support ^ or backreferences, and might not play well with
    #   advanced features I'm not aware of that regex provides and re doesn't.
    # - Doesn't do the careful handling that zero-width matches would need,
    #   so consider behavior undefined in case of zero-width matches.
    # - I have not bothered to implement findall's behavior of returning groups
    #   when the pattern has groups.
    # Unlike findall, produces an iterator instead of a list.

    # bytes window for bytes pattern, unicode window for unicode pattern
    # We assume the file provides data of the same type.
    window = pattern[:0]
    chunksize = 8192
    sentinel = object()

    last_chunk = False

    while not last_chunk:
        chunk = file.read(chunksize)
        if not chunk:
            last_chunk = True
        window += chunk

        match = sentinel
        for match in regex.finditer(pattern, window, partial=not last_chunk):
            if not match.partial:
                yield match.group()

        if match is sentinel or not match.partial:
            # No partial match at the end (maybe even no matches at all).
            # Discard the window. We don't need that data.
            # The only cases I can find where we do this are if the pattern
            # uses unsupported features or if we're on the last chunk, but
            # there might be some important case I haven't thought of.
            window = window[:0]
        else:
            # Partial match at the end.
            # Discard all data not involved in the match.
            window = window[match.start():]
            if match.start() == 0:
                # Our chunks are too small. Make them bigger.
                chunksize *= 2
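
The character-counting idea mentioned at the start of this answer can also be sketched as a simple brace-depth counter. This is an illustrative sketch only, not part of the original answer: the function name split_top_level_fields is hypothetical, and it assumes every field is wrapped in braces and that braces are never escaped.

```python
def split_top_level_fields(line):
    """Return the contents of each top-level {...} group in line."""
    fields = []
    depth = 0      # current brace nesting level
    start = None   # index just after the "{" that opened the current field
    for pos, ch in enumerate(line):
        if ch == "{":
            if depth == 0:
                start = pos + 1
            depth += 1
        elif ch == "}":
            if depth == 0:
                raise ValueError(f"unmatched closing brace at position {pos}")
            depth -= 1
            if depth == 0:
                # Back at the top level: the field is complete.
                fields.append(line[start:pos])
    if depth != 0:
        raise ValueError("unbalanced braces: missing closing brace")
    return fields


if __name__ == "__main__":
    sample = "{abc} {xyz} {efg {123} {pqx}}"
    for number, value in enumerate(split_top_level_fields(sample), start=1):
        print(f"* fieldvalue{number} = {value}")
```

Only the outermost pair of braces is stripped from each field, so nested subfields such as {123} stay intact inside fieldvalue3.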

Answer 2

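The input looks like a Tcl list of lists :) Tcl handles this nicely.

The example below reads the file in.txt line by line and prints the fields in the desired output format.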

#!/bin/sh
# the next line restarts using tclsh \
    exec tclsh "$0" "$@"

# open file in.txt
set fd [open in.txt]

# loop till end of file
while {![eof $fd]} {
    # read line
    set line [gets $fd]

    set i 0
    # iterate over all elements
    foreach elm $line {
        incr i
        puts "* fieldvalue$i = $elm"
    }
}
close $fd

Or a one-liner that processes a single line of data. expect is used here because it lets you pass Tcl commands on the command line:

 echo '{abc} {xyz} {efg {123} {pqx}}' | expect -c 'puts [join [lmap _ [gets stdin] {incr i; set _ "* fieldvalue$i = $_"}] \n]'

Answer 3

Another quick one:

#!/usr/bin/awk -f

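{
    for(i=1;i<=NF;i++)
    {
        # prepend any pending unbalanced text (e) to the current field
        $i = e (e?FS:"") $i

        # split() returns the number of pieces, i.e. number of braces + 1
        l = split($i,a,"{")
        r = split($i,a,"}")

        # equal counts: the braces balance, so the field is complete
        if(l == r)
        {
            print "* fieldvalue" ++c,$i
            e=""
        }
        else
            # unbalanced: carry it over and merge it with the next field
            e = $i

    }
}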

How to extract fields that can contain their own delimiters, in awk
