Question

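I am processing a fairly large collection of tweets, and for each tweet I want to get its mentions (other usernames, prefixed with @), provided the mentioned user also appears in the file: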

users = new Dictionary()
for each line in file:
   username = get_username(line)
   userid   = get_userid(line)
   users.add(key = username, value = userid)
for each line in file:
   mentioned_names = get_mentioned_names(line)
   mentioned_ids = mentioned_names.map(x => if x in users: users[x] else null)
   print "$line | $mentioned_ids"

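I was already processing the file with GAWK, so instead of processing it again in Python or C, I decided to try adding this to my AWK script. However, I cannot find a way to make two passes over the same file, executing different code in each pass. Most solutions imply calling AWK several times, but then I would lose the associative array built in the first pass.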

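I could do it in very hacky ways (such as cat-ing the file twice and piping each copy through sed to add a different prefix to all of its lines), but I would like to be able to understand this code in a few months without hating myself.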

What would be the AWK way to do this?

PS:

The least horrible way I have found:

function rewind(    i)
{
    # from https://www.gnu.org/software/gawk/manual/html_node/Rewind-Function.html
    # shift remaining arguments up
    for (i = ARGC; i > ARGIND; i--)
        ARGV[i] = ARGV[i-1]

    # make sure gawk knows to keep going
    ARGC++

    # make current file next to get done
    ARGV[ARGIND+1] = FILENAME

    # do it
    nextfile
}

BEGIN {
 count = 1;
}

count == 1 {
 # first pass, fills an associative array
}

count == 2 {
 # second pass, uses the array
}

FNR == 30 {
    # hand-coded file length, horrible
    # could also be automated by calling wc -l and passing the result as a parameter
    if (count == 1) {
        count = 2
        rewind()
    }
}

Answer 1

The idiomatic way to process two separate files, or the same file twice, in awk is like this:

awk 'NR==FNR{ 
    # fill associative array 
    next
}
{
    # use the array
}' file1 file2

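The total record number NR equals the record number within the current file, FNR, only while the first file is being read. next skips the second block for the first file. The second block is then executed for the second file. If file1 and file2 are the same file, awk passes through that file twice.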
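
Applied to the tweet problem, this could look roughly like the sketch below. The question does not show the input format, so the layout here is an assumption: tab-separated lines of userid, username, and tweet text. The function name two_pass_mentions is also made up for illustration. Unlike the pseudocode in the question, mentions of users that are not in the file are simply left out rather than printed as null.

```shell
# Assumed (hypothetical) input format: userid <TAB> username <TAB> tweet text
# usage: two_pass_mentions tweets.tsv
two_pass_mentions() {
    awk -F'\t' '
    NR == FNR {                 # first pass: build the map username -> userid
        ids[$2] = $1
        next
    }
    {                           # second pass: resolve @mentions in the tweet text
        out = ""
        # split the text into words, keeping only letters, digits, _ and @
        n = split($3, words, /[^A-Za-z0-9_@]+/)
        for (i = 1; i <= n; i++) {
            if (words[i] ~ /^@./) {
                name = substr(words[i], 2)      # drop the leading @
                if (name in ids)                # only users present in the file
                    out = out (out == "" ? "" : ",") ids[name]
            }
        }
        print $0 " | " out
    }' "$1" "$1"                # the same file is passed twice
}
```

One caveat of the NR==FNR test: if the first file is empty, the condition also holds while the second file is being read. That is harmless when the same file is passed twice.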
