Question

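I have an awk script that behaves differently depending on where I place the regex. As far as I can tell the program logic is the same in both cases, but the results are not. The script analyzes logs in which every transaction has a unique ID. The log looks like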

timestamp (ID) more info

For example:

2014-10-06 05:24:40,035 INFO  (4aaaaaaaaabbbbbbcccb) [somestring] body with real information and a key string that determines the type of transaction
2014-10-06 05:24:40,035 INFO  (4aaaaaaaaabbbbbbcccb) [somestring] body with other information
2014-10-06 05:24:40,035 INFO  (4aaaaaaaaabbbbbbcccb) [somestring] body with more information
2014-10-06 05:24:40,035 INFO  (4xxbbbbbbbbbbbbbcccb) [somestring] this is a different transaction

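What I want is to process all the log lines belonging to a certain type of transaction and see how long they took. Each transaction is spread over several log lines and is identified by its unique ID. To know whether a transaction is of the type I want, I have to search for a certain string in the first line of that transaction. The log may also contain lines that do not follow the format above.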

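What I want:

Determine whether the current line is part of a transaction (has an ID).
Check whether the ID is already registered in the accumulating array.
- If not, check whether it is of the desired type: search for a fixed string in the body of the line.
- If it is, register the timestamp, and so on.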

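Here is the code (note that this is a heavily reduced version).

This is the one I want to use: it first checks whether the line is a transaction line, and then whether it is of the right type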

awk '$4 ~ /^\([:alnum:]/
{
  name=$4;gsub(/[()]|:.*/,"",name);++matched
  if(!(name in arr)){
    if($0 ~ /transaction type/){arr[name]=1;print name}}
}END
{
  print "Found :"length(arr)
  print "Processed "NR
  print matched" lines matched the filter"
}'

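This script finds only 868 transactions, while there are a bit more than 14K. If I change the script to look like the code below, it finds all 14K transactions, but only the first line of each one, so it is of no use to me.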

awk '/transaction type/
{
  name=$4;gsub(/[()]|:.*/,"",name);++matched
  if(!(name in arr)){
    arr[name]=1;print name
   }
}END
{
  print "Found :"length(arr)
  print "Processed "NR
  print matched" lines matched the filter"
}'

Thanks in advance.

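Edit

Shame on me. There were several actual problems in this question. The main one was that the regex did not match the right strings. The ID string and the transaction-type string are on the same line, that much is true, but on those lines the ID looks like (aaaaaabbbbbcccc:  ), with two spaces at the end. That made awk parse "(aaaaaaaabbbbcccc:" and ")" as two different fields. I realized it when I ran

$4 !~ /regex/ { print $4 }

and a lot of valid IDs showed up.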
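The field splitting described above is easy to reproduce. This is only a sketch: the log line is a made-up example built from the description (an ID followed by a colon and two spaces before the closing parenthesis), not a line from the real log.

```shell
# Hypothetical log line: the ID is followed by ":  )" (colon, two spaces, paren)
line='2014-10-06 05:24:40,035 INFO  (4aaaaaaaaabbbbbbcccb:  ) [somestring] body'

# With the default field separator, runs of blanks split fields,
# so the closing parenthesis ends up alone in $5
split=$(printf '%s\n' "$line" | awk '{ print $4 "|" $5 }')
echo "$split"
```

Since $4 no longer ends with ")", any test on $4 has to tolerate the trailing colon, which is what the `:.*` part of the gsub in the script strips.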

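The second problem, which showed up after I fixed the regex, has already been covered here: having the main regex and the first { on separate lines makes awk print every record. I figured it out myself and, later the same day, read the solution here. Amazing.

Many thanks to everyone. I can only accept one answer, but I learned a lot from all of them.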

Answer 1

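It is just a syntax mistake. A POSIX character class only works inside a bracket expression, so it needs its own pair of square brackets: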

[[:alnum:]]

Otherwise [:alnum:] is treated as a bracket expression that matches any one of the characters : a l n u m
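A quick way to see the difference, as a sketch with made-up field values and assuming an awk that supports POSIX classes (gawk does; very old mawk builds may not):

```shell
# Three sample values: a digit-first ID, a field starting with ":", an uppercase ID
fields='(4abc)
(:zzz)
(XYZ)'

# Plain bracket list: "(" followed by one of the characters : a l n u m
# (some awks, e.g. gawk, may print a warning on stderr; it is silenced here)
loose=$(printf '%s\n' "$fields" | awk '/^\([:alnum:]/' 2>/dev/null)

# POSIX class inside a bracket expression: "(" followed by any letter or digit
strict=$(printf '%s\n' "$fields" | awk '/^\([[:alnum:]]/')

echo "loose:";  echo "$loose"
echo "strict:"; echo "$strict"
```

With the loose form only IDs that happen to start with one of those six characters get through, which would explain matching just a fraction of the transactions.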

Answer 2

Whitespace matters in awk, and newlines in particular. This:

/foo/ {
    print "found"
}

means print "found" every time "foo" is present, while:

/foo/
{
    print "found"
}

means print the current record every time "foo" is present and print "found" for every single input record, so most likely when you wrote:

$4 ~ /^\([:alnum:]/
{
     ....
}

you actually meant to write:

$4 ~ /^\([:alnum:]/ {
     ....
}

Also, you probably want the POSIX character class [[:alnum:]] rather than the bracket expression [:alnum:], which describes the set of characters : a l n u m:

$4 ~ /^\([[:alnum:]]/ {
     ....
}

If you fix those issues and still need help, post some testable sample input and the expected output so we can help you further.
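To make the newline effect concrete, here is a small sketch that runs both layouts on a two-line input (the input is invented for the illustration):

```shell
input='foo
bar'

# Pattern and action on the same line: the block runs only for matching records
same_line=$(printf '%s\n' "$input" | awk '/foo/ {
    print "found"
}')

# Pattern alone on its line: this is two rules. /foo/ gets the default action
# (print the record) and the bare block runs for every input record.
split_lines=$(printf '%s\n' "$input" | awk '/foo/
{
    print "found"
}')

echo "same line:";   echo "$same_line"
echo "split lines:"; echo "$split_lines"
```

The first program prints "found" once; the second prints the record "foo" and then "found" for each of the two records.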

Answer 3

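So in short, if I understand correctly, you want to get the IDs of a specific type of transaction.

First assumption: the ID and the transaction type are on the same line. Something like this (largely adapted from your code):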

awk 'BEGIN {
  matched=0 # more for clarity than really needed
}
/\([[:alnum:]]*\).*transaction type/ { # only lines containing both an id and the transaction type
  gsub(/[()]/,"",$4) # strip the () around the id
  ++matched # count every matched line, including repeats of the same id
  if (!($4 in arr)) { # as in yours: only if the id is not in the array yet
    arr[$4]=1 # remember the id so it is not included twice
    print $4 # print the id (only once, since we are inside the if)
  }
}
END { # nothing changed here, printing the stats...
  print "Found :"length(arr)
  print "Processed "NR
  print matched" lines matched the filter"
}'

Output from the sample input:

prompt=> awk 'BEGIN { matched=0}; / \([a-z0-9]*\) / { gsub(/[()]/,"",$4); ++matched; if (!($4 in arr)) { arr[$4]=1; print $4 }}; END { print "Found: "length(arr)"\nProcessed "NR"\n"matched" lines matched the filter" }' awkinput
4aaaaaaaaabbbbbbcccb
4xxbbbbbbbbbbbbbcccb
Found: 2
Processed 4
4 lines matched the filter

I left the transaction type out of the test, as I don't know what it might be.
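If the goal is to follow every line of the wanted transactions, one possible extension is sketched below. It is not tested against the real log: the sample lines are invented, "transaction type" is just the placeholder string from the question, and the array names first, last and lines are made up for the example.

```shell
log='2014-10-06 05:24:40,035 INFO  (4aaaaaaaaabbbbbbcccb) [somestring] start transaction type A
2014-10-06 05:24:41,100 INFO  (4aaaaaaaaabbbbbbcccb) [somestring] body with other information
2014-10-06 05:24:40,500 INFO  (4xxbbbbbbbbbbbbbcccb) [somestring] this is a different transaction
a line without the usual format
2014-10-06 05:24:42,250 INFO  (4aaaaaaaaabbbbbbcccb) [somestring] body with more information'

report=$(printf '%s\n' "$log" | awk '
$4 ~ /^\([[:alnum:]]+/ {                # only lines whose 4th field looks like an ID
    id = $4
    gsub(/[()]|:.*/, "", id)            # strip parentheses and anything from ":" on
    if (!(id in first) && $0 ~ /transaction type/)
        first[id] = $2                  # first line of a wanted transaction
    if (id in first) {                  # this and every later line of that transaction
        last[id] = $2
        lines[id]++
    }
}
END {
    for (id in first)
        print id, first[id], last[id], lines[id]
}')
echo "$report"
```

For this input it reports the single wanted ID with its first and last time-of-day stamps and a count of 3 lines; turning the two stamps into a duration is left out.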

awk regex: difference between using it with or without a variable
