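My awk command sorts, but unexpectedly drops duplicates

Time: 2017-02-28 02:26:54

Tags: sorting awk

I am trying to sort this file by a specific field, and I would like to do everything in awk: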

"firstName": "gdrgo",   "xxxxx": "John", "xxxxx": "John", "xxxxx": "John", "xxxxx": "John", "xxxxx": "John",   "lastName": "222",dfg
"xxxxx": "John",    "firstName": "beto",   "xxxxx": "John", "xxxxx": "John", "xxxxx": "John",   "lastName": "111","xxxxx": "John",
"xxxxx": "John",    "firstName": "beto",   "xxxxx": "John", "xxxxx": "John", "xxxxx": "John",   "lastName": "111","xxxxx": "John",
"xxxxx": "John",   "xxxxx": "John",    "firstName": "beto2", "xxxxx": "John","lastName": "555", "xxxxx": "John","xxxxx": "John",
"xxxxx": "John",   "xxxxx": "John",    "firstName": "beto2", "xxxxx": "John","lastName": "444", "xxxxx": "John","xxxxx": "John",
"firstName": "gdrgo",   "xxxxx": "John", "xxxxx": "John", "xxxxx": "John", "xxxxx": "John", "xxxxx": "John",   "lastName": "222",dfg
"xxxxx": "John",   "xxxxx": "John",    "firstName": "beto2", "xxxxx": "John","lastName": "444", "xxxxx": "John","xxxxx": "John",

I use this command:

awk -F'.*"firstName": "|",.*"lastName": "|",' '{b[$3]=$0} END{for(i in b){print i}}' sumacomando

Output:

111
222
444
555

But I expected:

111
111
222
222
444
444
555

That is, although the actual output happens to be sorted the way I want, the duplicate values are unexpectedly missing.

3 answers:

Answer 0: (score: 2)

Your choice of field separator is unconventional; it may be better to use this:

awk -F'[:,]' '{for(i=1;i<=NF;i++)
                  if($i~"\"lastName\"")
                      {gsub(/[" ]/,"",$(i+1));
                       print $(i+1)}}' file | sort

If your awk has the asort function, you can do this:

awk -F'[:,]' '{for(i=1;i<=NF;i++)
                 if($i~"\"lastName\"")
                    {gsub(/[" ]/,"",$(i+1));
                     a[++c]=$(i+1)}}
          END {asort(a);
               for(k=1;k in a;k++) print a[k]}' file

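Answer 1: (score: 2)

  • The ordering of keys/indices in awk arrays, which are always associative arrays (dictionaries), is an implementation detail; no particular order is guaranteed. In your case, the output merely happened to come out sorted.

  • Keys are unique, so if $3 has the same value in more than one input line, the b[$3]=... assignments overwrite each other; the last one wins.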
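
The overwriting can be observed directly. The following is only a minimal sketch: the function name count_keys and the three-line sample input are made up for illustration, and $1 stands in for the $3 of the question.

```shell
# count_keys is a hypothetical helper: it stores each line under its first
# field, exactly like b[$3]=$0 in the question, then reports what survived.
count_keys() {
  awk '{ b[$1] = $0 }                  # a repeated key overwrites the earlier entry
       END { n = 0; for (k in b) n++   # count the keys that are still present
             print NR " lines read, " n " keys stored" }'
}

# Sample input (an assumption): the key 111 occurs twice.
printf '111 first\n222 second\n111 third\n' | count_keys
```

Three lines go in, but only two keys remain, which is the same effect that removes the duplicates from the output in the question.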

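You therefore:

  • must use a sequentially indexed array to store the 3rd field values ($3)

  • must then sort the resulting array by its values.

According to the POSIX spec, Awk has no built-in sorting functions, but GNU awk provides the asort() function, which enables the following solution: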

awk -F'.*"firstName": "|",.*"lastName": "|",' '
  { b[++n]=$3 } END{ asort(b); for(i=1;i<=n;++i) print b[i] }
' sumacomando

Note that this does not store the associated whole line ($0).
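
If asort() is not available (for example in an awk that implements only the POSIX feature set), the same two steps can be written by hand. This is a sketch under assumptions, not part of the original answer: the function name sort_values is hypothetical, the key is read from $1 rather than $3, and a simple insertion sort stands in for asort(), which is fine for small inputs but quadratic for large ones.

```shell
# sort_values is a hypothetical helper: it collects the first field of every
# line in a sequentially indexed array and sorts that array without asort().
sort_values() {
  awk '{ v[++n] = $1 }                 # sequential index, so duplicates are kept
       END {
         # Insertion sort on v[1..n]; equal values keep their input order.
         for (i = 2; i <= n; i++) {
           x = v[i]
           for (j = i - 1; j >= 1 && v[j] > x; j--) v[j + 1] = v[j]
           v[j + 1] = x
         }
         for (i = 1; i <= n; i++) print v[i]
       }'
}

# Sample input (an assumption): the lastName values from the question.
printf '222\n111\n111\n555\n444\n222\n444\n' | sort_values
```

For the real file, the same awk program would be run with the -F option from the question and $3 in place of $1.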

If you also want to keep the associated full line while doing the sorting inside (GNU) Awk, things get more complicated:

awk -F'.*"firstName": "|",.*"lastName": "|",' '
  # Use a compound key to store the value of $3 plus a sequential index
  # to disambiguate, and store the input row ($0) as the value.
  { vals[$3,++n]=$0 }
  END{     
    # Sort by compound key using the helper function defined below.
    asorti(vals, names, "cmp_func");
    # Output the first half of the compound key, i.e., the value of $3,
    # followed by the associated input row.
    for(i=1;i<=n;++i) print gensub(SUBSEP ".*$", "", 1, names[i]), vals[names[i]]
  }
  # Helper sort function that splits the compound key into its components
  # - $3 value and sequential index - and compares the $3 values alphabetically
  # and the indices numerically.
  function cmp_func(i1, v1, i2, v2) {
    split(i1, tokens1, SUBSEP)
    split(i2, tokens2, SUBSEP)
    if (tokens1[1] < tokens2[1]) return -1
    if (tokens1[1] > tokens2[1]) return 1
    i1 = int(tokens1[2])
    i2 = int(tokens2[2])
    if (i1 < i2) return -1
    if (i1 > i2) return 1
    return 0
  }
' sumacomando

As an alternative, piping to sort simplifies the problem considerably:

awk -F'.*"firstName": "|",.*"lastName": "|",' '{ print $3, $0 }' sumacomando | sort -k1,1

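Note, however, that the pure Awk solution above preserves the input order among lines with duplicate $3 values, whereas the sort-assisted solution does not.

Conversely, the pure Awk solution has to hold all input in memory at once, whereas the sort utility is optimized for large input sets and uses temporary files as needed.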
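
One way to get both properties is to let awk add the line number as a secondary sort key and still leave the sorting to sort. The following is a sketch, not part of the original answer: the function name sort_keep_order is hypothetical, the key is taken from $1 instead of $3, and it assumes the key itself contains no spaces.

```shell
# sort_keep_order is a hypothetical helper: it prints "key line" for every
# input line, sorted by key, with ties kept in their original input order.
sort_keep_order() {
  awk '{ print NR, $1, $0 }' |   # decorate: line number, key, original line
    sort -k2,2 -k1,1n |          # sort by key, then numerically by line number
    cut -d' ' -f2-               # drop the line number again
}

# Sample input (an assumption): two lines share the key 111.
printf '222 a\n111 b\n111 a\n' | sort_keep_order
```

Here the line "111 b" stays ahead of "111 a" because it came first in the input; a plain sort -k1,1 would fall back to comparing the whole line and swap the two.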

Answer 2: (score: 1)

@victorhernandezzero: @try: I tried a different approach; I hope it helps you and everyone else as well. It uses a single awk (no other commands).

EDIT1: The solution above will not give you the duplicates you need; special thanks to mklement0 for letting me know. The following may also help you.

awk '/lastName/{getline;while(!$0){getline};A[$0]} END{num=asorti(A, B);for(i=1;i<=num;i++){print B[i]}}' RS='[: ",]'   Input_file