Question

I have a data file. It looks like this:

INPUT1:

1 20022 44444 44444
2 31012 22233 44444
3 31012 22233 00444
4 20022 44444 00444
5 20022 44444 00444
6 20022 44444 00444
7 31012 44444 00444 
8 31012 44444 00444
9 31012 87634 44444
10 20022 87634 44444

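I want to convert each distinct string in every column into a sub-column, and put 1 or 0 in each row according to whether that sub-column's string is observed in that row:

OUTPUT1: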

   c1.20022 c1.31012 c2.44444 c2.22233 c2.87634 c3.44444 c3.00444
 1        1        0        1        0        0        1        0
 2        0        1        0        1        0        1        0
 3        0        1        0        1        0        0        1
 4        1        0        1        0        0        0        1
 5        1        0        1        0        0        0        1
 6        1        0        1        0        0        0        1
 7        0        1        1        0        0        0        1
 8        0        1        1        0        0        0        1
 9        0        1        0        0        1        1        0
10        1        0        0        0        1        1        0

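My real data has more than 100000 columns and rows. I should also mention that I want to run this program on Linux.

Second part: I want to delete the strings that are repeated fewer than a hundred times in each column; I don't want sub-columns for those. In my example input file I want to delete the strings that are repeated fewer than 3 times:

INPUT2: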

 1 20022 44444 44444
 2 31012  NA   44444
 3 31012  NA   00444
 4 20022 44444 00444
 5 20022 44444 00444
 6 20022 44444 00444
 7 31012 44444 00444 
 8 31012 44444 00444
 9 31012  NA   44444
10 20022  NA   44444

And output:

output2:
   c1.20022 c1.31012 c2.44444 c3.44444 c3.00444
 1        1        0        1        1        0
 2        0        1        0        1        0
 3        0        1        0        0        1
 4        1        0        1        0        1
 5        1        0        1        0        1
 6        1        0        1        0        1
 7        0        1        1        0        1
 8        0        1        1        0        1
 9        0        1        0        1        0
10        1        0        0        1        0

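What should I change in the shell script written in the answer below in order to get directly from my first input (input1) to the last output (output2)?

Slight update: what if in my input every two rows represent one individual (rows 1 and 2 belong to individual 1):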

1 20022 44444 44444
1 31012 44444 44444
2 31012 00000 00444
2 20022 44444 00444
3 20022 44444 00444
3 20022 44444 00444
4 31012 44444 00444 
4 31012 44444 00444
5 31012 11112 44444
5 20022 11112 44444

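I want every individual to appear only once in my output.txt, while still converting each string in every column into a sub-column, and I want to put 2, 1 or 0 in the rows according to how many times each string occurs in that sub-column for that individual. I also want to delete the strings that occur fewer than 3 times in each column (here 00000 and 11112 in column 2):

output1.txt: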

   c1.20022 c1.31012 c2.44444 c3.44444 c3.00444
 1        1        1        2        2        0
 2        1        1        1        0        2
 3        1        0        1        0        2
 4        0        2        2        0        2
 5        1        1        0        2        0

Here I put spaces between the numbers to make them easy to read, but in reality these spaces are not needed (e.g. first row: 1 11220).

Answer 1

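As a non-Fortran solution, I wrote a (g)awk script that does what you want; your file should be fed to it twice. During the first run it builds the array of labels appearing in each column, which is the only memory-heavy step of the process. In the post-processing stage every line is handled on its own, column by column, so I guess its usefulness depends on the distribution of your header values.

Important note: the script uses true 2d arrays with the syntax labels[i][$i], rather than standard awk's array[i,j] syntax, in order to be able to loop over the second index. This works in gawk, but other awk flavours might not support it.

foo.awk: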

#!/usr/bin/gawk -f

# first run: set up the label array
NR==FNR{
  for(i=2; i<=NF; i++){
    labels[i][$i]=1;
  }
}

# second run: do the actual printing
NR!=FNR{
  if(FNR==1){   # then print header
    printf "       ";
    for(i=2; i<=NF; i++){   # i corresponds to columns in input
      for(label in labels[i]){
        printf " c%d.%s ",i-1,label;  # note i-1
      }
    }
    print ""; # newline
  }

  printf "%10d", FNR; # column 1: line number
  for(i=2; i<=NF; i++){
    for(label in labels[i]){  # loop over every possible label in column i
      if($i==label){
        printf "    1     ";  # 1 if same
      }
      else {
        printf "    0     ";  # 0 if different
      }
    }
  }
  print ""; # newline
}

Front-end bar.sh:

#!/bin/bash

infile=$1

gawk -f foo.awk "$infile" "$infile"

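Run it as ./bar.sh infile after making bar.sh executable, where "infile" should be replaced with the actual name of your input file. Obviously you can skip the shell script and just call gawk -f foo.awk infile infile, but I'm too lazy to type that multiple times.

Also note that you might want to remove most of the spaces in the printf commands. They are there for a pretty output, but you will probably not look at the output by hand and will use some automatic post-processing instead, and all that whitespace will blow up your final, huge file. So I suggest keeping a single space at the beginning of each printf to separate the columns from each other, and removing the rest.

Output:

c1.20022  c1.31012  c2.44444  c2.87634  c2.22233  c3.00444  c3.44444
 1    1         0         1         0         0         0         1
 2    0         1         0         0         1         0         1
 3    0         1         0         0         1         1         0
 4    1         0         1         0         0         1         0
 5    1         0         1         0         0         1         0
 6    1         0         1         0         0         1         0
 7    0         1         1         0         0         1         0
 8    0         1         1         0         0         1         0
 9    0         1         0         1         0         0         1
10    1         0         0         1         0         0         1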
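
For awk flavours without arrays of arrays, the same two-pass idea can be sketched with classic array[i,j] subscripts plus a per-column list of labels, so that no loop over a second index is needed. This is only a rough sketch and not a drop-in replacement: the function name onehot_portable and the helper arrays seen, n and name are made up for illustration, and the sub-columns come out in order of first appearance rather than in gawk's internal order.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original answer): one-hot encode
# columns 2..NF of a whitespace-separated file with plain POSIX awk.
# The file is passed to awk twice, just like with foo.awk.
onehot_portable() {
  awk '
    # first pass: remember every new label of column i in order of appearance
    NR == FNR {
      for (i = 2; i <= NF; i++)
        if (!((i, $i) in seen)) {
          seen[i, $i] = 1
          n[i]++
          name[i, n[i]] = $i ""      # appending "" keeps the label a string
        }
      if (NF > maxnf) maxnf = NF
      next
    }
    # second pass, first line: print the header
    FNR == 1 {
      line = ""
      for (i = 2; i <= maxnf; i++)
        for (k = 1; k <= n[i]; k++)
          line = line " c" (i - 1) "." name[i, k]
      print line
    }
    # second pass, every line: line number followed by the 0/1 flags
    {
      line = FNR
      for (i = 2; i <= maxnf; i++)
        for (k = 1; k <= n[i]; k++)
          line = line " " ((($i "") == name[i, k]) ? 1 : 0)
      print line
    }
  ' "$1" "$1"
}
```

It would be called as onehot_portable infile. The comparison is forced to be a string comparison on purpose, so that labels such as 00444 and 444 are not treated as the same number.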

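Update

Regarding your updated question:

I want to delete the strings that are repeated fewer than a hundred times in each column; I don't want sub-columns for those. In my example input file I want to delete the strings that are repeated fewer than 3 times

It's your lucky day, because the script above needs only minor changes to do this. We change the labels[i][label] variable from an indicator to a counter, i.e. every time we find the same label again we keep increasing its value. Then during the second run we simply skip the labels which appear at most 2 times.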

Updated foo.awk:

#!/usr/bin/gawk -f

# first run: set up the label array
NR==FNR{
  for(i=2; i<=NF; i++){
    labels[i][$i]++;  # counter instead of indicator
  }
}

# second run: do the actual printing
NR!=FNR{
  if(FNR==1){   # then print header
    printf "       ";
    for(i=2; i<=NF; i++){   # i corresponds to columns in input
      for(label in labels[i]){
        if(labels[i][label]<3) continue;  # skip labels which appear at most 2 times
        printf " c%d.%s ",i-1,label;  # note i-1
      }
    }
    print ""; # newline
  }

  printf "%10d", FNR; # column 1: line number
  for(i=2; i<=NF; i++){
    for(label in labels[i]){  # loop over every possible label in column i
      if(labels[i][label]<3) continue;  # skip labels which appear at most 2 times
      if($i==label){
        printf "    1     ";  # 1 if same
      }
      else {
        printf "    0     ";  # 0 if different
      }
    }
  }
  print ""; # newline
}

Output:

c1.20022  c1.31012  c2.44444  c3.00444  c3.44444 
 1    1         0         1         0         1     
 2    0         1         0         0         1     
 3    0         1         0         1         0     
 4    1         0         1         1         0     
 5    1         0         1         1         0     
 6    1         0         1         1         0     
 7    0         1         1         1         0     
 8    0         1         1         1         0     
 9    0         1         0         0         1     
10    1         0         0         0         1
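
The threshold 3 is written directly into the script; for the real data it would be 100. As a minimal sketch of the counting step on its own, with the threshold passed in as a variable, something like the following could be used to preview which sub-columns would survive. The function name frequent_labels and the variable min are made up for illustration, and the sketch uses plain awk subscripts joined with SUBSEP instead of gawk's arrays of arrays.

```shell
#!/bin/sh
# Hypothetical helper: read the data on standard input and list, for each
# column, the labels that occur at least $1 times, together with their counts.
frequent_labels() {
  awk -v min="$1" '
    { for (i = 2; i <= NF; i++) count[i SUBSEP $i]++ }   # counter, not indicator
    END {
      for (key in count) {
        if (count[key] < min) continue     # same skip rule as in foo.awk
        split(key, part, SUBSEP)           # part[1] = column, part[2] = label
        print "c" (part[1] - 1) "." part[2], count[key]
      }
    }
  ' | LC_ALL=C sort
}
```

Called as frequent_labels 3 < infile on the first example input, this lists c1.20022 5, c1.31012 5, c2.44444 6, c3.00444 6 and c3.44444 4, which are exactly the five sub-columns in the output above.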

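Update 2

Regarding your twice-updated question,

Slight update: what if in my input every two rows represent one individual (rows 1 and 2 belong to individual 1):
...

Now the data for each individual spans two lines, and you want to process them together. Note that as your problem gets more complicated, so does the solution. To avoid complications, I'm assuming that you have exactly 2 lines for every individual, which seems to be the case. I also have to assume that the first line of your input file starts with 1. This seems to be true as well, but the solutions above did not make use of it. In fact, it is assumed that the individuals span the range from 1 to the total number of individuals, without gaps. It could be done in a more general way, but I don't want to overcomplicate things for no reason.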
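
Since everything that follows relies on each individual occupying exactly two consecutive lines, it may be worth verifying that before running the pipeline. A minimal sketch of such a check, with a made-up function name check_pairs, could look like this:

```shell
#!/bin/sh
# Hypothetical helper: read the data on standard input and fail (exit
# status 1) unless lines 1-2, 3-4, ... each share the same ID in column 1.
check_pairs() {
  awk '
    NR % 2 == 1 { id = $1 ""; next }      # odd line: remember the ID as a string
    ($1 "") != id {                       # even line: ID must match its partner
      printf "line %d: expected ID %s, found %s\n", NR, id, $1
      bad = 1
    }
    END {
      if (NR % 2 == 1) { print "odd number of lines"; bad = 1 }
      exit bad + 0
    }
  '
}
```

It would be used as check_pairs < infile && ./bar.sh infile, so that the conversion only runs when the pairing looks consistent.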

New bar.sh:

#!/bin/bash

infile=$1

cat "$infile" "$infile" |paste - - |gawk -f foo.awk

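This puts each pair of input lines next to each other, so that every individual now sits on a single line, and then feeds this modified file to foo.awk twice.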
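
To see what paste - - does to the stream, here is a tiny illustration on two individuals. After pairing, each line has twice as many fields, and field i+NF/2 holds the partner of field i. The function name show_pairs is made up, and only the first data column is printed:

```shell
#!/bin/sh
# Hypothetical demo: join every two consecutive input lines with paste,
# then print the ID and both values of the first data column.
show_pairs() {
  paste - - |
  awk '{ print $1, $2, $(2 + NF/2) }'   # field i+NF/2 is the partner of field i
}

printf '1 20022 44444\n1 31012 44444\n2 31012 00444\n2 20022 00444\n' | show_pairs
```

For these four lines the demo prints 1 20022 31012 and 2 31012 20022, one line per individual.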

New foo.awk:

#!/usr/bin/gawk -f

# keep count of number of files (from first column of first row)
{if($1==1) nfiles++;}

# first run: set up the label array
nfiles==1{
  for(i=2; i<=NF/2; i++){   # go over first half of the columns
    labels[i][$i]++;          # odd lines
    labels[i][$(i+NF/2)]++;   # even lines
  }
}

# second run: do the actual printing
nfiles==2{
  if($1==1){   # then print header
    printf "       ";
    for(i=2; i<=NF/2; i++){   # i corresponds to columns in input
      for(label in labels[i]){
        if(labels[i][label]<3) continue;  # skip labels which appear at most 2 times
        printf " c%d.%s ",i-1,label;  # note i-1
      }
    }
    print ""; # newline
  }

  printf "%10d ", $1; # column 1: ID of the individual
  for(i=2; i<=NF/2; i++){
    for(label in labels[i]){  # loop over every possible label in column i
      if(labels[i][label]<3) continue;  # skip labels which appear at most 2 times
      multi=0;  # multiplicity of label "label" in column i for this individual
      if($i==label) multi++;
      if($(i+NF/2)==label) multi++;
      printf " %3d    ", multi;
    }
  }
  print ""; # newline
}

Input:

1 20022 44444 44444
1 31012 44444 44444
2 31012 00000 00444
2 20022 44444 00444
3 20022 44444 00444
3 20022 44444 00444
4 31012 44444 00444
4 31012 44444 00444
5 31012 11112 44444
5 20022 11112 44444

Output:

c1.20022  c1.31012  c2.44444  c3.00444  c3.44444 
 1   1       1       2       0       2    
 2   1       1       1       2       0    
 3   2       0       2       2       0    
 4   0       2       2       2       0    
 5   1       1       0       0       2

Note that you can remove most of the extraneous whitespace by changing

printf " %3d    ", multi;

to

printf "%d", multi;
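
If a padded output file has already been written, the whitespace between the counts can also be squeezed out afterwards instead of rerunning the script. A minimal sketch, assuming every count is a single digit (0, 1 or 2, as here) and using a made-up function name squeeze_counts:

```shell
#!/bin/sh
# Hypothetical helper: keep the header line, and on every other line print
# the ID, one space, and then all counts glued together.
squeeze_counts() {
  awk '
    NR == 1 { print; next }                  # header line stays untouched
    {
      row = $1 " "
      for (i = 2; i <= NF; i++) row = row $i # append each count without a separator
      print row
    }
  '
}

printf 'c1.20022 c1.31012 c2.44444\n 1   1   1   2\n 3   2   0   2\n' | squeeze_counts
```

The demo keeps the header and turns the two data lines into 1 112 and 3 202, which is the compact form asked for in the question.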

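Also note that my example output differs from yours, but according to your specification my version seems to be the correct one (for instance, for individual 3 there should be a "2" in the first column).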
