Question

I have two files, file1.csv

3 1009
7 1012
2 1013
8 1014

and file2.csv

5 1009
3 1010
1 1013

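In the shell, I want to subtract the counts in the first column of the second file from those in the first file, matching rows by the identifier in the second column. If an identifier is missing from one of the files, its count is assumed to be 0.

The result would be

-2 1009
-3 1010
7 1012
1 1013
8 1014

The files are large (several GB). The second column is sorted.

How can I do this efficiently in the shell?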

Answer 1

Assuming both files are sorted on the second column:

$ join -j2 -a1 -a2 -oauto -e0 file1 file2 | awk '{print $2 - $3, $1}'
-2 1009
-3 1010
7 1012
1 1013
8 1014

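join joins sorted files.
-j2 joins on the second column.
-a1 prints records from file1 even when file2 has no corresponding line.
-a2 is the same as -a1, but applied to file2.
-oauto is, in this case, equivalent to -o0,1.1,2.1: it prints the join column, followed by the remaining columns from file1 and file2.
-e0 inserts 0 in place of an empty column. This works together with -a1 and -a2.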

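The output of join is three columns, like this:

1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0

This is piped to awk, which subtracts column 3 from column 2 and reformats the line.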
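To make the two stages easy to reproduce, here is a minimal sketch that builds small sample files and runs the pipeline. It assumes GNU join (for -o auto); the helper name subtract_counts and the temporary file names are made up for illustration, and the sample data is inferred from the outputs shown in the answers.

```shell
#!/bin/sh
# Sketch only: assumes GNU join (-o auto) and inputs sorted on column 2.
# subtract_counts is a hypothetical helper name, not part of the answer.
subtract_counts() {
    join -j2 -a1 -a2 -o auto -e0 "$1" "$2" | awk '{ print $2 - $3, $1 }'
}

# Sample data inferred from the outputs shown in the answers.
tmp=$(mktemp -d)
printf '%s\n' '3 1009' '7 1012' '2 1013' '8 1014' > "$tmp/file1"
printf '%s\n' '5 1009' '3 1010' '1 1013'          > "$tmp/file2"

# Stage 1: identifier, count from file1, count from file2 (0 when missing)
join -j2 -a1 -a2 -o auto -e0 "$tmp/file1" "$tmp/file2"

# Stage 2: difference followed by the identifier
subtract_counts "$tmp/file1" "$tmp/file2"

rm -r "$tmp"
```

Both stages stream their input, so memory use should stay small regardless of file size.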

Answer 2


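$ awk 'NR==FNR { a[$2]=$1; next } { a[$2]-=$1 } END { for(i in a) print a[i],i }' file1 file2
7 1012
1 1013
8 1014
-2 1009
-3 1010

It reads the first file into memory, so you should have enough memory available. The output order is whatever awk's for-in iteration yields, so pipe it through sort -k2 if you need it ordered. If you don't have the memory, I would probably sort -k2 the files first, then sort -m (merge) them and continue with that output.

(I'm out of time for now; maybe I'll finish it later.)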

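Edit by Ed Morton: I hope you don't mind me adding what I was working on here rather than posting my own very similar answer; feel free to modify or delete it: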

$ sort -m -k2 -k3 <(sed 's/$/ 1/' file1|sort -k2) <(sed 's/$/ 2/' file2|sort -k2) # | awk ...
3 1009 1
5 1009 2  # previous $2 = current $2 -> subtract
3 1010 2  # previous $2 =/= current and current $3=2 print -$3
7 1012 1
2 1013 1  # previous $2 =/= current and current $3=1 print prev $2
1 1013 2
8 1014 1
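The awk step above is left unfinished in the answer. One possible way to complete it, as a sketch rather than the authors' method, is to treat lines tagged 1 as positive and lines tagged 2 as negative, and sum per identifier while streaming. The function name merge_subtract and the tagged temporary files are assumptions for illustration; both inputs are assumed to be already sorted on column 2.

```shell
#!/bin/sh
# Sketch only: merge_subtract is a hypothetical helper, and the inputs are
# assumed to be sorted on column 2 already (so sort -m can merge them).
merge_subtract() {
    work=$(mktemp -d)
    # Tag each line with the number of the file it came from.
    sed 's/$/ 1/' "$1" > "$work/tagged1"
    sed 's/$/ 2/' "$2" > "$work/tagged2"
    # Merge on the identifier, then keep a running sum per identifier.
    sort -m -k2,2 "$work/tagged1" "$work/tagged2" |
    awk '
        { v = ($3 == 1) ? $1 : -$1 }                  # file2 counts are subtracted
        NR > 1 && $2 != last { print sum, last; sum = 0 }
        { sum += v; last = $2 }
        END { if (NR > 0) print sum, last }
    '
    rm -r "$work"
}

tmp=$(mktemp -d)
printf '%s\n' '3 1009' '7 1012' '2 1013' '8 1014' > "$tmp/file1"
printf '%s\n' '5 1009' '3 1010' '1 1013'          > "$tmp/file2"
merge_subtract "$tmp/file1" "$tmp/file2"
rm -r "$tmp"
```

Only one running sum is held at a time, so memory use does not grow with the input. The tagged copies cost extra disk space; the process substitution used in the command above avoids that in shells that support it.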

Answer 3

Since the files are sorted¹, you can merge them line by line with the join utility from coreutils:

$ join -j2 -o auto -e 0 -a 1 -a 2 41144043-a 41144043-b
1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0

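All of these options are required:

-j2 says to join on the second column of each file
-o auto says to give every line the same format, starting with the join key
-e 0 says that missing values should be replaced with zero
-a 1 and -a 2 include lines that are missing from one file or the other
the file names (I've used names based on the question number here)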
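The -j and -o auto spellings are GNU extensions. Where they are not available, the same three-column stream can probably be produced with options from the POSIX join specification, by naming the output fields explicitly. This is a sketch under that assumption; join_counts_portable is a made-up name and the sample data is inferred from the output shown above.

```shell
#!/bin/sh
# Sketch only: join_counts_portable is a hypothetical helper name.
# -1 2 -2 2 : join on field 2 of each file (instead of -j2)
# -o 0,1.1,2.1 : join field, then field 1 of file1, then field 1 of file2
join_counts_portable() {
    join -1 2 -2 2 -a 1 -a 2 -e 0 -o 0,1.1,2.1 "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' '3 1009' '7 1012' '2 1013' '8 1014' > "$tmp/a"
printf '%s\n' '5 1009' '3 1010' '1 1013'          > "$tmp/b"
join_counts_portable "$tmp/a" "$tmp/b"
rm -r "$tmp"
```

Using 0 (the join field) rather than 1.2 matters for lines that appear only in the second file: 1.2 would be empty there and get replaced by the -e string.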

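Now that we have an output stream in this format, we can do the subtraction on each line. I use this sed command to turn the output above into a dc program:

sed -e 's/.*/c& -n[ ]np/'

This takes the three values on each line and rearranges them into a dc command that performs the subtraction. For example, the first line becomes (with spaces added for clarity)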

c 1009 3 5 -n [ ]n p

This subtracts 5 from 3 and prints the result, then prints a space, then prints 1009 and a newline, giving

-2 1009

as required.

We can then pipe all of these lines into dc, which gives us the output we want:

$ join -o auto -j2 -e 0 -a 1 -a 2 41144043-a 41144043-b \
>   | sed -e 's/.*/c& -n[ ]np/' \
>   | dc
-2 1009
-3 1010
7 1012
1 1013
8 1014

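¹ The sort order needs to be consistent with the LC_COLLATE locale setting. That is unlikely to be a problem if the fields are always numeric.

TL;DR

The complete command is: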

join -o auto -j2 -e 0 -a 1 -a 2 "$file1" "$file2" | sed -e 's/.*/c& -n[ ]np/' | dc

It processes one line at a time and starts only the three processes you see, so it should be reasonably efficient in both memory and CPU.

Answer 4

Assuming this is a whitespace-separated csv; if it is separated by "," use the argument -F ','

awk 'FNR==NR {Inits[$2]=$1; ids[$2]++; next}
             {Discounts[$2]=$1; ids[$2]++}
     END     { for (id in ids) print Inits[ id] - Discounts[ id] " " id}
    ' file1.csv file2.csv
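For the comma-separated case mentioned above, a minimal sketch might look like the following. The function name subtract_csv and the sample data are assumptions for illustration; the final sort is added only because awk's for-in order is unspecified.

```shell
#!/bin/sh
# Sketch only: subtract_csv is a hypothetical helper for "count,id" input.
# Like the answer's version, it keeps one entry per identifier in memory.
subtract_csv() {
    awk -F ',' -v OFS=',' '
        FNR == NR { count[$2] = $1; next }   # first file: remember its counts
                  { count[$2] -= $1 }        # second file: subtract (missing ids start at 0)
        END       { for (id in count) print count[id], id }
    ' "$1" "$2" | sort -t ',' -k2,2
}

tmp=$(mktemp -d)
printf '%s\n' '3,1009' '7,1012' '2,1013' '8,1014' > "$tmp/file1.csv"
printf '%s\n' '5,1009' '3,1010' '1,1013'          > "$tmp/file2.csv"
subtract_csv "$tmp/file1.csv" "$tmp/file2.csv"
rm -r "$tmp"
```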

For the memory issue (this could be a single pipeline, but I prefer using a temporary file):

awk 'FNR==NR{print;next}{print -1 * $1 " " $2}' file1 file2 \
 | sort -k2 \
 > file.tmp
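awk 'Last != $2 {
        # new id: print the finished group, then start a new one
        if (NR != 1) print Result " " Last
        Last = $2; Result = $1
        next    # without this, the first count of each id is added twice
        }
        { Result += $1 }    # same id as the previous line: accumulate
    END { if (NR) print Result " " Last }
    ' file.tmp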
rm file.tmp
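The single-pipeline form mentioned above could be sketched as follows: negate the counts from the second file, sort on the identifier, and sum consecutive lines that share an identifier. The function name pipeline_subtract and the sample data are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: pipeline_subtract is a hypothetical helper name.
pipeline_subtract() {
    # file1 lines pass through unchanged; file2 counts are negated
    awk 'FNR == NR { print; next } { print -$1, $2 }' "$1" "$2" |
    sort -k2,2 |
    awk '
        $2 != last { if (NR > 1) print sum, last; last = $2; sum = 0 }
                   { sum += $1 }
        END        { if (NR > 0) print sum, last }
    '
}

tmp=$(mktemp -d)
printf '%s\n' '3 1009' '7 1012' '2 1013' '8 1014' > "$tmp/file1"
printf '%s\n' '5 1009' '3 1010' '1 1013'          > "$tmp/file2"
pipeline_subtract "$tmp/file1" "$tmp/file2"
rm -r "$tmp"
```

Here sort does the heavy lifting and may spill to its own temporary files on large inputs, so this trades the explicit file.tmp for sort's internal ones.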
