How to convert a multimap <k,v> into a map <k,{v1,v2}> in bash

Time: 2017-06-06 01:58:11

Tags: bash mapreduce

I am processing file output in bash and need to group the values by key.

For example, I have

13,47099
13,54024
13,1
13,39956
13,0
17,126223
17,52782
17,4
17,62617
17,0
23,1022724
23,79958
23,80590
23,230
23,1
23,118224
23,0
23,1049
42,72470
42,80185
42,2
42,89199
42,0
54,70344
54,72824
54,1
54,62969
54,1

in a file, and I want to group all the values for a given key onto a single line, like

13,47099,54024,1,39956,0
17,126223,52782,4,62617,0
23,1022724,79958,80590,230,1,118224,0,1049
42,72470,80185,2,89199,0
54,70344,72824,1,62969,1

My input file has about 10,000 entries. How can I transform this data in the shell?

3 Answers:

Answer 0: (score: 4)

awk to the rescue!

Assuming the keys are contiguous...

$ awk -F, 'p!=$1 {if(a) print a; a=p=$1} 
                 {a=a FS $2} 
           END   {print a}' file

13,47099,54024,1,39956,0
17,126223,52782,4,62617,0
23,1022724,79958,80590,230,1,118224,0,1049
42,72470,80185,2,89199,0
54,70344,72824,1,62969,1
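
If the keys are not contiguous, the one-pass command above would print the same key on more than one line. Below is a minimal sketch of one way around that, assuming it is acceptable to hold every group in memory (which should be fine for about 10,000 entries). The function name group_by_key and the interleaved sample input are made up for illustration.

```shell
# Group values by key even when lines for the same key are interleaved.
# Reads from stdin, or from any files given as arguments.
group_by_key() {
  awk -F, '
    # First time this key is seen: remember its position and seed the row with the key.
    !($1 in vals) { order[++n] = $1; vals[$1] = $1 }
    # Every line: append the value to the row for its key.
                  { vals[$1] = vals[$1] FS $2 }
    # Print rows in first-seen key order.
    END           { for (i = 1; i <= n; i++) print vals[order[i]] }
  ' "$@"
}

# Hypothetical interleaved input built from values in the question.
printf '%s\n' 13,47099 17,126223 13,54024 17,52782 13,1 | group_by_key
```

Another option, where sort supports the -s (stable) flag as GNU and BSD sort do, is to run sort -t, -k1,1n -s file first and pipe the result into the original command, which keeps the values for each key in their input order.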

Answer 1: (score: 0)

For awk beginners, here is a breakdown of @karakfa's code. I wrote it against a toy dataset, file:

1,X
1,Y
3,Z
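  • p!=$1: check whether the pattern p!=$1 is true
    • this compares variable p with the first field of the current (first) line of file (1 in this case)
    • since p is undefined at this point, it cannot be equal to 1, so p!=$1 is true and we continue with this line of code
  • if(a) print a: check whether variable a is non-empty, and print a if it is
    • since a is undefined, the print a command is not executed
  • a=p=$1: set variables a and p to the value of the first field of the current (first) line (1 in this case)
  • a=a FS $2: set a to its current value joined, by the field separator, with the second field of the current (first) line (1,X in this case)
  • END: since we have not yet reached the end of file, we skip the rest of this line of code
  • move to the next (second) line of file and restart the awk code on that line

  • p!=$1: check whether the pattern p!=$1 is true
    • since p is 1 and the first field of the current (second) line is 1, p!=$1 is false, so we skip the rest of this line of code
  • a=a FS $2: set a to its current value joined, by the field separator, with the second field of the current (second) line (1,X,Y in this case)
  • END: since we have not yet reached the end of file, we skip the rest of this line of code
  • move to the next (third) line of file and restart the awk code

  • p!=$1: check whether the pattern p!=$1 is true
    • since p is 1 and $1 of the third line is 3, p!=$1 is true, so we continue with this line of code
  • if(a) print a: check whether variable a is non-empty, and print a if it is
    • since a is 1,X,Y at this point, 1,X,Y is printed to the output
  • a=p=$1: set variables a and p to the value of the first field of the current (third) line (3 in this case)
  • a=a FS $2: set a to its current value joined, by the field separator, with the second field of the current (third) line (3,Z in this case)
  • END {print a}: since we have reached the end of file, this code is executed
    • print a: print the last group held in a (3,Z in this case)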

The resulting output is

1,X,Y
3,Z

Please let me know if there are any mistakes in this explanation.
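
To follow the walkthrough on a real run, the one-liner can be spread over several lines with a trace of a after every input line. This is only a sketch: the function name trace_groups is made up, and writing to "/dev/stderr" from awk is assumed to work on your system (it does with gawk, mawk and BSD awk on Linux and macOS, as far as I know).

```shell
# Same logic as the one-liner, expanded and commented.
# The grouped result goes to stdout, the per-line trace goes to stderr.
trace_groups() {
  awk -F, '
    p != $1 {                 # key changed (on the first line p is still unset)
      if (a) print a          # flush the previous group, if there is one
      a = p = $1              # start a new group that begins with the key
    }
    {
      a = a FS $2             # append the second field, separated by a comma
      print "line " NR ": a=" a > "/dev/stderr"
    }
    END { print a }           # flush the last group
  ' "$@"
}

# Toy dataset from the walkthrough above, fed on stdin.
printf '%s\n' 1,X 1,Y 3,Z | trace_groups
```

The trace should show a going through 1,X then 1,X,Y then 3,Z, matching the three passes described above.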

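Answer 2: (score: 0)

A small tweak to @karakfa's answer: the separator between the key and the values can be made different from the separator between the values themselves.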
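
One way such a tweak might look, as a sketch rather than the answerer's exact code: keep a variable holding the separator to use next, set it to the key/value separator when a new group starts and to the value/value separator after the first value. The function name group_with_separators and its two parameters are hypothetical.

```shell
# $1: separator between key and first value, $2: separator between values.
# Reads key,value lines from stdin; keys are assumed contiguous, as in the original answer.
# Note: awk -v interprets backslash escapes in the separators it is given.
group_with_separators() {
  awk -F, -v kv="$1" -v vv="$2" '
    p != $1 { if (a) print a; a = p = $1; sep = kv }   # new key: next separator is kv
            { a = a sep $2; sep = vv }                 # later values are joined with vv
    END     { print a }
  '
}

# Example: colon after the key, spaces between the values.
printf '%s\n' 13,47099 13,54024 17,126223 | group_with_separators ':' ' '
```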
