Counting in PySpark

Time: 2017-01-22 18:38:31

Tags: python apache-spark mapreduce pyspark data-science

I am working on a PySpark job that processes a large data set in the following format.

ID-1234567  iplong  agent   partner client  country timestamp   category    reference

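For each partner, I need to find the average number of duplicate records per one-minute time interval, where two records count as duplicates if they agree on columns 2 (iplong), 3 (agent), 5 (client), 6 (country), and 9 (reference).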

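I know I need to:

  1. Divide the records into one-minute intervals.
  2. Map everything by partner.
  3. Group everything by partner.
  4. Reduce each interval to its total number of records and its number of distinct records, and take the difference to get the number of duplicate records. (This also needs a function that compares two records using only the values in columns 2 (iplong), 3 (agent), 5 (client), 6 (country), and 9 (reference).)
  5. Add together the duplicate counts of each partner over all intervals, then divide by the number of intervals.

I understand the process, but I don't know the exact implementation in PySpark.

Can anyone help me implement any of the steps above in PySpark?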
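Before moving to Spark, the logic described above can be checked on a handful of rows in plain Python. The sketch below is only a reference for the arithmetic, not a PySpark implementation. The rows in `ROWS`, the partner names `p1` and `p2`, and the function name `average_duplicates` are made up for illustration, and it assumes that a partner's average is taken over the intervals in which that partner has at least one record.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows in the question's column order:
# id, iplong, agent, partner, client, country, timestamp, category, reference
ROWS = [
    ("1", "100", "Nokia", "p1", "c1", "in", "2012-03-08 00:00:05", "ad", "r1"),
    ("2", "100", "Nokia", "p1", "c1", "in", "2012-03-08 00:00:40", "mc", "r1"),  # duplicate of row 1
    ("3", "200", "MAUI", "p1", "c2", "us", "2012-03-08 00:00:50", "ad", ""),
    ("4", "100", "Nokia", "p1", "c1", "in", "2012-03-08 00:01:10", "ad", "r1"),  # next minute, so not a duplicate
    ("5", "300", "LG", "p2", "c3", "th", "2012-03-08 00:00:20", "co", "r2"),
]

# 0-based positions of iplong, agent, client, country, reference
KEY_COLUMNS = (1, 2, 4, 5, 8)
PARTNER, TIMESTAMP = 3, 6


def average_duplicates(rows):
    """Return {partner: mean number of duplicates per one-minute interval}."""
    # Bucket the comparison key of every record by (partner, minute).
    buckets = defaultdict(list)
    for row in rows:
        ts = datetime.strptime(row[TIMESTAMP], "%Y-%m-%d %H:%M:%S")
        minute = ts.replace(second=0)
        buckets[(row[PARTNER], minute)].append(tuple(row[i] for i in KEY_COLUMNS))

    # Duplicates in a bucket = total records - distinct records.
    per_partner = defaultdict(list)
    for (partner, _minute), keys in buckets.items():
        per_partner[partner].append(len(keys) - len(set(keys)))

    # Average over the intervals in which the partner appears (an assumption).
    return {p: sum(d) / len(d) for p, d in per_partner.items()}


print(average_duplicates(ROWS))  # {'p1': 0.5, 'p2': 0.0}
```

Comparing tuples built from the five key columns replaces the custom comparison function mentioned in the list: two records are duplicates exactly when their key tuples are equal.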

Sample data:

    9794474 1000460030  Samsung_S5233   dv4gs   dswae   in  2012-03-08 00:00:00 mg  riflql2a0yv8xoa9sq0recx4x
    9794471 3386480130  Nokia_C3-00 duq7h   dr75h   py  2012-03-08 00:00:00 co  
    9794468 1907980030  Nokia_5233  dv6i3   ds3xq   vn  2012-03-08 00:00:00 es  gp53lqr9njqd6z2ap5d364sip
    9794467 1791990020  MAUI    duxto   dvb8g   in  2012-03-08 00:00:00 ad  
    9794466 1791000060  Nokia_3110c dusg4   dvb8g   in  2012-03-08 00:00:00 ad  
    9794477 1353590020  Blackberry_9300 du6dt   dtr0u   es  2012-03-08 00:00:00 es  h5njsswvxorsau9u8fxh0e9se
    9794478 1402290050  NokiaC6-01.3    dusnc   dsgcn   ru  2012-03-08 00:00:00 mc  
    9794481 1848749950  Nokia_C3-00 dvry3   dr6sg   th  2012-03-08 00:00:01 mc  oj0rekb51pvirnjuqjt10zn4b
    

Update

So far I have tried loading the whole data set into MySQL and reading from it, but the read operations take too much time.

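For the MapReduce approach, I have tried various small things, but I don't understand how to take them further in code, so I have not been able to move forward with any one approach.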

    clicks_rdd = sc.parallelize(list(clicks_reader)[1:]) 
    minwise_clicks = clicks_rdd.groupby(clicks_rdd.index.map(lambda t: t.minute)) # Didn't work
    clicks_mapped_publishers = clicks_rdd.map(lambda x : (x.pop(3), x)) # Works fine but need the records divided into minute intervals first.
    

I have also tried a few other things here and there, but nothing solid.

Below are the first 25 records of my original data set file.

    id,iplong,agent,partnerid,cid,cntr,timeat,category,referer
    9794476,1071324855,SonyEricsson_K70,dv3va,dsfag,us,2012-03-08 00:00:00.0,ad,
    9794474,1000461055,Samsung_S5233,dv4gs,dswae,in,2012-03-08 00:00:00.0,mg,riflql2a0yv8xoa9sq0recx4x
    9794471,3386484265,Nokia_C3-00,duq7h,dr75h,py,2012-03-08 00:00:00.0,co,
    9794468,1907981997,Nokia_5233,dv6i3,ds3xq,vn,2012-03-08 00:00:00.0,es,gp53lqr9njqd6z2ap5d364sip
    9794467,1791989091,MAUI,duxto,dvb8g,in,2012-03-08 00:00:00.0,ad,
    9794466,1791002478,Nokia_3110c,dusg4,dvb8g,in,2012-03-08 00:00:00.0,ad,
    9794477,1353590316,Blackberry_9300,du6dt,dtr0u,es,2012-03-08 00:00:00.0,es,h5njsswvxorsau9u8fxh0e9se
    9794478,1402285217,NokiaC6-01.3,dusnc,dsgcn,ru,2012-03-08 00:00:00.0,mc,
    9794481,1848747204,Nokia_C3-00,dvry3,dr6sg,th,2012-03-08 00:00:01.0,mc,oj0rekb51pvirnjuqjt10zn4b
    9794482,1893182670,NokiaC2-03,du77a,dr6x2,id,2012-03-08 00:00:01.0,co,r63f8uhijvr2irvka3glwyb38
    9794483,1912930086,MAUI,dvwdj,dvb8g,id,2012-03-08 00:00:01.0,ad,
    9794485,2098816838,GT-S5360B,dvjtq,dr72e,th,2012-03-08 00:00:01.0,co,
    9794486,3309473440,MAUI,dv6i3,ds3k0,za,2012-03-08 00:00:01.0,es,
    9794492,702295934,Nokia_9300,dv6i3,dtqrw,ng,2012-03-08 00:00:01.0,es,onbw7na2mi8a62g4p6y3av2qt
    9794493,694135362,Nokia_N95,dupgf,dvb8g,sd,2012-03-08 00:00:01.0,ad,hoq05psulkszxm4izlql4g962
    9794495,1791428359,Samsung_S8300,dvpo7,dvb8g,in,2012-03-08 00:00:02.0,co,im387req0zp1ucygamhgadgtm
    9794496,1783607271,GT-S5570,du56s,dsgq2,in,2012-03-08 00:00:02.0,mc,immfap8948rebeym8ri0vf5cr
    9794498,1860189232,Samsung_GT-B3313,du56s,ds22r,in,2012-03-08 00:00:02.0,mc,r81nrzjemr5jrfvjjeoxmdm4y
    9794499,1868310973,Nokia_2730c,dv3va,drvnr,au,2012-03-08 00:00:02.0,ad,
    9794500,1893182511,Nokia_5233,dv6i7,dr6tn,id,2012-03-08 00:00:02.0,co,tq09jycwii12iul7hzalucue3
    9794501,1884230403,Samsung_GT-S3653,dvjil,ds92x,in,2012-03-08 00:00:02.0,mc,h0z1j3bwiverubvwg851e9eon
    9794503,1945382244,GT-S5360,dvijt,dsgq2,in,2012-03-08 00:00:02.0,mc,fbbenjzmoe0oc7x4e2080nj8x
    9794508,2928534854,Samsung_R310,dunsq,dsg3q,us,2012-03-08 00:00:02.0,ad,kl9j183hop90uwq2p82iidjsb
    9794510,3063717709,Samsung_GT-S3653,dvjjf,dr751,in,2012-03-08 00:00:02.0,ad,rpdt9h4kpooxiedeuuxvk6gi5
    9794511,3557769762,Samsung_C3050,du53k,dr71b,hr,2012-03-08 00:00:02.0,se,
    

Update 2

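Sample output, in tab-separated values format; you can copy and paste it into Excel to view it properly. Here avg_spiky_ReAgCnIpCi is the average number of duplicate referenceAgentCountryIPClient combinations per second. That is the feature I am interested in; I can then adapt the code to obtain the other features.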

    partnerid   status  avg_spiky_ReAgCnIpCi    std_spiky_ReAgCnIpCi    night_avg_spiky_ReAgCnIpCi  night_std_spiky_ReAgCnIpCi  morning_avg_spiky_ReAgCnIpCi    morning_std_spiky_ReAgCnIpCi    afternoon_avg_spiky_ReAgCnIpCi  afternoon_std_spiky_ReAgCnIpCi  evening_avg_spiky_ReAgCnIpCi    evening_std_spiky_ReAgCnIpCi    avg_spiky_ReAgCnIp  std_spiky_ReAgCnIp  avg_spiky_ReAgCn    std_spiky_ReAgCn    avg_spiky_iplong    std_spiky_iplong    avg_spiky_agent std_spiky_agent night_avg_spiky_agent   night_std_spiky_agent   morning_avg_spiky_agent morning_std_spiky_agent afternoon_avg_spiky_agent   afternoon_std_spiky_agent   evening_avg_spiky_agent evening_std_spiky_agent avg_spiky_cid   std_spiky_cid   avg_spiky_cntr  std_spiky_cntr  avg_spiky_referer   std_spiky_referer   night_avg_spiky_referer night_std_spiky_referer morning_avg_spiky_referer   morning_std_spiky_referer   afternoon_avg_spiky_referer afternoon_std_spiky_referer evening_avg_spiky_referer   evening_std_spiky_referer   category_es category_mc category_ad category_co category_se category_mg category_pp category_in category_gd category_ow total_clicks    distinct_iplong distinct_agent  distinct_cid    distinct_cntr   distinct_referer    night_click_percent morning_click_percent   afternoon_click_percent evening_click_percent   night_referer_percent   morning_referer_percent afternoon_referer_percent   evening_referer_percent night_agent_percent morning_agent_percent   afternoon_agent_percent evening_agent_percent   avg_total_clicks    std_total_clicks    avg_distinct_iplong std_distinct_iplong avg_distinct_agent  std_distinct_agent  avg_distinct_cid    std_distinct_cid    avg_distinct_cntr   std_distinct_cntr   avg_distinct_referer    std_distinct_referer    avg_null_agent  std_null_agent  avg_null_referer    std_null_referer    night_avg_null_referer  night_std_null_referer  morning_avg_null_referer    morning_std_null_referer    afternoon_avg_null_referer  afternoon_std_null_referer  evening_avg_null_referer    evening_std_null_referer    first_15_minute_percent second_15_minute_percent    third_15_minute_percent last_15_minute_percent  brand_MAUI_percent  brand_Nokia_percent brand_Generic_percent   brand_Apple_percent brand_Blackberry_percent    brand_Samsung_percent   brand_SonyEricsson_percent  brand_LG_percent    brand_other_percent avg_per_hour_density    std_per_hour_density    cntr_az_percent cntr_id_percent cntr_in_percent cntr_us_percent cntr_ng_percent cntr_tr_percent cntr_ru_percent cntr_th_percent cntr_sg_percent cntr_uk_percent cntr_other_percent
    du3nk   0   1.23    8.47    0   0   0   0   0   0   1.23    8.47    1.24    8.48    1.27    8.61    4.14    11.73   8.73    16.06   0   0   0   0   0   0   8.73    16.06   38.18   240.99  60  248 1.8 10.35   0   0   0   0   0   0   1.8 10.35   0   1   0   0   0   0   0   0   0   0   3360    644 250 61  31  1696    0   0   0   1   0   0   0   1   0   0   0   1   3360    0   644 0   250 0   61  0   31  0   1696    0   0   0   598 0   0   0   0   0   0   0   598 0   0.16    0.17    0.33    0.35    0.01    0   0.05    0   0   0   0   0   0   2   0   0   0   0.13    0   0   0   0   0   0.01    0   0.04
    du3nq   1   8.38    5.83    0   0   0   0   0   0   8.38    5.83    25.13   9.27    25.13   9.27    188.5   49.5    188.5   49.5    0   0   0   0   0   0   188.5   49.5    53.86   39.03   188.5   49.5    25.13   9.27    0   0   0   0   0   0   25.13   9.27    1   0   0   0   0   0   0   0   0   0   377 1   1   5   1   8   0   0   0   1   0   0   0   1   0   0   0   1   377 0   1   0   1   0   5   0   1   0   8   0   0   0   0   0   0   0   0   0   0   0   0   0   0.09    0.14    0.33    0.44    0   0   0   1   0   0   0   0   0   2   0   0   0   0   0   0   0   0   0   0   0   1
    du3op   0   30.43   46.87   0   0   0   0   44.67   59.63   19.75   30.19   35.5    48.84   35.5    48.84   71  52.27   71  52.27   0   0   0   0   134 0   39.5    33.5    13.31   8.24    71  52.27   35.5    48.84   0   0   0   0   67  62  19.75   30.19   0   0   1   0   0   0   0   0   0   0   213 1   1   6   1   1   0   0   0.63    0.37    0   0   1   1   0   0   1   1   213 0   1   0   1   0   6   0   1   0   1   0   0   0   205 0   0   0   0   0   129 0   76  0   0   0.09    0.25    0.66    0   1   0   0   0   0   0   0   0   3   0   0   0   0   0   0   0   0   0   0   0   1
    du3or   0   1   0   0   0   0   0   1   0   1   0   1   0   1   0   1   0   1   0   0   0   0   0   1   0   1   0   1   0   1   0   1   0   0   0   0   0   1   0   1   0   0   1   0   0   0   0   0   0   0   0   2   2   1   1   1   1   0   0   0.5 0.5 0   0   1   1   0   0   1   1   2   0   2   0   1   0   1   0   1   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0.5 0   0   0.5 0   0   0   0   0   0   1   0   0   2   0   0   1   0   0   0   0   0   0   0   0   0
    du3ov   0   1.01    0.11    0   0   0   0   0   0   1.01    0.11    1.01    0.11    1.01    0.11    44  30  29.33   31.63   0   0   0   0   0   0   29.33   31.63   6.29    5.59    44  30  1.02    0.21    0   0   0   0   0   0   1.02    0.21    0   0   0   0   1   0   0   0   0   0   88  1   2   10  1   86  0   0   0   1   0   0   0   1   0   0   0   1   88  0   1   0   2   0   10  0   1   0   86  0   0   0   0   0   0   0   0   0   0   0   0   0   0.84    0   0   0.16    0   0.94    0   0.06    0   0   0   0   0   2   0   0   0   0   0   0   0   0   0   0   0   1
    du3ox   0   1   0   0   0   0   0   0   0   1   0   1   0   1   0   1   0   1   0   0   0   0   0   0   0   1   0   1   0   1   0   1   0   0   0   0   0   0   0   1   0   0   1   0   0   0   0   0   0   0   0   1   1   1   1   1   1   0   0   0   1   0   0   0   1   0   0   0   1   1   0   1   0   1   0   1   0   1   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   1   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   1
    du3oy   0   1.02    0.12    0   0   0   0   0   0   1.02    0.12    1.02    0.15    1.02    0.15    64.5    31.5    32.25   35.55   0   0   0   0   0   0   32.25   35.55   7.59    6.03    64.5    31.5    1.03    0.28    0   0   0   0   0   0   1.03    0.28    0   0   0   0   1   0   0   0   0   0   129 1   3   12  1   124 0   0   0   1   0   0   0   1   0   0   0   1   129 0   1   0   3   0   12  0   1   0   124 0   0   0   0   0   0   0   0   0   0   0   0   0   0.26    0.58    0.16    0   0   0.95    0   0.04    0   0   0   0   0   2   0   0   0   0   0   0   0   0   0   0   0   1
    du3oz   1   1   0   0   0   0   0   1   0   0   0   1   0   33  3.35    1.01    0.08    165 0   0   0   0   0   165 0   0   0   27.5    8.18    165 0   33  3.35    0   0   0   0   33  3.35    0   0   1   0   0   0   0   0   0   0   0   0   165 164 1   6   1   5   0   0   1   0   0   0   1   0   0   0   1   0   165 0   164 0   1   0   6   0   1   0   5   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   1   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   1   0   0
    du3p1   0   1   0   0   0   0   0   1   0   0   0   1   0   18.2    16.11   1.01    0.07    91  80  0   0   0   0   91  80  0   0   15.17   14.82   91  80  18.2    16.11   0   0   0   0   18.2    16.11   0   0   1   0   0   0   0   0   0   0   0   0   182 181 1   6   1   5   0   0   1   0   0   0   1   0   0   0   1   0   182 0   181 0   1   0   6   0   1   0   5   0   0   0   0   0   0   0   0   0   0   0   0   0   0.06    0   0   0.94    0   1   0   0   0   0   0   0   0   2   0   0   0   0   0   0   0   0   0   1   0   0
    du3r7   0   3.63    1.32    0   0   0   0   0   0   3.63    1.32    29  0   29  0   29  0   29  0   0   0   0   0   0   0   29  0   3.63    1.32    29  0   29  0   0   0   0   0   0   0   29  0   0   0   0   0   1   0   0   0   0   0   29  1   1   8   1   1   0   0   0   1   0   0   0   1   0   0   0   1   29  0   1   0   1   0   8   0   1   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   1   0   0   1   0   0   0   0   0   0   0   0   0   0   1   0
    

1 answer:

Answer 0 (score: 1)

Initialization:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as f

# The pyspark shell already provides `spark`; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()

This loads the equivalent of the "first 25 records of my original data set file" from the question.

df = spark.read.load(path="file:///home/zht/PycharmProjects/test/disk_file", format='csv', sep=',', header=True)

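The next step is only there to make the result more striking: it truncates values so that duplicates appear in this small sample. It can be skipped.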

df = df.withColumn('iplong', f.substring('iplong', pos=0, len=1)) \
    .withColumn('agent', f.substring('agent', pos=0, len=1)) \
    .withColumn('client', f.substring('client', pos=0, len=2)) \
    .withColumn('partner', f.substring('partner', pos=0, len=2)) \
    .withColumn('timestamp',f.when(f.substring('id', pos=6, len=1) % 2 == 1, '2012-03-08 00:01:00.0').otherwise(df['timestamp']))
df.show()

+-------+------+-----+-------+------+-------+--------------------+--------+--------------------+
|     id|iplong|agent|partner|client|country|           timestamp|category|           reference|
+-------+------+-----+-------+------+-------+--------------------+--------+--------------------+
|9794476|     1|    S|     dv|    ds|     us|2012-03-08 00:01:...|      ad|                null|
|9794474|     1|    S|     dv|    ds|     in|2012-03-08 00:01:...|      mg|riflql2a0yv8xoa9s...|
|9794471|     3|    N|     du|    dr|     py|2012-03-08 00:01:...|      co|                null|
|9794468|     1|    N|     dv|    ds|     vn|2012-03-08 00:00:...|      es|gp53lqr9njqd6z2ap...|
|9794467|     1|    M|     du|    dv|     in|2012-03-08 00:00:...|      ad|                null|
|9794466|     1|    N|     du|    dv|     in|2012-03-08 00:00:...|      ad|                null|
|9794477|     1|    B|     du|    dt|     es|2012-03-08 00:01:...|      es|h5njsswvxorsau9u8...|
|9794478|     1|    N|     du|    ds|     ru|2012-03-08 00:01:...|      mc|                null|
|9794481|     1|    N|     dv|    dr|     th|2012-03-08 00:00:...|      mc|oj0rekb51pvirnjuq...|
|9794482|     1|    N|     du|    dr|     id|2012-03-08 00:00:...|      co|r63f8uhijvr2irvka...|
|9794483|     1|    M|     dv|    dv|     id|2012-03-08 00:00:...|      ad|                null|
|9794485|     2|    G|     dv|    dr|     th|2012-03-08 00:00:...|      co|                null|
|9794486|     3|    M|     dv|    ds|     za|2012-03-08 00:00:...|      es|                null|
|9794492|     7|    N|     dv|    dt|     ng|2012-03-08 00:01:...|      es|onbw7na2mi8a62g4p...|
|9794493|     6|    N|     du|    dv|     sd|2012-03-08 00:01:...|      ad|hoq05psulkszxm4iz...|
|9794495|     1|    S|     dv|    dv|     in|2012-03-08 00:01:...|      co|im387req0zp1ucyga...|
|9794496|     1|    G|     du|    ds|     in|2012-03-08 00:01:...|      mc|immfap8948rebeym8...|
|9794498|     1|    S|     du|    ds|     in|2012-03-08 00:01:...|      mc|r81nrzjemr5jrfvjj...|
|9794499|     1|    N|     dv|    dr|     au|2012-03-08 00:01:...|      ad|                null|
|9794500|     1|    N|     dv|    dr|     id|2012-03-08 00:00:...|      co|tq09jycwii12iul7h...|
+-------+------+-----+-------+------+-------+--------------------+--------+--------------------+
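To make the effect of that optional step concrete, here is a plain-Python sketch of what it appears to do to a single record, checked against the first row of the table above. The function name `coarsen` and the dict layout are illustrative assumptions. The indices rely on Spark's `substring` using 1-based positions, so `pos=6` corresponds to Python index 5, while `pos=0` behaves like the start of the string, as the output above shows.

```python
def coarsen(row):
    """Mimic the optional truncation step on one record (a dict of strings)."""
    out = dict(row)
    out["iplong"] = row["iplong"][:1]    # keep the first character
    out["agent"] = row["agent"][:1]
    out["client"] = row["client"][:2]    # keep the first two characters
    out["partner"] = row["partner"][:2]
    # The 6th character of the id decides the minute:
    # records with an odd digit there are moved to 00:01.
    if int(row["id"][5]) % 2 == 1:
        out["timestamp"] = "2012-03-08 00:01:00.0"
    return out


# First record of the question's file, using the answer's column names.
first = {
    "id": "9794476", "iplong": "1071324855", "agent": "SonyEricsson_K70",
    "partner": "dv3va", "client": "dsfag", "country": "us",
    "timestamp": "2012-03-08 00:00:00.0", "category": "ad", "reference": None,
}

result = coarsen(first)
print(result["iplong"], result["agent"], result["partner"],
      result["client"], result["timestamp"])
# 1 S dv ds 2012-03-08 00:01:00.0
```

Because only one or two characters of each field survive, many records collapse onto the same values, which is what produces counts greater than 1 in such a small sample.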

Key operation:

res = df.groupBy([f.window('timestamp', windowDuration='1 minutes'),'partner', 'iplong', 'agent']).count()
res = res.withColumn('total',f.sum('count').over(Window.partitionBy(["window", "partner"])))
res.show(n=30, truncate=False)

+---------------------------------------------+-------+------+-----+-----+-----+
|window                                       |partner|iplong|agent|count|total|
+---------------------------------------------+-------+------+-----+-----+-----+
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|du     |1     |N    |1    |7    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|du     |3     |N    |1    |7    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|du     |3     |S    |1    |7    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|du     |6     |N    |1    |7    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|du     |1     |B    |1    |7    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|du     |1     |G    |1    |7    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|du     |1     |S    |1    |7    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|dv     |3     |M    |1    |8    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|dv     |1     |N    |3    |8    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|dv     |2     |G    |1    |8    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|dv     |1     |G    |1    |8    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|dv     |1     |M    |1    |8    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|dv     |1     |S    |1    |8    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|dv     |3     |S    |1    |6    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|dv     |7     |N    |1    |6    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|dv     |1     |S    |3    |6    |
|[2012-03-08 00:01:00.0,2012-03-08 00:02:00.0]|dv     |1     |N    |1    |6    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|du     |2     |S    |1    |4    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|du     |1     |M    |1    |4    |
|[2012-03-08 00:00:00.0,2012-03-08 00:01:00.0]|du     |1     |N    |2    |4    |
+---------------------------------------------+-------+------+-----+-----+-----+

`count` is the number of records for each 1-minute window, partner, iplong, and agent.

`total` is the number of records for each 1-minute window and partner.

Is this what you mean?
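If the goal is the average number of duplicates per partner, the `count` and `total` columns are one step short of it. The plain-Python sketch below works through the remaining arithmetic on the 20 rows of the `res` table above: duplicates per (window, partner) are the total records minus the number of distinct groups, and those values are then averaged per partner. The tuple layout and variable names are illustrative. Note that `res` groups on `iplong` and `agent` only, not on all five columns from the question.

```python
from collections import defaultdict

# (window start, partner, count) copied from the `res` table;
# each tuple is one distinct (iplong, agent) group.
GROUPS = [
    ("00:01", "du", 1), ("00:01", "du", 1), ("00:01", "du", 1), ("00:01", "du", 1),
    ("00:01", "du", 1), ("00:01", "du", 1), ("00:01", "du", 1),
    ("00:00", "dv", 1), ("00:00", "dv", 3), ("00:00", "dv", 1),
    ("00:00", "dv", 1), ("00:00", "dv", 1), ("00:00", "dv", 1),
    ("00:01", "dv", 1), ("00:01", "dv", 1), ("00:01", "dv", 3), ("00:01", "dv", 1),
    ("00:00", "du", 1), ("00:00", "du", 1), ("00:00", "du", 2),
]

totals = defaultdict(int)    # records per (window, partner), i.e. `total`
distinct = defaultdict(int)  # distinct groups per (window, partner)
for window, partner, count in GROUPS:
    totals[(window, partner)] += count
    distinct[(window, partner)] += 1

# Duplicates per (window, partner) = total records - distinct groups.
duplicates = {key: totals[key] - distinct[key] for key in totals}

# Average the per-window duplicate counts for each partner.
per_partner = defaultdict(list)
for (_window, partner), dup in duplicates.items():
    per_partner[partner].append(dup)
averages = {p: sum(d) / len(d) for p, d in per_partner.items()}

print(averages)  # {'du': 0.5, 'dv': 2.0}
```

In PySpark the same two aggregations could probably be written as a further `groupBy('window', 'partner')` that computes `f.sum('count') - f.count('*')`, followed by a `groupBy('partner')` with `f.avg`. That translation is a suggestion and has not been run against this data.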