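Spark: extracting the domain from email addresses in a DataFrame

Time: 2018-06-19 07:39:52

Tags: scala apache-spark dataframe

I'm having trouble extracting the domain from email addresses. I have the following DataFrame.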

+---+----------------+
|id |email           |
+---+----------------+
|1  |ii@koko.com     |
|2  |lol@fsa.org     |
|3  |kokojambo@mon.eu|
+---+----------------+

Now I want a new column that holds the domain:

+---+----------------+------+
|id |email           |domain|
+---+----------------+------+
|1  |ii@koko.com     |koko  |
|2  |lol@fsa.org     |fsa   |
|3  |kokojambo@mon.eu|mon   |
+---+----------------+------+

I tried this:

val test = df_short.withColumn("email", split($"email", "@."))

But I got the wrong output. Can anyone point me to a better approach?

2 answers:

Answer 0 (score: 1)

You can do it like this:

    import org.apache.spark.sql.functions._

    // Split on either '@' or '.', then take the second element
    df.withColumn("domain", split(df.col("email"), "[@.]")(1)).show
    // or: take the part after '@', then the part before the first '.'
    df.withColumn("domain", split(split(df.col("email"), "@")(1), "\\.")(0)).show

Sample input:

+---------------+
|          email|
+---------------+
|manoj@gmail.com|
|      abc@ac.in|
+---------------+
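
The two split variants behave differently when the part before the '@' contains a dot. The sketch below illustrates this on hypothetical data (the `first.last@fsa.org` row, the `emails` DataFrame, and the column names `by_class` and `by_nested` are assumptions made for illustration, not part of the question). It also helps explain the original attempt: the pattern "@." is a regular expression meaning '@' followed by any single character, so it consumes the first letter of the domain.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

// Local session for experimentation; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("split-domain").getOrCreate()
import spark.implicits._

// Hypothetical data: the second row has a dot in the local part.
val emails = Seq("ii@koko.com", "first.last@fsa.org").toDF("email")

val result = emails
  // "[@.]" splits on either '@' or '.', so element 1 is the domain
  // only when the local part contains no dots.
  .withColumn("by_class", split(col("email"), "[@.]")(1))
  // Splitting on '@' first, then on a literal '.', handles that case.
  .withColumn("by_nested", split(split(col("email"), "@")(1), "\\.")(0))

result.show(false)
```

For `first.last@fsa.org`, `by_class` yields `last` while `by_nested` yields `fsa`, so the nested form is probably the safer choice if local parts may contain dots.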

Answer 1 (score: 1)

You can simply use the built-in regexp_extract function to get the domain from the email address.
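
A minimal sketch of the regexp_extract approach, using the data from the question. The pattern `@([^.]+)\.` and the name `withDomain` are assumptions chosen for illustration; the original answer's exact pattern is not preserved here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_extract}

// Local session for experimentation; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("email-domain").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "ii@koko.com"),
  (2, "lol@fsa.org"),
  (3, "kokojambo@mon.eu")
).toDF("id", "email")

// Capture group 1 holds the characters after '@' up to, but not including,
// the first '.' that follows it.
val withDomain = df.withColumn("domain", regexp_extract(col("email"), "@([^.]+)\\.", 1))

withDomain.show(false)
```

Rows whose email does not match the pattern get an empty string in `domain` rather than null, which may need handling depending on the data.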