Using PySpark

Date: 2016-01-28 06:05:18

Tags: apache-spark pyspark

I am using Spark in Python to aggregate over a set of fields. I take several values as input and create an array from them. Say I have 10 fields: I create an array of 2 elements, with the first 9 fields as the key and the 10th field as the value, and pass this to the reduceByKey method for aggregation.
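For illustration, a minimal sketch of that key/value pairing (the field values here are made up, and `to_pair` is a hypothetical helper, not part of the code below):

# hypothetical parsed record of 10 fields: first 9 form the key, the 10th is the value
record = ('11', '2015', '919810051877', 'MOC', 'LOCAL', 'A2LL', 'H', '40336', '237', '48')
pair = (record[:9], int(record[9]))   # ((first 9 fields), 10th field as the value)
# rdd.map(to_pair).reduceByKey(lambda a, b: a + b) would then sum the values per key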

The output I get still has the array's brackets, and I am trying to remove these brackets before saving the result as a text file in HDFS.

This is the output I get:

('11', '2015', '919810051877', '102', 'MOC', 'LOCAL', 'A2LL', 'H', '40336', '237', '1', '', '01', '102', '27/01/2016 01:05:36', 'hramcha'),48,0.8

I want it like this:

'11', '2015', '919810051877', '102', 'MOC', 'LOCAL', 'A2LL', 'H', '40336', '237', '1', '', '01', '102', '27/01/2016 01:05:36', 'hramcha',48,0.8

Here is the code. I have tried many things but cannot get rid of the brackets.

#! /usr/bin/env python

import re
import sys
import time
import decimal

from pyspark import SparkContext
from decimal import Decimal

# function to extract the fields required for aggregation from the CDR files, based on position 
def extractData(line):
    bal = line.strip()
    (MSISDN,calling_no,called_no,other_party_no,start_date,start_time,end_date,end_time,duration,call_direction,call_type,call_leg,usage_type,roam_type,
    IMSI,IMEI,technology,message_type,dialled_digits,first_LAC,first_cell_id,last_LAC,last_cell_id,Tt_file_no,Tt_date_time,SMSC_ID,GSM_CRN,IC_route,OG_route,
    carrier_code,location_routing_no,other_party_operator_id,MSC_id,MSC_circle,HOME_circle,pre_post_flag,disconnecting_party,disconnection_reason,
    internal_cause_LOC,MSRN,original_calling_no,original_called_no,A_party_country_code,ICR_operator,calling_no_port,called_no_port,translated_no,global_ref_no,
    GSM_SCF_control_of_AoC,SMS_result_code,level_of_camel_service,message_ref_no,network_ref_no,ori_method,type_of_called_subscriber,call_fwd_no,
    partial_indicator,CLI,call_identification_no,Irn_odbi_ind) = bal.split("|")

    time_s = start_time.split(":")
    hr = time_s[0]

    load_date = time.strftime("%d/%m/%Y %H:%M:%S")
    load_user = "hramcha"

    return (start_date,hr,MSC_circle,HOME_circle,call_direction,call_type,call_leg,roam_type,first_cell_id,first_LAC,
    disconnection_reason,internal_cause_LOC,technology,pre_post_flag,load_date,load_user,duration)

# function to join a record's fields into a CSV line before saving as a text file
def toCSVLine(data):
    return ','.join(str(d) for d in data)

# function to create a (key, value) pair of ((all the grouping fields), call_duration)
def formArray1(z):
    seg1 = z[:16]    # the first 16 fields form the composite key
    seg2 = z[16]     # call_duration is the value to aggregate
    return seg1, seg2

# function to append mins_of_usage (summed duration / 60) to each aggregated record
def calculateMins_of_usage(v):
    return v[0], v[1], float(v[1]) / 60


# create Spark Context with the master details and the application name
sc = SparkContext("spark://192.168.200.128:7077", "network_usage_HLY")

# create an RDD from the input data in HDFS
file = sc.textFile("hdfs://quickstart.cloudera/user/cloudera/input/network/",use_unicode=False) 

# group by all the fields & aggregate (sum) the call_duration
data_aggregated = file.map(extractData).map(formArray1).reduceByKey(lambda a, b: int(a) + int(b))

# join the aggregation & calculate mins_of_usage
min_of_usage = data_aggregated.map(calculateMins_of_usage)

# calling toCSVLine function to remove brackets & braces
lines = min_of_usage.map(toCSVLine)

# save the final aggregated file in HDFS 
lines.saveAsTextFile("hdfs://quickstart.cloudera/user/cloudera/output/network")

1 answer:

Answer 0 (score: 0):

Well, what you can do is first convert the array to a string and then use a regular expression to replace the brackets, like this:

lines = min_of_usage.map(toCSVLine).map(lambda x: re.sub(r'\(|\)', '', str(x)))
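Alternatively, since the leftover parentheses come from the nested key tuple inside each record, you could flatten the record in toCSVLine itself. A sketch, assuming the (key_tuple, summed_duration, mins_of_usage) shape produced by calculateMins_of_usage above:

# sketch: flatten the nested key tuple before joining, so no parentheses appear
def toCSVLine(data):
    key, total, mins = data   # (key_tuple, summed_duration, mins_of_usage)
    return ','.join(str(d) for d in key + (total, mins))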