Replacing the pipe (|) symbol with a comma (,) in Spark using Python

Time: 2016-06-15 15:21:52

Tags: python apache-spark

The rows data in myRDD:

[u'#fields:excDate|schedDate|TZ|custID|muID|tvID|acdID|logonID|agentName|modify|exception|start|stop|LS Oracle Emp ID|Team Lead', u'06152016|06152016|CET|3|3000|1688|87||Ali, AbdElaziz|1465812004|Open|08:00|09:00|101021021|ElDeleify,Hisham']

How can I replace | with , so that I can build a dataframe? Is there a better way to build a dataframe from this kind of data?

3 answers:

Answer 0: (score: 2)

>>> data = [u'#fields:excDate|schedDate|TZ|custID|muID|tvID|acdID|logonID|agentName|modify|exception|start|stop|LS Oracle Emp ID|Team Lead', u'06152016|06152016|CET|3|3000|1688|87||Ali, AbdElaziz|1465812004|Open|08:00|09:00|101021021|ElDeleify,Hisham']
>>> data = [item.replace("|", ",") for item in data]
>>> data
['#fields:excDate,schedDate,TZ,custID,muID,tvID,acdID,logonID,agentName,modify,exception,start,stop,LS Oracle Emp ID,Team Lead', '06152016,06152016,CET,3,3000,1688,87,,Ali, AbdElaziz,1465812004,Open,08:00,09:00,101021021,ElDeleify,Hisham']
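Note that two of the values in the data row already contain commas ('Ali, AbdElaziz' and 'ElDeleify,Hisham'), so after a plain replace they can no longer be told apart from delimiters. A minimal sketch, assuming the goal is valid CSV text, is to split on the pipe and let the standard csv module quote such fields. The helper name pipe_to_csv is hypothetical and not part of the original answer.

```python
import csv
import io

def pipe_to_csv(line):
    """Convert one pipe-delimited line to a CSV line.

    Fields that contain a comma are wrapped in double quotes by csv.writer,
    so they survive a later round trip through a CSV parser.
    """
    fields = line.split("|")
    buf = io.StringIO()
    # lineterminator="" keeps the result a single line without a trailing newline
    csv.writer(buf, lineterminator="").writerow(fields)
    return buf.getvalue()

row = u'06152016|06152016|CET|3|3000|1688|87||Ali, AbdElaziz|1465812004|Open|08:00|09:00|101021021|ElDeleify,Hisham'
converted = pipe_to_csv(row)
print(converted)
```

Parsing the converted line again with csv.reader gives back the same 15 fields as splitting the original line on the pipe.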

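Answer 1: (score: 2)

According to the Spark documentation for createDataFrame, one way to create a dataframe is to pass the data as a list of lists and the header as a list.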

data = [u'#fields:excDate|schedDate|TZ|custID|muID|tvID|acdID|logonID|agentName|modify|exception|start|stop|LS Oracle Emp ID|Team Lead', u'06152016|06152016|CET|3|3000|1688|87||Ali, AbdElaziz|1465812004|Open|08:00|09:00|101021021|ElDeleify,Hisham']

data = [d.split("|") for d in data]  # create a list of lists, one inner list per line

schema = data[0]  # the first row is actually the header, not data
data = data[1:]  # remove the header row from the data
schema[0] = schema[0].split(":", 1)[1]  # strip the "#fields:" prefix from the first column name
dataframe = sqlContext.createDataFrame(data, schema)  # assumes an existing sqlContext
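The same steps can be wrapped in a small function that also checks that every record has as many fields as the header, which is worth verifying before handing the rows to Spark. This is only a sketch under stated assumptions: the function name parse_pipe_lines and the prefix parameter are hypothetical, and the createDataFrame call is left as a comment because it needs a live sqlContext.

```python
def parse_pipe_lines(lines, prefix="#fields:"):
    """Split pipe-delimited lines into (column_names, records).

    The first line is treated as the header; an optional prefix such as
    "#fields:" is stripped from its first column name.
    """
    split_lines = [line.split("|") for line in lines]
    header, records = split_lines[0], split_lines[1:]
    if header[0].startswith(prefix):
        header[0] = header[0][len(prefix):]
    for number, record in enumerate(records, start=1):
        if len(record) != len(header):
            raise ValueError(
                "record %d has %d fields, expected %d"
                % (number, len(record), len(header))
            )
    return header, records

data = [u'#fields:excDate|schedDate|TZ|custID|muID|tvID|acdID|logonID|agentName|modify|exception|start|stop|LS Oracle Emp ID|Team Lead', u'06152016|06152016|CET|3|3000|1688|87||Ali, AbdElaziz|1465812004|Open|08:00|09:00|101021021|ElDeleify,Hisham']

schema, rows = parse_pipe_lines(data)
print(schema)
print(rows)
# With a running Spark context, the result could then be passed on:
# dataframe = sqlContext.createDataFrame(rows, schema)
```

Because the lines are split on the pipe and never converted to commas, values such as 'Ali, AbdElaziz' stay in a single column.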

Answer 2: (score: 0)

It doesn't even need a for loop, assuming your string is called 'data':

data[0] = data[0].replace('|',',')

Nice and easy, done in one line.
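Keep in mind that this one-liner converts only the first element, data[0]. A minimal sketch of applying the same replacement to every element follows; the function name pipes_to_commas is hypothetical, and the comment about myRDD.map assumes myRDD is an RDD of strings as in the question.

```python
def pipes_to_commas(line):
    """Replace every pipe in a single line with a comma."""
    return line.replace("|", ",")

data = [u'#fields:excDate|schedDate|TZ|custID|muID|tvID|acdID|logonID|agentName|modify|exception|start|stop|LS Oracle Emp ID|Team Lead', u'06152016|06152016|CET|3|3000|1688|87||Ali, AbdElaziz|1465812004|Open|08:00|09:00|101021021|ElDeleify,Hisham']

# Plain Python: apply the function to each element of the list
converted = [pipes_to_commas(line) for line in data]
print(converted)

# With Spark, the same function could be applied to each element of the RDD:
# converted_rdd = myRDD.map(pipes_to_commas)
```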