Question

This is the data in myRDD:

[u'#fields:excDate|schedDate|TZ|custID|muID|tvID|acdID|logonID|agentName|modify|exception|start|stop|LS Oracle Emp ID|Team Lead', u'06152016|06152016|CET|3|3000|1688|87||Ali, AbdElaziz|1465812004|Open|08:00|09:00|101021021|ElDeleify,Hisham']

How can I replace | with , so that I can build rows? Is there a better way to build a DataFrame from this kind of data?

Answer 1

>>> data = [u'#fields:excDate|schedDate|TZ|custID|muID|tvID|acdID|logonID|agentName|modify|exception|start|stop|LS Oracle Emp ID|Team Lead', u'06152016|06152016|CET|3|3000|1688|87||Ali, AbdElaziz|1465812004|Open|08:00|09:00|101021021|ElDeleify,Hisham']
>>> data = [item.replace("|", ",") for item in data]
>>> data
['#fields:excDate,schedDate,TZ,custID,muID,tvID,acdID,logonID,agentName,modify,exception,start,stop,LS Oracle Emp ID,Team Lead', '06152016,06152016,CET,3,3000,1688,87,,Ali, AbdElaziz,1465812004,Open,08:00,09:00,101021021,ElDeleify,Hisham']
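One caveat with a plain replace: two of the values in this row ("Ali, AbdElaziz" and "ElDeleify,Hisham") already contain commas, so the converted line no longer splits back into the original 15 fields. The sketch below assumes Python 3 and uses the standard csv module to quote such fields; the helper name pipe_to_csv is made up for illustration.

```python
import csv
import io

data = [
    u'#fields:excDate|schedDate|TZ|custID|muID|tvID|acdID|logonID|agentName|modify|exception|start|stop|LS Oracle Emp ID|Team Lead',
    u'06152016|06152016|CET|3|3000|1688|87||Ali, AbdElaziz|1465812004|Open|08:00|09:00|101021021|ElDeleify,Hisham',
]


def pipe_to_csv(line):
    """Convert one pipe-delimited line to a CSV line, quoting fields that contain commas."""
    buf = io.StringIO()
    csv.writer(buf).writerow(line.split("|"))
    # The writer appends a line terminator; strip it to get a bare line back.
    return buf.getvalue().rstrip("\r\n")


naive = data[1].replace("|", ",")
quoted = pipe_to_csv(data[1])

print(len(data[1].split("|")))          # number of fields in the original row
print(len(naive.split(",")))            # number of pieces after the naive replace
print(len(next(csv.reader([quoted]))))  # number of fields after a quoted round trip
```

If the goal is a DataFrame rather than a CSV file, splitting on the pipe directly avoids the ambiguity altogether.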

Answer 2

According to the Spark documentation for createDataFrame, one way to create a DataFrame is to pass the data as a list of lists and the header as a list.

data = [u'#fields:excDate|schedDate|TZ|custID|muID|tvID|acdID|logonID|agentName|modify|exception|start|stop|LS Oracle Emp ID|Team Lead', u'06152016|06152016|CET|3|3000|1688|87||Ali, AbdElaziz|1465812004|Open|08:00|09:00|101021021|ElDeleify,Hisham']

data = [d.split("|") for d in data]  # split each line into a list of fields

schema = data[0]  # the first row is actually the header, not data
data = data[1:]  # drop the header row from the data
schema[0] = schema[0].split(":", 1)[1]  # strip the "#fields:" prefix from the first column name
dataframe = sqlContext.createDataFrame(data, schema)
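The same steps can be wrapped in a small helper that also checks every row has as many fields as the header. This is only a sketch: the function name split_header_and_rows and its parameters are hypothetical, and the Spark call is left as a comment because it assumes an sqlContext that already exists.

```python
def split_header_and_rows(lines, sep="|", prefix="#fields:"):
    """Return (column_names, rows) from delimited lines whose first line is a header."""
    header, body = lines[0], lines[1:]
    if header.startswith(prefix):
        header = header[len(prefix):]  # drop the "#fields:" marker
    columns = header.split(sep)
    rows = [line.split(sep) for line in body]
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("expected %d fields, got %d" % (len(columns), len(row)))
    return columns, rows


data = [
    u'#fields:excDate|schedDate|TZ|custID|muID|tvID|acdID|logonID|agentName|modify|exception|start|stop|LS Oracle Emp ID|Team Lead',
    u'06152016|06152016|CET|3|3000|1688|87||Ali, AbdElaziz|1465812004|Open|08:00|09:00|101021021|ElDeleify,Hisham',
]

columns, rows = split_header_and_rows(data)
print(columns)
print(rows[0])

# With an existing sqlContext (assumed here, not created by this sketch):
# dataframe = sqlContext.createDataFrame(rows, columns)
```

Splitting on the pipe keeps values such as "Ali, AbdElaziz" intact, so no comma conversion is needed before building the DataFrame.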

Answer 3

You don't even need a for loop, assuming your string is the first element of a list called 'data':

data[0] = data[0].replace('|',',')

It does the job in one line, nice and easy.
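Note that this one-liner only converts data[0]. A minimal sketch, assuming Python 3 and a shortened stand-in for the question's list, of applying the same replacement to every element with the built-in map:

```python
# Shortened stand-in for the question's list (assumption: same pipe-delimited layout).
data = [u'#fields:excDate|schedDate|TZ', u'06152016|06152016|CET']

# map applies the replacement to every element; list() materializes the result in Python 3.
converted = list(map(lambda s: s.replace('|', ','), data))
print(converted)

# On an actual RDD of strings (such as myRDD in the question), RDD.map takes the same function:
# converted_rdd = myRDD.map(lambda s: s.replace('|', ','))
```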

Replacing the pipe (|) symbol with a comma (,) in Spark using Python
