My input CSV data has some rows with duplicated fields and some with missing fields. I want to drop the duplicated fields from each row, and then every row should contain all of the fields, with a NULL value wherever a row does not have that field.
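For concreteness, a plausible sketch of the input and the desired result, assuming the pipe-delimited key:value records that both answers below work with (the sample lines are taken from Answer 1's data):

id:111|name:dave|dept:marketing|age:33|city:london
id:123|name:jhon|dept:hr|city:newyork
id:100|name:peter|dept:marketing|name:peter|age:30|city:london

After dropping the duplicate name:peter and filling missing fields with NULL, every row carries every field:

id   name   dept       age   city
111  dave   marketing  33    london
123  jhon   hr         NULL  newyork
100  peter  marketing  30    london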
Answer 0 (score: 0)
Try this:
Parse each line into a record key and a dictionary of fields:
def transform(line):
    """
    >>> s = 'id:111|name:dave|age:33|city:london'
    >>> transform(s)
    ('id:111', {'name': 'dave', 'age': '33', 'city': 'london'})
    """
    bits = line.split("|")
    key = bits[0]  # keep the leading id:... field as the record key
    pairs = [v.split(":") for v in bits[1:]]
    # fields without exactly one ':' are silently skipped
    return key, {kv[0].strip(): kv[1].strip() for kv in pairs if len(kv) == 2}
rdd = (sc
.textFile("/tmp/sample")
.map(transform))
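A side effect worth noting: duplicated fields are already collapsed at this point, because later key:value pairs overwrite earlier ones in the dictionary comprehension, so a line carrying name:peter twice ends up with a single name entry.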
Find the keys:
keys = rdd.values().flatMap(lambda d: d.keys()).distinct().collect()
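For the sample line in the docstring above, keys comes back as some ordering of ['name', 'age', 'city']; the id:... prefix is excluded because transform keeps it as the record key rather than as a map entry.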
Create the data frame:
df = rdd.toDF(["id", "map"])
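And expand the map into one column per key. A minimal sketch of this step, assuming the map column and the keys list built above (a key missing from a row's map comes back as NULL, which is exactly the fill the question asks for):

df.select(["id"] + [df["map"].getItem(k).alias(k) for k in keys])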
Answer 1 (score: 0)
So I assume you have already read the text file into an rdd. I'll create one here:
rdd = spark.sparkContext.parallelize([(u'id:111', u'name:dave', u'dept:marketing', u'age:33', u'city:london'),
                                      (u'id:123', u'name:jhon', u'dept:hr', u'city:newyork'),
                                      (u'id:100', u'name:peter', u'dept:marketing', u'name:peter', u'age:30', u'city:london'),
                                      (u'id:222', u'name:smith', u'dept:finance', u'city:boston'),
                                      (u'id:234', u'name:peter', u'dept:service', u'name:peter', u'dept:service', u'age:32', u'city:richmond')])
I just wrote a function that maps the rdd to key and value pairs, dropping the duplicates along the way, and build a DataFrame from the result so the first 5 rows can be shown:
from pyspark.sql import Row
def split_to_dict(l):
    l = list(set(l))  # set() drops the duplicated fields within a record
    kv_list = []
    for e in l:
        k, v = e.split(':')  # 'name:dave' -> ('name', 'dave')
        kv_list.append({'key': k, 'value': v})
    return kv_list
rdd_map = rdd.flatMap(lambda l: split_to_dict(l)).map(lambda x: Row(**x))
df = rdd_map.toDF()
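To actually print those first 5 rows, the usual call would be:

df.show(5)

The row order is not deterministic here, since set() reshuffles the fields. Also note that this flattens every record into independent (key, value) rows rather than pivoting each record into one row per id, so it only covers the deduplication half of the question.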