Removing duplicate fields from input data in an RDD

Time: 2017-04-13 00:41:02

Tags: apache-spark pyspark rdd

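In my input CSV data, some rows contain duplicate fields and some are missing fields. I want to remove the duplicate fields from each row, and then every row should contain all of the fields, with NULL as the value wherever a row does not have that field.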

2 answers:

Answer 0: (score: 0)

Try this:

def transform(line):
    """
    >>> s = 'id:111|name:dave|age:33|city:london'
    >>> transform(s)
    ('id:111', {'name': 'dave', 'age': '33', 'city': 'london'})
    """
    bits = line.split("|")
    # The first field is kept whole and used as the row key
    key = bits[0]
    pairs = [v.split(":") for v in bits[1:]]
    # Keep only well-formed name:value pairs; a repeated field name
    # collapses to a single dict entry, which removes the duplicates
    return key, {kv[0].strip(): kv[1].strip() for kv in pairs if len(kv) == 2}

rdd = (sc
    .textFile("/tmp/sample")
    .map(transform))

Find the keys:

keys = rdd.values().flatMap(lambda d: d.keys()).distinct().collect()

Create the data frame:

df = rdd.toDF(["id", "map"])

And expand the map column into one column per key.
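The expansion step, one column per key with NULL where a row lacks the field, can be illustrated in plain Python without a Spark session. The sketch below is an assumption about what that step does, not code from the answer; `expand` and `records` are hypothetical names, and the sample values are made up to match the (id, dict) shape used above.

```python
def expand(record, keys):
    """Turn one (id, fields) pair into a flat row with one slot per key."""
    row_id, fields = record
    # dict.get returns None for a key this row lacks (NULL in a DataFrame)
    return (row_id,) + tuple(fields.get(k) for k in keys)


# Records shaped like the output of a parser that returns (id, dict)
records = [
    ("id:111", {"name": "dave", "age": "33", "city": "london"}),
    ("id:123", {"name": "jhon", "city": "newyork"}),
]

# Sorted union of all field names, so the column order is deterministic
keys = sorted({k for _, fields in records for k in fields})

rows = [expand(r, keys) for r in records]
for row in rows:
    print(row)
```

On a DataFrame with an `id` column and a `map` column, the equivalent is probably something like `df.select("id", *[df["map"][k].alias(k) for k in keys])`, since with default settings looking up an absent key in a map column gives NULL. Treat that line as an untested suggestion.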

Answer 1: (score: 0)

So I assume you have already created an rdd from the text file. I create one here:

rdd = spark.sparkContext.parallelize([
    (u'id:111', u'name:dave', u'dept:marketing', u'age:33', u'city:london'),
    (u'id:123', u'name:jhon', u'dept:hr', u'city:newyork'),
    (u'id:100', u'name:peter', u'dept:marketing', u'name:peter', u'age:30', u'city:london'),
    (u'id:222', u'name:smith', u'dept:finance', u'city:boston'),
    (u'id:234', u'name:peter', u'dept:service', u'name:peter', u'dept:service', u'age:32', u'city:richmond'),
])

I just wrote a function that maps the rdd to key/value pairs and also drops the duplicates:

from pyspark.sql import Row

def split_to_dict(l):
    l = list(set(l))  # drop duplicate fields here
    kv_list = []
    for e in l:
        k, v = e.split(':')
        kv_list.append({'key': k, 'value': v})
    return kv_list

# One Row(key, value) per distinct field of each input row
rdd_map = rdd.flatMap(lambda l: split_to_dict(l)).map(lambda x: Row(**x))
df = rdd_map.toDF()

To see a sample of the output, print the first 5 rows with df.show(5).
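Note that converting each row to a set loses the field order, and the flattened key/value rows no longer record which input row they came from. Here is a minimal sketch of one way to keep both, assuming the first field of every row is its `id:...` entry. The helper name `to_long_rows` is made up for illustration, and the sample row is copied from the data above.

```python
def to_long_rows(fields):
    """Split 'name:value' strings into records tagged with the row's id."""
    # dict.fromkeys drops repeated fields but keeps first-seen order
    unique = list(dict.fromkeys(fields))
    # Assumption: the first field is always the 'id:...' entry
    row_id = unique[0].split(":", 1)[1]
    records = []
    for field in unique[1:]:
        # Split on the first ":" only, so values may contain ":"
        key, value = field.split(":", 1)
        records.append({"id": row_id, "key": key, "value": value})
    return records


row = ("id:100", "name:peter", "dept:marketing", "name:peter", "age:30", "city:london")
for record in to_long_rows(row):
    print(record)
```

In Spark this could be applied with `rdd.flatMap(to_long_rows)`. After converting the result to a DataFrame, grouping by `id`, pivoting on `key` and aggregating with `first("value")` is one way to get back to a single row per id with NULL for absent fields.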