Question

Suppose we have the following sample data:

1,John,Martinez,North Lauderdale,20160101,1
2,John,Martinez,Plantation,20170101,2
3,John,Martinez,North Lauderdale,20161022,1
4,John,Martinez,Pembroke Pines,20181231,0
5,John,Martinez,Plantation,20190101,3
6,John,Martinez,Plantation,20200101,1
7,John,Martinez,Plantation,20210101,9

I want to check the last value of each line in the sample file, i.e. 1, 2, 1, 0, 3, 1, 9.

def func(input):
    if str(input[5]) is "1":
        rdd_trdln = input.map(lambda line: (line, "A"))
    else:
        rdd_trdln = input.map(lambda line: (line, "O"))
        return rdd_trdln
input = sc.textFile("file.txt").map(lambda line: line.split('\t'))
return_FirstFunc = input.map(func)

The error I get:

AttributeError: 'list' object has no attribute 'map'

Answer 1

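There is a difference between Spark's RDD.map() and the regular Python map() function.

When you write sc.textFile("file.txt").map(lambda line: line.split('\t')), you create an RDD of Python lists. So when you call input.map(func), func is invoked once per element and needs to accept a list, not an RDD.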

So the input.map call inside your function is what raises the error:

'list' object has no attribute 'map'

This is a Python error, not a Spark error.
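
To see that the error comes from plain Python rather than from Spark, here is a minimal sketch that needs no Spark at all. The names row and message are made up for illustration; row stands in for a single element of the RDD after the split.

```python
# One element of the RDD after splitting: an ordinary Python list.
row = "1,John,Martinez,North Lauderdale,20160101,1".split(",")

message = None
try:
    # Lists have no .map method; only the RDD itself does.
    row.map(lambda field: (field, "A"))
except AttributeError as exc:
    message = str(exc)

print(message)
```

The function passed to RDD.map only ever sees values like row, so it has to work on a list and return a plain value.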

If you only want to append a character to each list, your code would be:

def func(input):
    if input[5] == "1":
        input.append("A")
    else:
        input.append("O")
    return input

Or, more Pythonic:

def func(input):
    input.append("A" if input[5] == "1" else "O")
    return input
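
The per-row logic can be checked locally before it goes anywhere near a cluster. The sketch below is a variant that returns a new list instead of mutating its argument; add_flag and SAMPLE_LINES are hypothetical names, and the rows are the sample data from the question, split on commas.

```python
SAMPLE_LINES = [
    "1,John,Martinez,North Lauderdale,20160101,1",
    "2,John,Martinez,Plantation,20170101,2",
    "3,John,Martinez,North Lauderdale,20161022,1",
    "4,John,Martinez,Pembroke Pines,20181231,0",
    "5,John,Martinez,Plantation,20190101,3",
    "6,John,Martinez,Plantation,20200101,1",
    "7,John,Martinez,Plantation,20210101,9",
]


def add_flag(row):
    """Return a new list with "A" appended if the sixth field is "1", else "O"."""
    return row + ["A" if row[5] == "1" else "O"]


# Plain lists stand in for the elements of the RDD.
rows = [line.split(",") for line in SAMPLE_LINES]
flagged = [add_flag(row) for row in rows]
flags = [row[-1] for row in flagged]
print(flags)
```

Under the same assumptions, the Spark version would pass add_flag to map on an RDD whose lines were split on commas.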

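Alternatively, you can define your function to take the whole line as a string and split it inside the function.
An RDD of lists gets messy and can be hard to keep track of.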

def convert_func(line):
    """
    This is not returning an RDD. It returns a Python string
    """
    splits = line.split(',') # Your lines are not tab-delimited
    splits.append("A" if splits[5] == "1" else "O")
    return ",".join(splits)

lines = sc.textFile("file.txt")
converted_lines = lines.map(convert_func)

You can test it like this:

for line in converted_lines.collect():
    print(line)
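
Because convert_func is an ordinary string-to-string function, it can also be tried without Spark. This is only a local sketch: the function is repeated so the snippet runs on its own, and sample_lines is a made-up name holding two lines from the question.

```python
def convert_func(line):
    """Append ",A" if the sixth comma-separated field is "1", else ",O"."""
    splits = line.split(",")
    splits.append("A" if splits[5] == "1" else "O")
    return ",".join(splits)


sample_lines = [
    "1,John,Martinez,North Lauderdale,20160101,1",
    "2,John,Martinez,Plantation,20170101,2",
]

# The built-in map plays the role that RDD.map plays on the cluster.
converted = list(map(convert_func, sample_lines))
for line in converted:
    print(line)
```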

Answer 2

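On the pure Python side, if you want to map over a standard list, you can use the built-in map function. An RDD is not itself a Python iterable, so collect it into a list first:

input = map(lambda line: line.split('\t'), sc.textFile("file.txt").collect())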

Note that Python 3 produces a different result type (a map iterator) than Python 2 (a list).
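
A short sketch of that difference under Python 3, using made-up tab-separated lines in place of the file contents:

```python
lines = ["1\tJohn\tMartinez", "2\tJane\tDoe"]

# In Python 3, map returns a lazy iterator, not a list.
mapped = map(lambda line: line.split("\t"), lines)
is_list = isinstance(mapped, list)

# Wrap it in list() to materialize the results.
rows = list(mapped)
print(is_list, rows)
```

The iterator can be consumed only once, so a second list(mapped) call would return an empty list.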

In PySpark, how do I send an RDD to a function that compares values and returns another RDD?
