Question

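from pyspark.sql import Row

# sc is assumed to be an existing SparkContext (e.g. the one the pyspark shell provides)
LabeledDocument = Row("id", "text", "label")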
sc.parallelize([(0, "a b c d e spark", 1.0),
                 (1, "b d", 0.0),
                 (2, "spark f g h", 1.0),
                 (3, "hadoop mapreduce", 0.0)]) \
    .map(lambda x: LabeledDocument(*x)).first()

This code produces output like:

Row(id=0, text='a b c d e spark', label=1.0)

But if you omit the * in the lambda expression, i.e.

sc.parallelize([(0, "a b c d e spark", 1.0),
                     (1, "b d", 0.0),
                     (2, "spark f g h", 1.0),
                     (3, "hadoop mapreduce", 0.0)]) \
        .map(lambda x: LabeledDocument(x)).first()

you get the output:

Row(id=(0, 'a b c d e spark', 1.0))

Can someone tell me how * splits the tuple and assigns its elements to the columns of the Row?

Answer 1

x = [1, 2, 3]
print(x)
# => [1, 2, 3]
print(*x)
# => 1 2 3 # equivalent to print(1, 2, 3)

*x spreads the list (or tuple) x out into separate arguments.

In the same way, LabeledDocument(x) is equivalent to LabeledDocument((0, "a b c d e spark", 1.0)) (one argument, a tuple), whereas LabeledDocument(*x) is equivalent to LabeledDocument(0, "a b c d e spark", 1.0) (three arguments: a number, a string, and a number).
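The same difference can be reproduced without Spark. The sketch below uses a hypothetical plain function, make_document, as a stand-in for LabeledDocument; it is not part of PySpark. Unlike the Row class in the question, which accepted the single tuple and put it in the id column, a function with three required parameters rejects the un-starred call outright.

```python
# make_document is a hypothetical stand-in for LabeledDocument (not a PySpark API).
def make_document(doc_id, text, label):
    return {"id": doc_id, "text": text, "label": label}

x = (0, "a b c d e spark", 1.0)

# With the star: the tuple is unpacked into three positional arguments.
doc = make_document(*x)
same = make_document(0, "a b c d e spark", 1.0)

# Without the star: the whole tuple is passed as the first argument only,
# so the two remaining required parameters are missing.
try:
    make_document(x)
    rejected = False
except TypeError:
    rejected = True

print(doc == same)  # True
print(rejected)     # True
```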

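In Ruby it is called "splat", because the asterisk (*) looks like a splat, and it splats a list into function arguments and vice versa. In the Python community, I am not sure whether it has an agreed-upon name.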
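The "vice versa" direction can be sketched as well: in a function definition, a starred parameter gathers the positional arguments into a tuple, and in an assignment a starred name gathers the leftover items into a list. The helper name collect is made up for illustration. For the call-site form, the Python tutorial has a section titled "Unpacking Argument Lists".

```python
# collect is a hypothetical helper: *args gathers all positional arguments into a tuple.
def collect(*args):
    return args

packed = collect(1, 2, 3)       # the three arguments arrive as one tuple
round_trip = collect(*packed)   # unpack at the call site, gather again in the definition

# The star also works on the left side of an assignment (Python 3).
first, *rest = [1, 2, 3]

print(packed)       # (1, 2, 3)
print(round_trip)   # (1, 2, 3)
print(first, rest)  # 1 [2, 3]
```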

See the Python docs for details.

What does * mean in a lambda expression in Spark?
