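How does the PySpark mapPartitions function work?

Asked: 2014-11-04 17:51:37

Tags: python scala bigdata apache-spark

I am trying to learn Spark using Python (PySpark). I would like to know how the function mapPartitions works: what input it takes and what output it gives. I could not find any proper example on the internet. Let's say I have an RDD object containing lists, such as the one below.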

[ [1, 2, 3], [3, 2, 4], [5, 2, 7] ] 

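I want to remove the element 2 from all of the lists. How would I achieve that using mapPartitions?

4 answers:

Answer 0: (score: 24)

It is easier to use mapPartitions with a generator function, using the yield syntax: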

def filter_out_2(partition):
    for element in partition:
        if element != 2:
            yield element

filtered_lists = data.mapPartitions(filter_out_2)
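
To see what the generator receives, the sketch below emulates the per-partition call in plain Python, without Spark. The helper emulate_map_partitions and the two partition layouts are assumptions made up for illustration; they are not part of the PySpark API. It suggests that filter_out_2 removes 2 when each partition holds plain integers, but leaves the question's nested lists untouched, because each element it sees is then a whole list and a list is never equal to 2.

```python
def emulate_map_partitions(partitions, func):
    """Hypothetical stand-in for RDD.mapPartitions: call func once per partition."""
    return [list(func(iter(partition))) for partition in partitions]


def filter_out_2(partition):
    # Same generator as in the answer above.
    for element in partition:
        if element != 2:
            yield element


# Assumed layout 1: every partition holds plain integers.
flat_partitions = [[1, 2, 3], [3, 2, 4], [5, 2, 7]]
flat_result = emulate_map_partitions(flat_partitions, filter_out_2)

# Assumed layout 2: the question's lists spread over two partitions.
nested_partitions = [[[1, 2, 3], [3, 2, 4]], [[5, 2, 7]]]
nested_result = emulate_map_partitions(nested_partitions, filter_out_2)

print(flat_result)
print(nested_result)
```

For an RDD whose elements are themselves lists, the filtering therefore has to happen inside each element as well.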

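Answer 1: (score: 22)

mapPartitions should be thought of as a map operation over partitions rather than over the elements of a partition. Its input is the set of current partitions, and its output is another set of partitions.

The function you pass to map must take a single element of your RDD.

The function you pass to mapPartitions must take an iterable of your RDD's element type and return an iterable of some other type or of the same type.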
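
The difference between the two function shapes can be illustrated in plain Python. This is only a sketch: the two-partition layout and the call counter are assumptions for illustration, and the list comprehensions stand in for what Spark would do on its workers.

```python
calls = {"map": 0, "mapPartitions": 0}


def per_element(x):
    # Shape expected by map: one element in, one value out.
    calls["map"] += 1
    return x * 10


def per_partition(iterator):
    # Shape expected by mapPartitions: an iterator in, an iterable out.
    calls["mapPartitions"] += 1
    for x in iterator:
        yield x * 10


# Hypothetical layout: five elements split over two partitions.
partitions = [[1, 2], [3, 4, 5]]

mapped = [[per_element(x) for x in p] for p in partitions]
map_partitioned = [list(per_partition(iter(p))) for p in partitions]

print(mapped, calls["map"])
print(map_partitioned, calls["mapPartitions"])
```

Both approaches give the same values here, but the per-element function runs once per element while the per-partition function runs once per partition, which is why mapPartitions is usually chosen when there is per-call setup cost to amortize.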

In your case, you probably just want to do something like:
def filterOut2(line):
    return [x for x in line if x != 2]

filtered_lists = data.map(filterOut2)

If you want to use mapPartitions, it would be:

def filterOut2FromPartion(list_of_lists):
  final_iterator = []
  for sub_list in list_of_lists:
    final_iterator.append( [x for x in sub_list if x != 2])
  return iter(final_iterator)

filtered_lists = data.mapPartitions(filterOut2FromPartion)
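
One design point worth noting: a function that builds a full list and returns iter(...) has to consume the whole partition before it hands anything back, whereas a generator produces results on demand. The plain-Python sketch below illustrates this with a hypothetical tracking_iter helper that records which items have been pulled; the names eager and lazy are also made up for illustration.

```python
consumed = []


def tracking_iter(items):
    # Hypothetical helper: records each item as it is pulled from the partition.
    for item in items:
        consumed.append(item)
        yield item


def eager(list_of_lists):
    # List-building style: materializes every result before returning.
    out = []
    for sub_list in list_of_lists:
        out.append([x for x in sub_list if x != 2])
    return iter(out)


def lazy(list_of_lists):
    # Generator style: produces one filtered list per request.
    for sub_list in list_of_lists:
        yield [x for x in sub_list if x != 2]


data = [[1, 2, 3], [3, 2, 4], [5, 2, 7]]

eager_it = eager(tracking_iter(data))
eager_consumed_on_return = len(consumed)

consumed.clear()
lazy_it = lazy(tracking_iter(data))
lazy_consumed_on_return = len(consumed)
first = next(lazy_it)
lazy_consumed_after_one = len(consumed)
```

For small partitions the difference is negligible; for large ones the generator form avoids holding a second full copy of the partition in memory.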

Answer 2: (score: 0)

A final iterator over each inner list is needed:

def filter_out_2(partition):
    # Each element of the partition is itself a list.
    for element in partition:
        sec_iterator = []
        for i in element:
            if i != 2:
                sec_iterator.append(i)
        yield sec_iterator

filtered_lists = data.mapPartitions(filter_out_2)
for i in filtered_lists.collect(): print(i)

Answer 3: (score: -1)

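In the code above, I am able to get data out of the second for..in loop. Going by how generators behave, it should not produce values once the loop has already been iterated over.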
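
The generator behaviour in question can be checked in plain Python. The sketch below assumes the comment refers to a nested-loop generator like the one in Answer 2. It suggests that the generator object is one-shot, while the inner filtering starts afresh for every element, because each element is an ordinary list rather than an exhausted iterator.

```python
def filter_out_2(partition):
    for element in partition:
        # A new filtered list is built for every element,
        # so this inner pass always starts from the beginning.
        yield [i for i in element if i != 2]


gen = filter_out_2(iter([[1, 2, 3], [3, 2, 4]]))

first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # the generator is now exhausted

print(first_pass)
print(second_pass)
```

So values keep coming out of the inner loop on every outer iteration, and only a second pass over the generator itself comes back empty.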