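I am trying to learn Spark using Python (PySpark). I want to know how the function mapPartitions works: what input it takes and what output it produces. I could not find a proper example anywhere on the internet. Say I have an RDD object containing lists, as shown below.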
[ [1, 2, 3], [3, 2, 4], [5, 2, 7] ]
I want to remove the element 2 from all of the lists. How can I achieve that using mapPartitions?
Answer 0 (score: 24)
It is easier to use mapPartitions with a generator function, using the yield syntax:
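def filter_out_2(partition):
    # partition is an iterator over the elements of one partition
    for element in partition:
        # This compares each RDD element to 2, so it suits a flat RDD of numbers.
        # If each element is itself a list, filter inside the list instead.
        if element != 2:
            yield element
filtered_lists = data.mapPartitions(filter_out_2)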
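As a quick check without a Spark cluster, the generator can be called directly on a plain Python iterator standing in for one partition. This is only a sketch: `filter_out_2_nested` is a hypothetical variant that is not part of the answer above, and the sample lists are the ones from the question. It suggests that when each RDD element is itself a list, the comparison has to happen inside each list.

```python
def filter_out_2(partition):
    # Generator from the answer: compares each partition element to 2.
    for element in partition:
        if element != 2:
            yield element


def filter_out_2_nested(partition):
    # Hypothetical variant: each element is a list, so filter inside it.
    for sub_list in partition:
        yield [x for x in sub_list if x != 2]


# Plain iterators standing in for one partition (assumption: no Spark involved).
flat_result = list(filter_out_2(iter([1, 2, 3, 2, 4])))
lists_result = list(filter_out_2(iter([[1, 2, 3], [3, 2, 4], [5, 2, 7]])))
nested_result = list(filter_out_2_nested(iter([[1, 2, 3], [3, 2, 4], [5, 2, 7]])))

print(flat_result)    # [1, 3, 4]
print(lists_result)   # [[1, 2, 3], [3, 2, 4], [5, 2, 7]]
print(nested_result)  # [[1, 3], [3, 4], [5, 7]]
```

A list is never equal to the integer 2, so the element-level comparison lets every list through unchanged.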
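Answer 1 (score: 22)
mapPartitions should be thought of as a map operation over partitions, not over the elements of a partition. Its input is the set of current partitions, and its output is another set of partitions.
The function you pass to map must take a single element of your RDD.
The function you pass to mapPartitions must take an iterable of your RDD's element type and return an iterable of some other type or of the same type.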
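The difference can be sketched in plain Python. The helpers `simulate_map` and `simulate_map_partitions` below are hypothetical stand-ins, not PySpark APIs, and the two-partition layout is an assumption chosen for illustration. They only mimic how often the user function is called and what it receives.

```python
def simulate_map(partitions, func):
    # Hypothetical stand-in for RDD.map: func is called once per element.
    return [[func(element) for element in part] for part in partitions]


def simulate_map_partitions(partitions, func):
    # Hypothetical stand-in for RDD.mapPartitions: func is called once per
    # partition, receives an iterator, and must return an iterable.
    return [list(func(iter(part))) for part in partitions]


calls = {"per_element": 0, "per_partition": 0}


def remove_2(line):
    # Receives a single RDD element (one list).
    calls["per_element"] += 1
    return [x for x in line if x != 2]


def remove_2_from_partition(lines):
    # Receives an iterator over all lists in one partition.
    calls["per_partition"] += 1
    return ([x for x in line if x != 2] for line in lines)


# Assumed layout: three lists spread over two partitions.
partitions = [[[1, 2, 3], [3, 2, 4]], [[5, 2, 7]]]

mapped = simulate_map(partitions, remove_2)
mapped_by_partition = simulate_map_partitions(partitions, remove_2_from_partition)

print(mapped)               # [[[1, 3], [3, 4]], [[5, 7]]]
print(mapped_by_partition)  # [[[1, 3], [3, 4]], [[5, 7]]]
print(calls)                # {'per_element': 3, 'per_partition': 2}
```

Both approaches give the same result here. The per-partition form mainly pays off when there is setup work worth doing once per partition rather than once per element.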
In your case, you probably just want to do something like:
def filterOut2(line):
    return [x for x in line if x != 2]
filtered_lists = data.map(filterOut2)
If you want to use mapPartitions, it would be:
def filterOut2FromPartion(list_of_lists):
    final_iterator = []
    for sub_list in list_of_lists:
        final_iterator.append([x for x in sub_list if x != 2])
    return iter(final_iterator)
filtered_lists = data.mapPartitions(filterOut2FromPartion)
Answer 2 (score: 0)
A final iterator is needed:
def filter_out_2(partition):
    for element in partition:
        sec_iterator = []
        for i in element:
            if i != 2:
                sec_iterator.append(i)
        yield sec_iterator
filtered_lists = data.mapPartitions(filter_out_2)
for i in filtered_lists.collect():
    print(i)
Answer 3 (score: -1)
using System;
using System.Collections.Generic;
using System.Linq.Expressions;
using System.Reflection;
using Microsoft.CSharp.RuntimeBinder;
public static class ReflectionHelpers
{
    // Not readonly: the cache is created lazily per thread inside the method below.
    [ThreadStatic]
    static Dictionary<KeyValuePair<Type, Type>, bool> ImplicitCastCache;

    /// <summary>Returns true iff casting between values of the specified
    /// types is possible based on the rules of C#.</summary>
    public static bool IsImplicitlyCastableTo(this Type from, Type to)
    {
        if (from == to)
            return true;
        var key = new KeyValuePair<Type, Type>(from, to);
        ImplicitCastCache ??= new Dictionary<KeyValuePair<Type, Type>, bool>();
        if (ImplicitCastCache.TryGetValue(key, out bool result))
            return result;
        if (to.IsAssignableFrom(from))
            return ImplicitCastCache[key] = true;
        var method = GetMethodInfo(() => IsImplicitlyCastableCore<int, int>())
            .GetGenericMethodDefinition().MakeGenericMethod(from, to);
        return ImplicitCastCache[key] = (bool)method.Invoke(null, Array.Empty<object>());
    }

    static bool IsImplicitlyCastableCore<TFrom, TTo>()
    {
        var testObject = new LinkedListNode<TTo>(default(TTo));
        try {
            ((dynamic)testObject).Value = default(TFrom);
            return true;
        } catch (Exception e) {
            // e.g. "Cannot implicitly convert type 'A' to 'B'. An explicit conversion exists (are you missing a cast?)"
            // The exception may be caused either because no conversion is available,
            // OR because it IS available but the conversion method threw something.
            // Assume RuntimeBinderException means the conversion does not exist.
            return !(e is RuntimeBinderException);
        }
    }

    /// <summary><c>GetMethodInfo(() => M(args))</c> gets the MethodInfo object corresponding to M.</summary>
    public static MethodInfo GetMethodInfo(Expression<Action> shape) => ((MethodCallExpression)shape.Body).Method;
}
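In the code above, I am able to get data in the second for..in loop. Given how generators work, it should not produce values once the loop has already been iterated over.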
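One possible reading of this observation can be reproduced in plain Python, without Spark. In the sketch below (variable names are illustrative, and a plain iterator stands in for a partition), the generator is exhausted after one pass, but each value it yields is an ordinary list that can be looped over as often as needed.

```python
def filter_out_2(partition):
    # Yields one filtered list per element of the partition.
    for element in partition:
        yield [i for i in element if i != 2]


gen = filter_out_2(iter([[1, 2, 3], [3, 2, 4], [5, 2, 7]]))

first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # the generator is now exhausted

# The yielded values are plain lists, so they can be iterated repeatedly.
inner = first_pass[0]
inner_twice = [list(inner), list(inner)]

print(first_pass)   # [[1, 3], [3, 4], [5, 7]]
print(second_pass)  # []
print(inner_twice)  # [[1, 3], [1, 3]]
```

Likewise, `collect()` returns a regular Python list, so looping over its result is not subject to generator exhaustion.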