Question

I am using CombineTextInputFormat to read many small files on Spark.

The Java code is as follows (I have written it as a utility function):

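import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

// Reads many small text files into one RDD of lines, packing the files into
// combined splits whose size is governed by maxSplitSize (in bytes).
public static JavaRDD<String> combineTextFile(JavaSparkContext sc, String path, long maxSplitSize, boolean recursive)
{
    Configuration conf = new Configuration();
    conf.setLong(CombineTextInputFormat.SPLIT_MAXSIZE, maxSplitSize);
    if (recursive)
        conf.setBoolean(CombineTextInputFormat.INPUT_DIR_RECURSIVE, true);
    return
        sc.newAPIHadoopFile(path, CombineTextInputFormat.class, LongWritable.class, Text.class, conf)
        // Keep only the line text; the LongWritable key is dropped.
        .map(new Function<Tuple2<LongWritable, Text>, String>()
        {
            @Override
            public String call(Tuple2<LongWritable, Text> tuple) throws Exception
            {
                return tuple._2().toString();
            }
        });
}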

It works, but when the program runs, the following warning is printed:

WARN TaskSetManager: Stage 0 contains a task of very large size (159 KB). The maximum recommended task size is 100 KB.

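In total the program reads about 3.5 MB across 1234 files. All of the files are in a single directory.

Is this normal? If not, how can I get rid of this message?

My Spark version is 1.3.

The program runs in local mode.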

Answer 1

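I don't have an answer to your question itself, but you might want to try a different way of processing all the files under a directory.

Spark can process an entire directory, not just a single file. If all the files you want are in a single directory (as in your case), sc.textFile reads every file in it when given the directory's location, for example:

sc.textFile("//my/folder/with/files");

You can find more information in this question: How to read multiple text files into a single RDD?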
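
One trade-off to keep in mind with the directory approach: plain text input never puts two files into the same split, so reading 1234 files this way gives at least 1234 partitions, whereas a combined input format packs many small files into each split. The sketch below is only back-of-the-envelope arithmetic on the numbers from the question. It does not call Spark or Hadoop, the class and method names are made up for illustration, and the 1 MB maximum split size is an assumption because the question does not say which value was passed.

```java
// Rough partition-count estimates for the numbers quoted in the question.
// Hypothetical helper class: nothing here is part of Spark or Hadoop.
class SplitEstimate {

    // Plain text input creates at least one split per file,
    // so the file count is a lower bound on the number of partitions.
    public static long minSplitsPerFileInput(long fileCount) {
        return fileCount;
    }

    // A combined input format packs files together up to about maxSplitSize bytes.
    // Real packing also depends on block locality, so this is only an estimate.
    public static long approxCombinedSplits(long totalBytes, long maxSplitSize) {
        if (maxSplitSize <= 0) {
            throw new IllegalArgumentException("maxSplitSize must be positive");
        }
        long splits = (totalBytes + maxSplitSize - 1) / maxSplitSize; // ceiling division
        return Math.max(1L, splits);
    }

    public static void main(String[] args) {
        long fileCount = 1234L;                        // from the question
        long totalBytes = (long) (3.5 * 1024 * 1024);  // "about 3.5 MB" from the question
        long maxSplitSize = 1024L * 1024L;             // assumed 1 MB, not stated in the question

        System.out.println("per-file input, at least: " + minSplitsPerFileInput(fileCount));
        System.out.println("combined input, roughly:  " + approxCombinedSplits(totalBytes, maxSplitSize));
    }
}
```

If the number of partitions matters for your job, it is worth checking the actual count on the resulting RDD rather than relying on an estimate like this.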
