Question

I am building an RDD from a text file. Some lines do not match the format I expect, and in that case I return the marker -1.

def myParser(line):
    try:
        # do something
    except:
        return (-1, -1), -1

lines = sc.textFile('path_to_file')
pairs = lines.map(myParser)

Is it possible to remove the lines that carry the -1 marker? If not, what would be a workaround?

Answer 1

The cleanest solution I can think of is to use flatMap to discard the malformed lines:

def myParser(line):
    try:
        # do something
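        return [result]  # where result is the value you want to return
    except Exception:  # narrower than a bare except, which also catches KeyboardInterrupt
        return []  # an empty list makes flatMap emit nothing for this line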

sc.textFile('path_to_file').flatMap(myParser)

See also: What is the equivalent to scala.util.Try in pyspark?
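To make the flatMap idea concrete, here is a minimal local sketch. It assumes a hypothetical input format of three comma-separated integers per line ("a,b,c"), which is not from the question, and it uses a small flat_map helper built on itertools as a stand-in for RDD.flatMap so that it runs without a Spark cluster. The names parse_line, flat_map and sample_lines are illustrative only.

```python
from itertools import chain


def parse_line(line):
    """Return [((a, b), c)] for a well-formed line, or [] for a malformed one.

    The "a,b,c" integer format is an assumption made for this sketch.
    """
    try:
        a, b, c = (int(field) for field in line.split(","))
        return [((a, b), c)]
    except ValueError:
        # Wrong number of fields or a non-integer field: emit nothing.
        return []


def flat_map(func, items):
    """Local stand-in for RDD.flatMap: apply func, then flatten one level."""
    return list(chain.from_iterable(func(item) for item in items))


sample_lines = ["1,2,3", "not a record", "4,5", "6,7,8"]
pairs = flat_map(parse_line, sample_lines)
print(pairs)
```

With PySpark, the same parser would presumably be passed straight to flatMap, as in sc.textFile('path_to_file').flatMap(parse_line). Because a malformed line contributes an empty list, no sentinel value is needed at all.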

You can also filter after the map:

pairs = lines.map(myParser).filter(lambda x: x != ((-1, -1), -1))
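A rough local sketch of the map-then-filter variant, again assuming a hypothetical "a,b,c" integer line format and using plain list comprehensions as stand-ins for map and filter on an RDD. The names SENTINEL and parse_or_sentinel are made up for illustration.

```python
# Marker from the question, returned for lines that fail to parse.
SENTINEL = ((-1, -1), -1)


def parse_or_sentinel(line):
    """Parse "a,b,c" into ((a, b), c); return SENTINEL if the line is malformed."""
    try:
        a, b, c = (int(field) for field in line.split(","))
        return ((a, b), c)
    except ValueError:
        return SENTINEL


sample_lines = ["1,2,3", "oops", "-1,-1,-1", "6,7,8"]

# Stands in for lines.map(parse_or_sentinel)
parsed = [parse_or_sentinel(line) for line in sample_lines]

# Stands in for .filter(lambda x: x != SENTINEL)
kept = [pair for pair in parsed if pair != SENTINEL]
print(kept)
```

Note that the well-formed line "-1,-1,-1" parses to the same value as the sentinel and is dropped along with the malformed one. The filter approach is only safe when the marker can never occur as real data, which is one reason to prefer the flatMap version.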
