Filtering in an RDD

Date: 2019-10-02 07:35:20

Tags: scala

I am starting to learn Scala and Apache Spark. I have the following input file, which has no header.

0,name1,33,385 - first record

1,name2,26,221 - second record

The columns are: unique-id, name, age, friends

1) When I try to keep only the records whose age is not equal to 26, the following code does not work:

def parseLine(x : String) =
  {
    val line = x.split(",").filter(x => x._2 != "26")

  }

I also tried the following. In both cases all values are printed, including 26:

val friends = line(2).filter(x => x != "26")

2) When I try to use the index x._3, it reports an index out of bounds error:

val line = x.split(",").filter(x => x._3 != "221")

Why is index 3 a problem here?

Please find the complete sample code below.

package learning

import org.apache.spark._
import org.apache.log4j._

object Test1 {
  def main(args : Array[String]): Unit =
  {

    val sc = new SparkContext("local[*]", "Test1")
    val lines = sc.textFile("D:\\SparkScala\\abcd.csv")
    Logger.getLogger("org").setLevel(Level.ERROR)
    val testres = lines.map(parseLine)
    testres.take(10).foreach(println)


  }
  def parseLine(x : String) =
  {
    val line = x.split(",").filter(x => x._2 != "33")
    //val line = x.split(",").filter(x => x._3 != "307")
    val age = line(1)
    val friends = line(3).filter(x => x != "307")
    (age,friends)

  }


}

How can I filter on age or friends in a simple way here? And why does index 3 not work?

1 Answer:

Answer 0 (score: 1)

The problem is that you are filtering on an array that represents a single line, rather than on the RDD that contains all the lines. A possible version is shown below (I also created a case class to hold the data from the CSV):
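To see concretely why the original filter prints everything: `x.split(",")` produces the fields of a single line, so a filter on that array can only drop individual fields, never whole records. A minimal plain-Scala sketch of the difference (no Spark required; the sample data is taken from the question, the object name is mine):

```scala
object ArrayVsRecordFilter {
  def main(args: Array[String]): Unit = {
    val record = "1,name2,26,221"          // one CSV line: id,name,age,friends
    val fields = record.split(",")         // Array("1", "name2", "26", "221")

    // Filtering the array only removes matching *fields* from this one line...
    val withoutAge = fields.filter(f => f != "26")
    println(withoutAge.mkString(","))      // 1,name2,221

    // ...whereas the goal is to drop the whole record when its age field is 26,
    // which must happen at the collection-of-records level (the RDD in Spark).
    val records = List("0,name1,33,385", "1,name2,26,221")
    val kept = records.filter(r => r.split(",")(2) != "26")
    println(kept)                          // List(0,name1,33,385)
  }
}
```

The same `filter` call shape works on an RDD of lines, which is what the answer's code below does after parsing each line into a case class.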


package learning

import org.apache.spark._
import org.apache.log4j._

object Test2 {
  // A structured representation of a CSV line
  case class Person(id: String, name: String, age: Int, friends: Int)

  def main(args : Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "Test1")
    Logger.getLogger("org").setLevel(Level.ERROR)

    sc.textFile("D:\\SparkScala\\abcd.csv")  // RDD[String]
      .map(line => parse(line))              // RDD[Person]
      .filter(person => person.age != 26)    // filter out people of 26 years old
      .take(10)                              // collect 10 people from the RDD
      .foreach(println)
  }

  def parse(x : String): Person = {
    // Split the CSV string by comma into an array of strings
    val line = x.split(",")
    // After extracting the fields from the CSV string, create an instance of Person
    Person(id = line(0), name = line(1), age = line(2).toInt, friends = line(3).toInt)
  }
}

Another possibility is to use flatMap() with Option values instead. In that case you can operate directly on a single line, combining the parsing and the filtering in one step.
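The flatMap()/Option idea can be sketched as follows. This is a minimal illustration using a plain Scala List in place of the RDD (the same flatMap call works on an RDD); the object and method names are mine, not from the answer:

```scala
object FlatMapOptionSketch {
  // Parse one CSV line (id,name,age,friends); return None to drop the record.
  // flatMap then keeps only the Some(...) results, so parsing and filtering
  // happen in a single pass over the data.
  def parseIfAgeNot(line: String, ageToDrop: Int): Option[(String, Int, Int)] = {
    val fields = line.split(",")
    val age = fields(2).toInt
    if (age == ageToDrop) None
    else Some((fields(1), age, fields(3).toInt))
  }

  def main(args: Array[String]): Unit = {
    val lines = List("0,name1,33,385", "1,name2,26,221")
    val kept = lines.flatMap(line => parseIfAgeNot(line, 26))
    kept.foreach(println) // prints (name1,33,385)
  }
}
```

On the RDD from the code above, the equivalent call would be `sc.textFile(...).flatMap(line => parseIfAgeNot(line, 26))`.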