I have started learning Scala and Apache Spark. I have the following input file, which has no header:
0,name1,33,385 - first record
1,name2,26,221 - second record
The columns are: unique-id, name, age, friends.
1) The following code does not work when I try to filter for ages not equal to 26:
def parseLine(x : String) =
{
  val line = x.split(",").filter(x => x._2 != "26")
}
I also tried the following. In both cases all values are printed, including 26:
val friends = line(2).filter(x => x != "26")
2) When I try to use the index x._3, it says the index is out of bounds:
val line = x.split(",").filter(x => x._3 != "221")
Why is index 3 a problem here?
Please find the complete sample code below.
package learning

import org.apache.spark._
import org.apache.log4j._

object Test1 {
  def main(args : Array[String]): Unit =
  {
    val sc = new SparkContext("local[*]", "Test1")
    val lines = sc.textFile("D:\\SparkScala\\abcd.csv")
    Logger.getLogger("org").setLevel(Level.ERROR)
    val testres = lines.map(parseLine)
    testres.take(10).foreach(println)
  }

  def parseLine(x : String) =
  {
    val line = x.split(",").filter(x => x._2 != "33")
    //val line = x.split(",").filter(x => x._3 != "307")
    val age = line(1)
    val friends = line(3).filter(x => x != "307")
    (age,friends)
  }
}
How can I filter on age or friends in a simple way here? And why does index 3 not work?
Answer 0 (score: 1)
The problem is that you are filtering on the array that represents a single line, rather than on the RDD that contains all the lines. A possible version follows (I have also created a case class to hold the data from the CSV):
package learning
import org.apache.spark._
import org.apache.log4j._
object Test2 {
  // A structured representation of a CSV line
  case class Person(id: String, name: String, age: Int, friends: Int)

  def main(args : Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "Test1")
    Logger.getLogger("org").setLevel(Level.ERROR)
    sc.textFile("D:\\SparkScala\\abcd.csv")  // RDD[String]
      .map(line => parse(line))              // RDD[Person]
      .filter(person => person.age != 26)    // filter out people of 26 years old
      .take(10)                              // collect 10 people from the RDD
      .foreach(println)
  }

  def parse(x : String): Person = {
    // Split the CSV string by comma into an array of strings
    val line = x.split(",")
    // After extracting the fields from the CSV string, create an instance of Person
    Person(id = line(0), name = line(1), age = line(2).toInt, friends = line(3).toInt)
  }
}
Another possibility is to switch to Option[] values and flatMap(). In that case you can operate directly on a single line, for example:
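The original answer's example for the Option/flatMap approach is not included above. A minimal sketch of what it could look like might be the following; it is an assumption on my part, not the answer author's code, and it uses a plain Scala collection in place of an RDD (flatMap behaves the same way on both). It assumes Scala 2.13+ for toIntOption.

```scala
package learning

object Test3 {
  case class Person(id: String, name: String, age: Int, friends: Int)

  // Return None for malformed lines instead of throwing an exception.
  // toIntOption (Scala 2.13+) yields None when the field is not a number.
  def parse(x: String): Option[Person] = {
    val fields = x.split(",")
    if (fields.length == 4)
      for {
        age     <- fields(2).toIntOption
        friends <- fields(3).toIntOption
      } yield Person(fields(0), fields(1), age, friends)
    else None
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample lines; on a real RDD this would be
    // sc.textFile(...).flatMap(parse).filter(_.age != 26)
    val lines = Seq("0,name1,33,385", "1,name2,26,221", "garbage-line")
    // flatMap drops the None results, so malformed lines disappear silently
    val people = lines.flatMap(parse).filter(_.age != 26)
    people.foreach(println) // only name1 (age 33) survives the filter
  }
}
```

With this shape, a bad line never reaches the filter: parse turns it into None, and flatMap removes it from the result.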