Question

I have the following sample data:

    class: 9
        section: A
            stud : Robert
                subject: maths
                mark  : 69
                subject:science
                mark: 75
            stud : Billy
                subject: maths
                mark  : 69
                subject:science
                mark: 75
            stud : Venice
                subject: maths
                mark  : 69
                subject:science
                mark: 75
            stud : Marc
                subject: maths
                mark  : 69
                subject:science
                mark: 75
    class: 10
        section: A
            stud : Agnes
                subject: maths
                mark  : 69
                subject:science
                mark: 75
            stud : Sarah
                subject: maths
                mark  : 69
                subject:science
                mark: 75
            stud : Scott
                subject: maths
                mark  : 69
                subject:science
                mark: 75
            stud : Alex
                subject: maths
                mark  : 69
                subject:science
                mark: 75
line1
line2
line3
...
line n

I am trying to extract the class 9 student data from this file. Here is my code:

    val datafile = sc.textFile("file.txt").collect().mkString(" ")
    // to take the data I needed from the whole file
    val datpattern = """(class: 9).*?(?=\bline\s)""".r
    val finaldata = datpattern.findAllIn(datafile)
    // student data extraction regex
    val stupattern = """section: (\S+)\s+ stud : ([\w\S]+)\s+ subject: ([\w\S]+)\s+ mark : (\d+)""".r

    val finalresult = finaldata.flatMap { a => stupattern findAllIn a }
                               .map { l =>
                                   val stupattern(section, stuname, sub, mark) = l
                                   (section, stuname, sub, mark)
                               }
                               .foreach(println)

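But this gives me only the first record of each class, and only its first subject & mark (Robert's maths mark and Agnes's maths mark, from the class 9 and class 10 sections).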

I think this is because only the pattern as a whole is being matched.
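
That hypothesis can be checked without Spark. The sketch below is a minimal reproduction under assumptions: `text` is a hypothetical, shortened sample with single spaces, and `stuPattern` is a whitespace-tolerant variant of the pattern above, not the original. Since `section:` occurs only once per class, a pattern that begins with `section:` can match at most once per section, and it captures only the single subject/mark pair it spells out.

```scala
// Hypothetical sample: one section, two students, two subjects each.
val text =
  "class: 9 section: A " +
  "stud : Robert subject: maths mark : 69 subject: science mark : 75 " +
  "stud : Billy subject: maths mark : 69 subject: science mark : 75"

// Same shape as the pattern in the question, with flexible whitespace.
val stuPattern =
  """section:\s*(\S+)\s+stud\s*:\s*(\S+)\s+subject:\s*(\S+)\s+mark\s*:\s*(\d+)""".r

// Collect every match as a (section, student, subject, mark) tuple.
val hits = stuPattern
  .findAllMatchIn(text)
  .map(m => (m.group(1), m.group(2), m.group(3), m.group(4)))
  .toList

// Only Robert's maths mark comes back: Billy has no "section:" in front of him,
// and Robert's second subject is not part of the pattern.
hits.foreach(println)
```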

I tried changing it to allow zero or more occurrences of subject & mark, something like the following (only the changed lines are shown):

    val stupattern = """section: (\S+)\s+ stud : ([\w\S]+)\s+ (subject: ([\w\S]+)\s+ mark : (\d+))*""".r

    val finalresult = finaldata.flatMap { a => stupattern findAllIn a }
                               .map { l =>
                                   val stupattern(section, stuname, {sub, mark}) = l // This doesn't even let me compile
                                   (section, stuname, sub, mark) // This doesn't even let me compile
                               }
                               .foreach(println)

It fails on those two lines with the error "illegal start of pattern".
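
Two separate things seem to be going on here. A Scala regex extractor binds one variable per capturing group, and `{sub,mark}` is not valid pattern syntax, which would explain the compile error. Independently, in `java.util.regex` (which `scala.util.matching.Regex` wraps) a capturing group inside a repetition keeps only the text of its last iteration, so even a version that compiles could not return every pair. A small sketch, assuming a hypothetical one-student string and a simplified pattern:

```scala
// Hypothetical input: one student with two subject/mark pairs.
val oneStudent = "stud : Robert subject: maths mark : 69 subject: science mark : 75"

// The subject/mark part is repeated with *, as in the modified pattern.
val repeated =
  """stud\s*:\s*(\S+)(?:\s+subject:\s*(\S+)\s+mark\s*:\s*(\d+))*""".r

val m = repeated.findFirstMatchIn(oneStudent).get

// Groups 2 and 3 hold only the last iteration of the repeated part,
// so the maths/69 pair is no longer available after matching.
val lastOnly = (m.group(1), m.group(2), m.group(3))
println(lastOnly)
```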

Can someone tell me how to extract the repeating subsets of data from the above? Thanks in advance.
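
One possible approach, sketched here under assumptions rather than offered as a definitive answer, is to stop capturing everything with a single pattern and instead narrow the text in passes: class block, then section, then student, then each subject/mark pair. The `data` string and the names `classBlock`, `sectionBlock`, `studentBlock` and `subjectMark` are hypothetical; `data` stands in for the string produced by `collect().mkString(" ")`, so the sketch runs without Spark.

```scala
// Hypothetical sample in the same shape as the file, already joined with spaces.
val data =
  "class: 9 section: A " +
  "stud : Robert subject: maths mark  : 69 subject:science mark: 75 " +
  "stud : Billy subject: maths mark  : 69 subject:science mark: 75 " +
  "class: 10 section: A " +
  "stud : Agnes subject: maths mark  : 69 subject:science mark: 75 " +
  "line1 line2"

// Pass 1: the class 9 block, ending before the next "class:" or the trailing "line" text.
val classBlock   = """class:\s*9\b(.*?)(?=class:|\bline|$)""".r
// Pass 2: each section name plus everything up to the next section.
val sectionBlock = """section:\s*(\S+)(.*?)(?=section:|$)""".r
// Pass 3: each student name plus everything up to the next student.
val studentBlock = """stud\s*:\s*(\S+)(.*?)(?=stud\s*:|$)""".r
// Pass 4: every subject/mark pair inside one student block.
val subjectMark  = """subject:\s*(\S+)\s+mark\s*:\s*(\d+)""".r

// Nest the passes; each inner pattern only sees the text captured by the outer one.
val rows = for {
  c  <- classBlock.findAllMatchIn(data).toList
  s  <- sectionBlock.findAllMatchIn(c.group(1))
  st <- studentBlock.findAllMatchIn(s.group(2))
  sm <- subjectMark.findAllMatchIn(st.group(2))
} yield (s.group(1), st.group(1), sm.group(1), sm.group(2).toInt)

rows.foreach(println)
```

Because each pass uses `findAllMatchIn`, the number of students per section and subjects per student does not need to be known in advance. The lookaheads rely on the keywords `class:`, `section:` and `stud :` never appearing inside a value, which is an assumption about the data.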

Using regular expressions in Spark

0 answers: