Generating an array of strings from regular expressions applied to an input array of strings

Time: 2018-04-16 23:58:04

Tags: scala

User_Info:Array[String]=("Brian McNamee (Canada) 16th October 2015",    "Claudia Stanzani 18th September 2009", ..)

This is what I am aiming for:

Expected output: Array[(String,String,String)]=Array(("Brian McNamee", "Canada", "16th October 2015"),("Claudia Stanzani", "", "18th September 2009"))

What I tried:

val pattern="(.+)\\((.+)\\)(.+)".r //pattern variable accepts all the RDDs that contain (<country>)
val default_pattern="(.+)\\s(.+)".r //default pattern variable for entries where the country column is empty


val User_profiles= User_Info.map{
  case pattern(name, country, year) => (name, country, year)
  case default_pattern(name, country, year) =>(name, "", year)}

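But the result does not follow the regex patterns I intended for my string array:

Array((Brian McNamee (Canada) 16th October, "", 2015), ("Claudia Stanzani 18th September", "", "2009"))

What is actually going wrong: is the regex defined incorrectly, is it the pattern matching, or both? =)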

2 answers:

Answer 0: (score: 1)

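There are two problems with default_pattern.

  1. It has 2 capture groups, so case default_pattern(name, country, year) will never match. This would work: case default_pattern(name, year) but...
  2. There is no rule that determines where name ends and where year (i.e. the date information) begins. The current pattern puts everything into name except the last space-separated word.

You actually don't need default_pattern at all, although pattern then gets a bit bulky.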

    val pattern =
      """(?x)              # allow regex comments, ignore whitespace
         ([^\d(]+)         # name, no digits or "("
         (?:\((\D+)\)\s*)? # (country), optional, no digits
         (\d\S+)\s+        # day, starts with digit, no spaces
         (\S+)\s+          # month, no spaces
         (\d+)             # year, digits only
      """.r
    
    User_Info.map {
      case pattern(name, country, day, month, year) =>
        // country is null when the optional group did not take part in the match
        (name.trim, Option(country).getOrElse(""), s"$day $month $year")
      case _ => throw new Error
    }
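
The two default_pattern problems listed above can be seen in isolation with a small sketch. It reuses the question's default pattern on one sample string; the object name DefaultPatternDemo and the helpers threeVars and twoVars are illustrative and do not appear in the question.

```scala
object DefaultPatternDemo {
  // The question's default pattern: exactly two capture groups.
  val defaultPattern = "(.+)\\s(.+)".r

  val sample = "Claudia Stanzani 18th September 2009"

  // Problem 1: three variables against two groups, so this case never matches.
  def threeVars(s: String): Option[(String, String, String)] = s match {
    case defaultPattern(name, country, year) => Some((name, country, year))
    case _                                   => None
  }

  // Problem 2: two variables do match, but the greedy first group takes
  // everything except the last space-separated word.
  def twoVars(s: String): Option[(String, String)] = s match {
    case defaultPattern(name, year) => Some((name, year))
    case _                          => None
  }
}
```

Here threeVars(sample) falls through to None, while twoVars(sample) splits the string only at its last space.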
    

Answer 1: (score: 1)

For the default case, skip matching country and instead try to match the date, which (presumably) starts with day (e.g. 15th, 2nd, etc.), as shown below:

    val User_Info: Array[String] = Array(
      "Brian McNamee (Canada) 16th October 2015",
      "Claudia Stanzani 18th September 2009"
    )

    val pattern = """(.*?)\s*\((.*)\)\s*(.*)""".r
    val default_pattern = """(.*?)\s*(\d+st|\d+nd|\d+rd|\d+th)(.*)""".r

    val User_profiles = User_Info.map{
      case pattern(name, country, year) => (name, country, year)
      case default_pattern(name, day, monthyear) => (name, "", day + monthyear)
    }
    // User_profiles: Array[(String, String, String)] = Array(
    //   (Brian McNamee,Canada,16th October 2015), (Claudia Stanzani,"",18th September 2009)
    // )

Note that the regex for the date can be made stricter if needed (e.g. restricting day to 1 to 2 digits, requiring month to be exactly one of the 12 months, year to be 4 digits, etc.).
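
The stricter date matching mentioned in the note might look roughly like the sketch below. It assumes full English month names and ordinal suffixes on the day; the names StrictDates, withCountry, noCountry, and parse are hypothetical and not part of the answer.

```scala
object StrictDates {
  // Assumption: months are always spelled out in full, in English.
  private val months =
    "January|February|March|April|May|June|July|August|September|October|November|December"

  // day: 1-2 digits plus an ordinal suffix; month: one of the 12 names; year: 4 digits.
  private val date = raw"""(\d{1,2}(?:st|nd|rd|th))\s+(${months})\s+(\d{4})"""

  val withCountry = ("""(.*?)\s*\((.*)\)\s*""" + date).r
  val noCountry   = ("""(.*?)\s*""" + date).r

  // Returns None instead of throwing when neither pattern matches.
  // withCountry must be tried first, since noCountry would otherwise
  // absorb "(country)" into the name.
  def parse(s: String): Option[(String, String, String)] = s match {
    case withCountry(name, country, day, month, year) =>
      Some((name, country, s"$day $month $year"))
    case noCountry(name, day, month, year) =>
      Some((name, "", s"$day $month $year"))
    case _ => None
  }
}
```

Applied as User_Info.flatMap(StrictDates.parse), this keeps only the entries whose date fits the stricter shape; whether silently dropping the rest is acceptable depends on the data.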