User_Info:Array[String]=("Brian McNamee (Canada) 16th October 2015", "Claudia Stanzani 18th September 2009", ..)
This is what I want to achieve:
Expected output: Array[(String,String,String)]=Array(("Brian McNamee", "Canada", "16th October 2015"),("Claudia Stanzani", "", "18th September 2009"))
What I tried:
val pattern="(.+)\\((.+)\\)(.+)".r //pattern variable accepts all the RDDs that contain (<country>)
val default_pattern="(.+)\\s(.+)".r //default pattern variable marking the place country column column empty
val User_profiles= User_Info.map{
| case pattern(name, country, year) => (name, country, year)
| case default_pattern(name, country, year) =>(name, "", year)}
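But the result does not follow the regex patterns I intended for my array of strings:
Array(("Brian McNamee (Canada) 16th October", "", "2015"), ("Claudia Stanzani 18th September", "", "2009"))
What is actually going wrong: the regex definition, the pattern matching, or both? =)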
Answer 0 (score: 1)
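There are two problems with default_pattern.
1. It has only two capture groups, so case default_pattern(name, country, year) will never match. This would work: case default_pattern(name, year), but...
2. There are no rules for where name ends and year (i.e. the date info) begins. The current pattern puts everything into name except the final space-separated word.
You actually don't need default_pattern at all, but pattern gets a little bloated.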
val pattern =
  """(?x)                # allow regex comments, ignore whitespace
    ([^\d(]+)            # name, no digits or "("
    (?:\((\D+)\)\s*)?    # (country), optional non-capturing wrapper, no digits
    (\d\S+)\s+           # day, starts with digit, no spaces
    (\S+)\s+             # month, no spaces
    (\d+)                # year, digits only
  """.r
User_Info.map {
  // country is null when the optional "(country)" part is absent
  case pattern(name, country, day, month, year) =>
    (name.trim, Option(country).getOrElse(""), s"$day $month $year")
  case _ => throw new Error
}
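The two problems with default_pattern described in this answer can be reproduced in isolation. The following is only a minimal sketch: the names defaultPattern, line, threeWay and twoWay are illustrative and do not appear in the question.

```scala
// The question's default pattern: two capture groups, both greedy.
val defaultPattern = "(.+)\\s(.+)".r
val line = "Claudia Stanzani 18th September 2009"

// Problem 1: asking for three values from a two-group regex compiles,
// but the extractor can never succeed, so the fallback case is taken.
val threeWay = line match {
  case defaultPattern(_, _, _) => "matched"
  case _                       => "no match"
}

// Problem 2: with the correct arity, the greedy first group swallows
// everything up to the last whitespace-separated word.
val twoWay = line match {
  case defaultPattern(name, rest) => (name, rest)
  case _                          => ("", "")
}

println(threeWay)
println(twoWay)
```

Here twoWay shows that the date is split in the wrong place: only the final word ends up in the second group.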
Answer 1 (score: 1)
For the default-matching case, skip matching country and instead try matching day, which (presumably) starts with a number (e.g. 15th, 2nd, etc.), as shown below:
val User_Info: Array[String] = Array(
  "Brian McNamee (Canada) 16th October 2015", "Claudia Stanzani 18th September 2009"
)
val pattern="""(.*?)\s*\((.*)\)\s*(.*)""".r
val default_pattern="""(.*?)\s*(\d+st|\d+nd|\d+rd|\d+th)(.*)""".r
val User_profiles = User_Info.map{
  case pattern(name, country, year) => (name, country, year)
  case default_pattern(name, day, monthyear) => (name, "", day + monthyear)
}
// User_profiles: Array[(String, String, String)] = Array(
// (Brian McNamee,Canada,16th October 2015), (Claudia Stanzani,"",18th September 2009)
// )
Note that the regex matching for the date could be stricter if necessary (e.g. restricting day to 1 to 2 digits, making month exactly one of the 12 months and year 4 digits, etc.).
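The stricter date matching suggested in the note could look roughly like the sketch below. It is one possible reading, not the answerer's code: the names months, strict and parse are hypothetical, and it assumes English month names and ordinal suffixes (st, nd, rd, th) as in the sample data.

```scala
// Hypothetical stricter pattern; all names here are illustrative.
val months =
  "January|February|March|April|May|June|July|August|" +
  "September|October|November|December"

// name, optional "(country)", 1-2 digit day with suffix, month name, 4-digit year
val strict =
  raw"""(.*?)\s*(?:\(([^)]*)\)\s*)?(\d{1,2}(?:st|nd|rd|th))\s+($months)\s+(\d{4})""".r

// Returns None instead of throwing when a line does not fit the format.
def parse(line: String): Option[(String, String, String)] = line match {
  case strict(name, country, day, month, year) =>
    // country is null when the optional group did not participate
    Some((name, Option(country).getOrElse(""), s"$day $month $year"))
  case _ => None
}

val userInfo = Array(
  "Brian McNamee (Canada) 16th October 2015",
  "Claudia Stanzani 18th September 2009"
)
val profiles = userInfo.flatMap(parse)
profiles.foreach(println)
```

Returning an Option is a design choice made for this sketch; it lets malformed lines be dropped with flatMap rather than aborting the whole map.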