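How can I efficiently parse (without too much code clutter) a statement like the one below? The keywords/delimiters are enclosed in [].
Manager, Delhi [for] Company Pvt Ltd. [from] Jan 2009 [to] Jan 2012.
I want to use parser combinators to extract the person's name, the company name, and the date range from the text. (The expected output is shown at the bottom.)
Here is the code I wrote for the above: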
import scala.util.parsing.combinator.RegexParsers

case class CompanyWithMonthDateRange(company: String, position: String, dateRange: List[MonthYear])
case class MonthYear(month: String, year: Int)

object CompanyParser1 extends RegexParsers {
  override type Elem = Char
  override def skipWhitespace = false

  def keywords: Parser[String] = "for" | "in" | "with" | "at" | "from" | "pvt" | "ltd" | "company" | "co" | "limited" | "inc" | "corporation" | "jan" |
    "feb" | "mar" | "apr" | "may" | "jun" | "jul" | "aug" | "sep" | "nov" | "dec" | "to" | "till" | "until" | "upto"

  val date = ("""\d\d\d\d""".r | """\d\d""".r)
  val integer = ("""(0|[1-9]\d*)""".r) ^^ { _.toInt }
  val comma = ("""\,""".r)
  val quote = ("""[\'\"]+""".r)
  val underscore = ("""\_""".r)
  val dot = ("""\.""".r)
  val space = ("""\s+""".r) ^^ { case _ => "" }
  val colon = (""":""".r)
  val ampersand = ("""(\&|and)""".r)
  val hyphen = ("""\-""".r)
  val brackets = ("""[\(\)]+""".r)
  val newline = ("""[\n\r]""".r)
  val months = ("""(jan|feb|mar|apr|may|jun|jul|aug|sep|nov|dec)""".r)
  val toTillUntil = ("""(to|till|until|upto)""".r)
  val asWord = ("""(as)""".r)
  val fromWord = ("""from""".r)
  val forWithAt = ("""(in|for|with|at)""".r)
  val companyExt = ("""(pvt|ltd|company|co|limited|inc|corporation)""".r)
  val alphabets = not(keywords)~"""[a-zA-Z]+""".r
  val name = not(keywords)~"""[a-zA-Z][a-zA-Z\,\-\'\&\(\)]+\s+""".r

  def possibleCompanyExts = companyExt <~ (dot *) ^^ { _.toString.trim }
  def alphabetsExt = ((alphabets ~ ((quote | ampersand | hyphen | brackets | underscore | comma) *) <~ (space *))+) ^^ { case a => a.toString.trim }
  def companyNameExt = (alphabetsExt <~ (space *) <~ (possibleCompanyExts+)) ^^ { _.toString }
  def companyName = alphabetsExt *
  def entityName = (alphabetsExt+) ^^ { case l => l.map(s => s.trim).mkString(" ") }
  def dateWithEndingChars = date <~ ((comma | quote | dot | newline) *) <~ (space *) ^^ { _.toInt }
  def monthWithEndingChars = months <~ ((comma | quote | dot | newline) *) <~ (space *) ^^ { _.toString }
  def monthWithDate = monthWithEndingChars ~ dateWithEndingChars ^^ { case a~b => MonthYear(a, b) }
  def monthDateRange = monthWithDate ~ (space *) ~ toTillUntil ~ (space *) ~ monthWithDate ^^ { case a~s1~b~s2~c => List(a, c) }

  def companyWithMonthDateRange = (companyNameExt ~ (space *) ~ monthDateRange) ^^ {
    case a~b~c => CompanyWithMonthDateRange(company = a, dateRange = c, position = "")
  }

  def positionWithCompanyWithMonthDateRange = ((name+) ~ (space *) ~ forWithAt ~ (space *) ~ companyWithMonthDateRange) ^^ {
    case a~s1~b~s2~c => c.copy(position = a.mkString(","))
  }

  def apply(input: String) = {
    parseAll(positionWithCompanyWithMonthDateRange, input) match {
      case Success(lup, _) => println(lup)
      case x => println(x)
    }
  }
}
The output should look like:
CompanyWithMonthDateRange(List(((()~Company)~List()), ((()~fd)~List()), ((()~India)~List('))),(()~Manager, ),(()~Delhi ),List(MonthYear(mar,2010), MonthYear(jul,2012)))
Also, how do I remove the unwanted "~" that appears in the output above?
Thanks, Pawan
Answer 0 (score: 0)
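This is not meant as a complete solution to your real problem; it only parses the sentence into the data structure you provided. I am not sure whether it will help, so treat it as a reference.
In CompanyWithMonthDateRange I do not see a field for the extracted name, so I left the name out; adding it should be trivial.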
import scala.util.parsing.combinator.RegexParsers

// Uses the MonthYear and CompanyWithMonthDateRange case classes from the question.
object CompParser extends RegexParsers {
  val For = "[for]"
  val From = "[from]"
  val To = "[to]"
  val Keyword = For | From | To
  // Lazily matches text up to the next "[" or up to a period followed by a line break.
  val Def = """(?m)(?<=^|\]).*?(?=\[|(\.\s*[\n\r]+))""".r
  // The period that terminates a statement.
  val End = """\.""".r

  // Splits "a,b" into ("a", "b"); any other shape yields two empty strings.
  private def splitPair(s: String): (String, String) = {
    val arr = s.split(",")
    if (arr.length == 2) arr(0) -> arr(1) else ("", "")
  }

  val Construct = opt(Def) ~ Keyword ~ Def ^^ {
    case p ~ `For` ~ s  => ("pos&com", (splitPair(p.getOrElse(""))._1, s))
    case _ ~ `From` ~ s => ("from", splitPair(s))
    case _ ~ `To` ~ s   => ("to", splitPair(s))
  }

  val Statement = rep(Construct) ^^ { pairs =>
    val m = pairs.toMap
    if (m.size == 3) {
      val (fromMonth, fromYear) = m("from")
      val (toMonth, toYear) = m("to")
      val (pos, com) = m("pos&com")
      val from = MonthYear(fromMonth, fromYear.trim.toInt)
      val to = MonthYear(toMonth, toYear.trim.toInt)
      Some(CompanyWithMonthDateRange(com, pos, List(from, to)))
    } else None
  }

  val Statements = rep(Statement <~ End)

  def apply(in: String): Unit =
    parseAll(Statements, in) match {
      case Success(r, _) => println(r)
      case failure       => println(failure) // report the failure instead of discarding it
    }
}
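To see how the `Def` pattern carves a statement into segments, it can be exercised on its own with the standard library's `Regex.findPrefixOf`, which anchors the match at the start of the given text, much as `RegexParsers` does at the current input position after skipping whitespace. This is only a minimal sketch: the `DefRegexDemo` object and its `segment` helper are names made up for illustration, and the regex is copied unchanged from `CompParser`.

```scala
object DefRegexDemo {
  // Same pattern as CompParser.Def: lazily take characters until the next "["
  // or until a period that is followed by a line break.
  val Def = """(?m)(?<=^|\]).*?(?=\[|(\.\s*[\n\r]+))""".r

  // Hypothetical helper: returns the segment matched at the start of `text`, if any.
  def segment(text: String): Option[String] = Def.findPrefixOf(text)

  def main(args: Array[String]): Unit = {
    println(segment("Manager, Delhi [for] The Company"))      // Some(Manager, Delhi )
    println(segment("The Company Pvt Ltd. [from] Jan, 2009")) // Some(The Company Pvt Ltd. )
    println(segment("Jan, 2012.\n"))                          // Some(Jan, 2012)
    println(segment("Jan, 2012."))                            // None
  }
}
```

The last call shows a limitation worth knowing: the segment before the final period is only recognised when a line break follows it, which is presumably why the test strings end with a newline. The period in "Ltd." is not mistaken for a statement end because it is followed by a space and "[" rather than a line break.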
CompParser ends each statement at a line break. Here is a test for it:
object TestP extends App {
  val inStr1 = """
Manager, Delhi [for] The Company Pvt Ltd. [from] Jan, 2009 [to] Jan, 2012.
"""
  val inStr2 = """
Manager, Delhi [for] The Company Pvt Ltd. [from] Jan, 2009 [to] Jan, 2012.
Employee, Kate [for] The Company Pvt Ltd. [from] Feb, 2010 [to] Jun, 2012.
HR, Jane [for] The Company Pvt Ltd. [from] May, 2010 [to] July, 2012.
"""
  CompParser(inStr1)
  CompParser(inStr2)
}
The output is:
inStr1:
List(Some(CompanyWithMonthDateRange(The Company Pvt Ltd. ,Manager,List(MonthYear(Jan,2009), MonthYear(Jan,2012)))))
inStr2:
List(Some(CompanyWithMonthDateRange(The Company Pvt Ltd. ,Manager,List(MonthYear(Jan,2009), MonthYear(Jan,2012)))), Some(CompanyWithMonthDateRange(The Company Pvt Ltd. ,Employee,List(MonthYear(Feb,2010), MonthYear(Jun,2012)))), Some(CompanyWithMonthDateRange(The Company Pvt Ltd. ,HR,List(MonthYear(May,2010), MonthYear(July,2012)))))
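On the second part of the question (the unwanted "~"): in scala-parser-combinators, `p ~ q` returns both results wrapped in the `~` case class, and `not(p)` succeeds with `()`, so calling `toString` on such a result prints text like `(()~Company)`. The usual remedies are `~>` and `<~`, which keep only one side of a sequence, plus a `case a ~ b` pattern inside `^^` to unpack whatever remains. The following is a minimal sketch, assuming the scala-parser-combinators module is on the classpath; `TildeDemo` and its members are illustrative names, not taken from the code above, and the grammar is deliberately much smaller than the one in the question.

```scala
import scala.util.parsing.combinator.RegexParsers

final case class MonthYear(month: String, year: Int)

// Illustrative grammar (an assumption, not the asker's full grammar).
object TildeDemo extends RegexParsers {
  val toWord: Parser[String] = """(to|till|until|upto)\b""".r
  val word: Parser[String]   = """[A-Za-z]+""".r
  val year: Parser[Int]      = """\d{4}""".r ^^ (_.toInt)

  // Leaky: not(...) yields (), so each element is `() ~ word`,
  // and toString renders it as "(()~word)".
  val leakyName: Parser[String] = rep1(not(toWord) ~ word) ^^ (_.toString)

  // Clean: `~>` drops the () coming from not(...); the words are joined explicitly.
  val cleanName: Parser[String] = rep1(not(toWord) ~> word) ^^ (_.mkString(" "))

  // `<~` drops the optional comma; `case m ~ y` unpacks the remaining pair.
  val monthYear: Parser[MonthYear] =
    (word <~ opt(",")) ~ year ^^ { case m ~ y => MonthYear(m, y) }

  // `~>` discards the keyword between the two dates.
  val range: Parser[List[MonthYear]] =
    monthYear ~ (toWord ~> monthYear) ^^ { case from ~ to => List(from, to) }

  def main(args: Array[String]): Unit = {
    println(parseAll(leakyName, "Company India").get) // List((()~Company), (()~India))
    println(parseAll(cleanName, "Company India").get) // Company India
    println(parseAll(range, "Jan, 2009 to Jan, 2012").get)
  }
}
```

The same idea applies to the question's `alphabets` and `name` parsers: writing `not(keywords) ~> ...` instead of `not(keywords) ~ ...`, and building strings with `mkString` rather than `toString`, should keep the `~` wrappers out of the result.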