Scala regex help with a UCI dataset

Time: 2015-10-13 06:37:29

Tags: regex scala

Hi everyone, I'm trying to use a Scala regex to parse some data from http://kdd.ics.uci.edu/databases/20newsgroups/20_newsgroups.tar.gz

Below is the text I'm trying to process:

val inputData = """xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51121 soc.motss:139944 rec.scouting:5318
newsgroups: alt.atheism,soc.motss,rec.scouting
path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!wupost!uunet!newsgate.watson.ibm.com!yktnews.watson.ibm.com!watson!watson.ibm.com!strom
from: strom@watson.ibm.com (rob strom)
subject: re: [soc.motss, et al.] "princeton axes matching funds for boy scouts"
sender: @watson.ibm.com
message-id: <1993apr05.180116.43346@watson.ibm.com>
date: mon, 05 apr 93 18:01:16 gmt
distribution: usa
references: <c47efs.3q47@austin.ibm.com> <1993mar22.033150.17345@cbnewsl.cb.att.com> <n4hy.93apr5120934@harder.ccr-p.ida.org>
organization: ibm research
lines: 15

in article <n4hy.93apr5120934@harder.ccr-p.ida.org>, n4hy@harder.ccr-p.ida.org (bob mcgwier) writes:

|> [1] however, i hate economic terrorism and political correctness
|> worse than i hate this policy.  


|> [2] a more effective approach is to stop donating
|> to any organizating that directly or indirectly supports gay rights issues
|> until they end the boycott on funding of scouts.  

can somebody reconcile the apparent contradiction between [1] and [2]?

-- 
rob strom, strom@watson.ibm.com, (914) 784-7641
ibm research, 30 saw mill river road, p.o. box 704, yorktown heights, ny  10598"""

This is the output I need:

in article <n4hy.93apr5120934@harder.ccr-p.ida.org>, n4hy@harder.ccr-p.ida.org (bob mcgwier) writes:

|> [1] however, i hate economic terrorism and political correctness
|> worse than i hate this policy.  


|> [2] a more effective approach is to stop donating
|> to any organizating that directly or indirectly supports gay rights issues
|> until they end the boycott on funding of scouts.  

can somebody reconcile the apparent contradiction between [1] and [2]?

This is what I tried:

val docParser = """([\\s\\S]+\\lines: \\d*)([\\s\\S]*\\n\\n)([\\s\\S]*)""".r
val docParser(metadata, content, footer) = inputText

But I get the following error:

scala.MatchError: [Ljava.lang.String;@62f8fff1 (of class [Ljava.lang.String;)

The regex seems to work in an online regex builder.

Any ideas? :)

1 answer:

Answer 0 (score: 0)

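I've never programmed in Scala before, but from what I see at http://www.tutorialspoint.com/scala/scala_regular_expressions.htm, you have to escape things like the digit class twice.

So \d becomes \\d in Scala (inside an ordinary double-quoted string literal), and so on.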
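A rough sketch of how the escaping plays out, not tested against the full dataset. In an ordinary double-quoted literal the compiler consumes one backslash, so \\d is needed. In a triple-quoted literal like the one in the question, backslashes pass through unchanged, so \\d there reaches the regex engine as an escaped backslash followed by the letter d. The names RegexEscapingSketch, body and sample are made up for illustration, and docParser is the question's pattern rewritten with single backslashes, which is an assumed fix rather than a confirmed one.

```scala
object RegexEscapingSketch {
  // Ordinary string literal: the compiler consumes one backslash,
  // so \\d is needed for the regex engine to see \d.
  val linesEscaped = "lines: (\\d+)".r

  // Triple-quoted literal: backslashes are passed through unchanged,
  // so a single backslash is enough here.
  val linesRaw = """lines: (\d+)""".r

  // The pattern from the question with single backslashes (an assumed fix).
  val docParser = """([\s\S]+lines: \d*)([\s\S]*\n\n)([\s\S]*)""".r

  // Hypothetical, shortened stand-in for one newsgroup message.
  val sample = "from: a@example.com\nlines: 2\n\nhello\nworld\n\n-- \nsignature"

  // Returns the trimmed body, or None when the pattern does not match.
  def body(message: String): Option[String] = message match {
    case docParser(_, content, _) => Some(content.trim)
    case _                        => None
  }

  def main(args: Array[String]): Unit = {
    println(linesEscaped.findFirstMatchIn(sample).map(_.group(1)))
    println(linesRaw.findFirstMatchIn(sample).map(_.group(1)))
    println(body(sample))
  }
}
```

Separately, [Ljava.lang.String; in the MatchError is the JVM name for Array[String], so it may be that inputText held an array of strings rather than a single String. That is only a guess from the error text; if it is the case, the pattern would have to be applied to one String at a time.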