I am parsing documents that contain large amounts of formatted numbers, for example:
Frc consts -- 1.4362 1.4362 5.4100
IR Inten -- 0.0000 0.0000 0.0000
Atom AN X Y Z X Y Z X Y Z
1 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2 1 0.40 -0.20 0.23 -0.30 -0.18 0.36 0.06 0.42 0.26
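These are separate lines, all of which have significant leading whitespace and may or may not have significant trailing whitespace. They consist of 72, 72, 78, 78 and 78 characters respectively. I can deduce the boundaries between fields. The lines can be described with Fortran formats (nx = n spaces, an = n alphanumeric characters, in = an integer in n columns, fm.n = a float of m characters with n places after the decimal point):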
(1x,a14,1x,f10.4,13x,f10.4,13x,f10.4)
(1x,a14,1x,f10.4,13x,f10.4,13x,f10.4)
(1x,a4,a4,3(2x,3a7))
(1x,2i4,3(2x,3f7.2))
(1x,2i4,3(2x,3f7.2))
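I have potentially several thousand different formats (which I can generate automatically or farm out), and I am describing them with regexes built from regexes for the components. Thus, if regf10_4 represents a regex for any string that satisfies the f10.4 constraint, I can create a regex of the form: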
COMMENTS
(\s
.{14}
\s
regf10_4,
\s{13}
regf10_4,
\s{13}
regf10_4,
)
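The composition described above can be sketched in Java 1.6 with plain string constants. This is only one possible arrangement, and the names REG_A14, REG_F10_4, splitLine, looksNumeric and sampleLine are hypothetical. It assumes each fragment merely captures a fixed number of characters and that a second pattern checks the content of a captured slice afterwards.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ComposedLineRegex {
    // Hypothetical reusable fragments: each captures a fixed number of characters.
    static final String REG_A14 = "(.{14})";
    static final String REG_F10_4 = "(.{10})";

    // (1x,a14,1x,f10.4,13x,f10.4,13x,f10.4) assembled from the fragments.
    static final Pattern LINE = Pattern.compile(
            "\\s" + REG_A14 + "\\s" + REG_F10_4
            + "\\s{13}" + REG_F10_4 + "\\s{13}" + REG_F10_4);

    // Assumed content check for one slice: blank, a space-padded number, or asterisks.
    static final Pattern NUMERIC_SLICE = Pattern.compile(
            " *(?:[-+]?(?:[0-9]+(?:\\.[0-9]*)?|\\.[0-9]+)(?:[eE][-+]?[0-9]+)?)? *|\\*+");

    // Returns the four raw field slices, or null if the line does not fit the layout.
    static String[] splitLine(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    // True if the slice is blank, a padded number, or all asterisks.
    static boolean looksNumeric(String slice) {
        return NUMERIC_SLICE.matcher(slice).matches();
    }

    // Builds a 72-character sample line with the column layout described above.
    static String sampleLine() {
        return String.format(Locale.ROOT, " %-14s %10.4f%13s%10.4f%13s%10.4f",
                "Frc consts  --", 1.4362, "", 1.4362, "", 5.41);
    }
}
```

Because every fragment has a fixed width, the capture groups always line up with the columns, and the per-field check can be swapped without touching the line pattern.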
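I would like to know whether there are regexes that support re-use in this way. Computers and humans produce a wide variety of numbers that are compatible with, say, f10.4. I believe the following are all legal input and/or output for Fortran (I do not need suffixes of the form f or d, as in 12.4f). [The formatting in SO should be read as no leading spaces for the first, one for the second, and so on.]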
-1234.5678
1234.5678
// missing number
12345678.
1.
1.0000000
1.0000
1.
0.
0.
.1234
-.1234
1E2
1.E2
1.E02
-1.0E-02
********** // number over/underflow
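They must be robust against the content of the neighbouring fields (e.g. only examine precisely 10 characters at a precise position). Thus the following are legal for (a1,f5.2,a1):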
a-1.23b // -1.23
- 1.23. // 1.23
3 1.23- // 1.23
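One way to examine only a precise window of characters with java.util.regex is Matcher.region, which has been available since Java 1.5. The sketch below is an illustration rather than a definitive solution; the class name RegionFieldMatcher, the method numberAt and the pattern PADDED_NUMBER are hypothetical. It assumes a field is a number with optional space padding and returns null for anything else, including blank or asterisk-filled fields.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionFieldMatcher {
    // Assumed shape of a numeric field: optional spaces, a number, optional spaces.
    static final Pattern PADDED_NUMBER = Pattern.compile(
            " *([-+]?(?:[0-9]+(?:\\.[0-9]*)?|\\.[0-9]+)(?:[eE][-+]?[0-9]+)?) *");

    // Returns the number text found in columns [start, start + width),
    // or null if that window is not a space-padded number.
    static String numberAt(String line, int start, int width) {
        if (line.length() < start + width) {
            return null;
        }
        Matcher m = PADDED_NUMBER.matcher(line);
        m.region(start, start + width); // restrict matching to the field's columns
        return m.matches() ? m.group(1) : null;
    }
}
```

With the three lines above, numberAt(line, 1, 5) looks only at columns 1 to 5, so the characters in the neighbouring a1 fields never take part in the match.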
I am using Java, so I need regex constructs that are compatible with Java 1.6 (e.g. no Perl extensions).
Answer 0 (score: 2)
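As I understand it, each line consists of one or more fixed-width fields, which may contain labels, spaces, or data of different kinds. If you know the widths and types of the fields, extracting their data is a simple matter of substring(), trim() and (optionally) Whatever.parseWhatever(). Regexes cannot make this job any easier; in fact, all they can do is make it a lot harder.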
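The substring(), trim() and parse approach might look like the following for the (1x,2i4,3(2x,3f7.2)) layout from the question. This is a minimal sketch: the class FixedWidthSlicer and the methods field, displacements and sampleLine are hypothetical names, and it assumes every numeric field is filled in (a blank or asterisk-filled field would make Double.parseDouble throw NumberFormatException).

```java
import java.util.Locale;

class FixedWidthSlicer {
    // Cuts out columns [start, start + width) and trims the padding.
    // Lines that lost their trailing spaces yield a shorter or empty field.
    static String field(String line, int start, int width) {
        int end = Math.min(line.length(), start + width);
        return start >= end ? "" : line.substring(start, end).trim();
    }

    // Reads the nine f7.2 values of a (1x,2i4,3(2x,3f7.2)) line.
    static double[] displacements(String line) {
        double[] out = new double[9];
        int pos = 1 + 4 + 4; // skip 1x and the two i4 fields
        int k = 0;
        for (int group = 0; group < 3; group++) {
            pos += 2; // the 2x in front of each group of three
            for (int i = 0; i < 3; i++) {
                out[k++] = Double.parseDouble(field(line, pos, 7));
                pos += 7;
            }
        }
        return out;
    }

    // Builds a 78-character sample line with the same layout.
    static String sampleLine() {
        StringBuilder sb = new StringBuilder(String.format(Locale.ROOT, " %4d%4d", 2, 1));
        double[] v = { 0.40, -0.20, 0.23, -0.30, -0.18, 0.36, 0.06, 0.42, 0.26 };
        for (int i = 0; i < v.length; i++) {
            if (i % 3 == 0) {
                sb.append("  ");
            }
            sb.append(String.format(Locale.ROOT, "%7.2f", v[i]));
        }
        return sb.toString();
    }
}
```

The two integer fields can be read the same way, for example Integer.parseInt(field(line, 1, 4)) for the atom index.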
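Scanner does not really help either. True, it has predefined regexes for various value types and it does the conversions for you, but it still needs to be told which type to look for each time, and it needs the fields to be separated by delimiters it can recognize. Fixed-width data, by definition, does not require delimiters. You might be able to fake the delimiters by doing a lookahead for however many characters should be left in the line, but that is just another way of making the job harder than it needs to be.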
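It sounds like performance is going to be a major concern; even if you could make a regex solution work, it would probably be too slow. That is not because regexes are inherently slow, but because of the contortions you would have to go through to make them fit the problem. I suggest you forget about regexes for this job.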
Answer 1 (score: 1)
You could start with this and go from there.
This regex matches all the numbers you have provided. Unfortunately, it also matches the 3 in "3 1.23-".
// [-+]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][-+]?[0-9]+)?
//
// Match a single character present in the list “-+” «[-+]?»
// Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
// Match the regular expression below «(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)»
// Match either the regular expression below (attempting the next alternative only if this one fails) «[0-9]+(?:\.[0-9]*)?»
// Match a single character in the range between “0” and “9” «[0-9]+»
// Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
// Match the regular expression below «(?:\.[0-9]*)?»
// Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
// Match the character “.” literally «\.»
// Match a single character in the range between “0” and “9” «[0-9]*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// Or match regular expression number 2 below (the entire group fails if this one fails to match) «\.[0-9]+»
// Match the character “.” literally «\.»
// Match a single character in the range between “0” and “9” «[0-9]+»
// Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
// Match the regular expression below «(?:[eE][-+]?[0-9]+)?»
// Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
// Match a single character present in the list “eE” «[eE]»
// Match a single character present in the list “-+” «[-+]?»
// Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
// Match a single character in the range between “0” and “9” «[0-9]+»
// Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Pattern regex = Pattern.compile("[-+]?(?:[0-9]+(?:\\.[0-9]*)?|\\.[0-9]+)(?:[eE][-+]?[0-9]+)?");
Matcher matcher = regex.matcher(document);
while (matcher.find()) {
// matched text: matcher.group()
// match start: matcher.start()
// match end: matcher.end()
}
Answer 2 (score: 0)
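This is only a partial answer, but I have been alerted to Scanner in Java 1.5, which can scan text and interpret numbers; its documentation gives a BNF for the numbers that this utility can scan and interpret. In principle, I imagine the BNF could be used to construct a regex.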
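As a rough illustration of the Scanner idea, the sketch below applies Scanner to a single fixed-width slice instead of the whole line. The class ScannerFieldReader and the method readDouble are hypothetical names, and Locale.US is an assumption made so that "." is the decimal separator. Scanner reads only the first whitespace-delimited token in the slice, so this is looser than an exact f5.2 check.

```java
import java.util.Locale;
import java.util.Scanner;

class ScannerFieldReader {
    // Returns the first double Scanner finds in columns [start, start + width),
    // or null if the slice is empty or its first token is not a number.
    static Double readDouble(String line, int start, int width) {
        int end = Math.min(line.length(), start + width);
        if (start >= end) {
            return null;
        }
        Scanner sc = new Scanner(line.substring(start, end));
        sc.useLocale(Locale.US); // assumption: "." is the decimal separator
        try {
            return sc.hasNextDouble() ? Double.valueOf(sc.nextDouble()) : null;
        } finally {
            sc.close();
        }
    }
}
```

Slicing first keeps the neighbouring fields out of the picture, which Scanner on its own cannot do without delimiters.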