Extracting two different strings from a paragraph using a Tcl regular expression

Time: 2010-05-07 08:53:57

Tags: tcl

I need to extract two different numbers, each preceded by a different string: Employee Id --> Employee16 (I need 16) and Employee links --> Employee links: 2 (I need 2). The source string looks like this:

Employee16, Employee name is QueenRose
  Working for 46w0d
  Billing is Distributed
  65537 assigned tasks, 0 reordered, 0 unassigned
  0 discarded, 0 lost received, 5/255 load
  received sequence unavailable, 0xC2E7 sent sequence
  Employee links: 2 active, 0 inactive (max not set, min not set)
    Dt3/5/10:0, since 46w0d, no tasks pending
    Dt3/5/10:10, since 21w0d, no tasks rcvd
 Employee is currently working in Hardware section.

Employee19, Employee name is Edward11
  Working  for 48w4d
  Billing is Distributed
  206801498 assigned tasks, 0 reordered, 0 unassigned
  655372 discarded, 0 lost received, 9/255 load
  received sequence unavailable, 0x23CA sent sequence
  Employee links: 7 active, 0 inactive (max not set, min not set)
    Dt3/5/10:0, since 47w2d, tasks pending
    Dt3/5/10:10, since 28w6d, no tasks pending
    Dt3/5/10:11, since 18w4d, no tasks pending
    Dt3/5/10:12, since 18w4d, no tasks pending
    Dt3/5/10:13, since 18w4d, no tasks pending
    Dt3/5/10:14, since 18w4d, no tasks pending
    Dt3/5/10:15, since 7w2d, no tasks pending
   Employee is currently working in Hardware sectione.

Employee6 (inactive)
  Employee links: 2
    Dt3/5/10:0 (inactive)
    Dt3/5/10:10 (inactive)

Employee7 (inactive)
  Employee links: 2
    Dt3/5/10:0 (inactive)
    Dt3/5/10:10 (inactive)

I tried the following:

Employee(\d+)[^\n\r]*[^M]*Employee links:\s+(\d+)

The expected output is:

16  2
19  7
 6  2
 7  2

But it does not list all of the IDs and links. Can anyone help me solve this?

2 answers:

Answer 0 (score: 2)

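The easiest approach is to extract the two values from their two different places in two separate matching steps. It is easiest by far if you first split the whole text into paragraphs, one per employee.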

  

Employee Id --> Employee16 (I need 16)

I'd extract that one like this:

regexp -line {^Employee(\d+)} $paragraph -> employeeNumber

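(You want line-matching mode for this task, where ^ and $ match at line boundaries, rather than the default "whole string" matching mode.)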
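
As a minimal sketch of this step, assuming `paragraph` already holds the record for a single employee (the sample below is abbreviated from the question), the return value of `regexp` can be tested before the captured variable is used. The pattern here leaves out the trailing comma so that headers such as `Employee6 (inactive)` would also match; that is a choice made for this sketch.

```tcl
# Sample record, abbreviated from the question; assumed to be one paragraph.
set paragraph {Employee16, Employee name is QueenRose
  Working for 46w0d
  Employee links: 2 active, 0 inactive (max not set, min not set)}

# -line makes ^ and $ match at line boundaries rather than only at the
# start and end of the whole string.
if {[regexp -line {^Employee(\d+)} $paragraph -> employeeNumber]} {
    puts "employee number: $employeeNumber"
} else {
    puts "no employee header found"
}
```

The indented `Employee links:` line does not interfere, because `^` must be followed immediately by `Employee` and then a digit.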

  

Employee links --> Employee links: 2 (I need 2)

For this one, again assuming that we are only looking at the record for a single employee:

regexp -line {^\s+Employee links:\s*(\d+)(.*)$} $paragraph -> links rest

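Here I have extracted not only $links but also the $rest of the line, since it seems likely that you will need to look at it to decide whether it matters. Of course, the following may be more useful: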

regexp -line {^\s+Employee links:\s*(\d+)(?:\s+active,\s+(\d+)\s+inactive)?} \
        $paragraph -> activeLinks inactiveLinks

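In this case, $inactiveLinks will hold an empty string if only the first number is present (which seems to happen when the employee is inactive; you will need some trivial logic to tidy up in that case).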
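
The tidy-up logic might look like the sketch below. The proc name `linkCounts` is hypothetical, and treating a missing inactive count as 0 is an assumption made only for illustration; the real meaning of the lone number on an inactive record depends on your data.

```tcl
# Hypothetical helper: return {first second} link counts for one record,
# or an empty list when the record has no "Employee links:" line.
proc linkCounts {paragraph} {
    set pattern {^\s+Employee links:\s*(\d+)(?:\s+active,\s+(\d+)\s+inactive)?}
    if {![regexp -line -- $pattern $paragraph -> activeLinks inactiveLinks]} {
        return {}
    }
    # An optional group that did not take part in the match leaves its
    # variable set to the empty string.
    if {$inactiveLinks eq ""} {
        set inactiveLinks 0  ;# assumption: a missing count means zero
    }
    return [list $activeLinks $inactiveLinks]
}

# Two records abbreviated from the question: one active, one inactive.
set active "Employee19, Employee name is Edward11\n  Employee links: 7 active, 0 inactive (max not set, min not set)"
set inactive "Employee6 (inactive)\n  Employee links: 2"

puts [linkCounts $active]
puts [linkCounts $inactive]
```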

Finally, when using regexp, don't forget to check whether it actually matched (it returns 0 when nothing matches)! Hope this helps.
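
Putting the steps together, a rough end-to-end sketch could look like this. The proc names `splitParagraphs` and `parseEmployee` are made up for this example, the sample text is abbreviated from the question, and the record-separator character used for splitting is assumed not to occur in the data.

```tcl
# Hypothetical helper: split text into paragraphs separated by blank lines.
proc splitParagraphs {text} {
    set text [string trim $text]
    # Replace each run of blank lines with one record-separator character
    # (assumed absent from the data), then split on that character.
    regsub -all {\n(?:[ \t]*\n)+} $text "\x1e" text
    return [split $text "\x1e"]
}

# Hypothetical helper: return {id links} for one record, or an empty
# list if either piece is missing.
proc parseEmployee {paragraph} {
    if {![regexp -line {^Employee(\d+)} $paragraph -> id]} {
        return {}
    }
    if {![regexp -line {^\s+Employee links:\s*(\d+)} $paragraph -> links]} {
        return {}
    }
    return [list $id $links]
}

set text {Employee16, Employee name is QueenRose
  Working for 46w0d
  Employee links: 2 active, 0 inactive (max not set, min not set)
    Dt3/5/10:0, since 46w0d, no tasks pending

Employee19, Employee name is Edward11
  Employee links: 7 active, 0 inactive (max not set, min not set)

Employee6 (inactive)
  Employee links: 2
    Dt3/5/10:0 (inactive)
}

foreach paragraph [splitParagraphs $text] {
    set result [parseEmployee $paragraph]
    if {[llength $result] == 2} {
        puts [format "%2s  %s" [lindex $result 0] [lindex $result 1]]
    }
}
```

Each record is handled on its own, so a greedy match in one paragraph cannot run on into the next, and records that lack either value are simply skipped.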

Answer 1 (score: 0)

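I was going to provide a complete answer, but then I read Donal's much more helpful tutorial and felt I couldn't add to it. I will just show how to split the text into paragraphs: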

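# Each match must end in two or more newlines, so the final record is
# picked up only if $text ends with a blank line.
foreach paragraph [regexp -all -inline {.*?\n{2,}} $text] {
    puts $paragraph  ;# placeholder: do something with $paragraph here
}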

In your attempt I see [^\n\r]* - are you sure your text contains carriage returns as well as newlines?
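
One way to find out, and to sidestep the issue, is to look for carriage returns and normalise them before matching. This is only a sketch; the sample string with CRLF line endings is invented for the example.

```tcl
# Hypothetical sample with Windows-style (CRLF) line endings.
set text "Employee6 (inactive)\r\n  Employee links: 2\r\n"

# string first returns -1 when the substring is absent.
if {[string first "\r" $text] >= 0} {
    puts "carriage returns found; converting to plain newlines"
    # Map CRLF pairs first, then any stray lone CR.
    set text [string map [list "\r\n" "\n" "\r" "\n"] $text]
}

# With only \n left, patterns need to consider just the newline character.
regexp -line {^\s+Employee links:\s*(\d+)} $text -> links
puts $links
```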