Question

I have a file like the following (this layout, but much longer):

LSE           ZTX                       
    SWX         ZURN                    
LSE           ZYT
NYSE                            CGI

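Each line contains two words (e.g. LSE ZTX), with optional spaces and/or tabs at the start, in between and at the end. Can someone help me match these two words with a regexp? Going by this example, I would like LSE in $1 and ZTX in $2 for the first line, SWX in $1 and ZURN in $2 for the second line, and so on. I have tried things like: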

$line =~ /(\t|\s)*?(.*?)(\t|\s)*?(.*?)/msgi;
$line =~ /[\t*\s*]?(.*?)[\t*\s*]?(.*?)/msgi;

I don't know how to say that there may be spaces or tabs (or both mixed together, e.g. \t\s\t).

Answer 1

There are always two words and you don't need to match the whole line, so the simplest regex is:

/(\w+)\s+(\w+)/

Answer 2

If you only want to match the first two words, the most basic approach is to match any sequence of non-whitespace characters:

my ($word1, $word2) = $line =~ /\S+/g;

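This captures the first two words in $line into the variables, if they exist. Note that capturing parentheses are not needed when using the /g modifier: in list context the match returns every matched substring. Use an array instead if you want to capture all existing matches.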
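To make the list-context behaviour concrete, here is a minimal sketch. The helper name first_two_words is made up for illustration, and the input lines are modelled on the question's sample.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first two whitespace-delimited words of a line.
sub first_two_words {
    my ($line) = @_;
    # In list context, a /g match without capture groups returns every match;
    # assigning to two scalars keeps the first two and discards the rest.
    my ($word1, $word2) = $line =~ /\S+/g;
    return ($word1, $word2);
}

my ($w1, $w2) = first_two_words("    SWX\t\tZURN   \n");
print "$w1 $w2\n";                 # prints "SWX ZURN"

# Assigning to an array keeps every word on the line.
my @all_words = "  SWX \t ZURN  extra" =~ /\S+/g;
print scalar(@all_words), "\n";    # prints 3
```

If a line has fewer than two words, the missing variable is simply left undefined, so check with defined before using it.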

Answer 3

\s also covers tabs, so your regex would look like this:

$line =~ /^\s*([A-Z]+)\s+([A-Z]+)/;

The first word is in the first group ($1) and the second word is in $2.

You can change [A-Z] to something more suitable if you need to.

Here is the explanation from YAPE::Regex::Explain:

The regular expression:

(?-imsx:^\s*([A-Z]+)\s+([A-Z]+))

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [A-Z]+                   any character of: 'A' to 'Z' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    [A-Z]+                   any character of: 'A' to 'Z' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

Answer 4

I think this is what you want:

^\s*([A-Z]+)\s+([A-Z]+)

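See it here on Regexr: you will find the first code of each row in group 1 and the second code in group 2. \s is a whitespace character; it includes, for example, spaces, tabs and newlines.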

In Perl it looks like this:

($code1, $code2) = $line =~ /^\s*([A-Z]+)\s+([A-Z]+)/i;

I assume you are reading the text file line by line, so you don't need the s and m modifiers, and you don't need g either.

If the codes are not only ASCII letters, replace [A-Z] with \p{L}. \p{L} is a Unicode property that matches any letter from any language.
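A small sketch of the \p{L} variant, in case it helps. The helper name parse_letters and the sample codes with umlauts are invented for illustration; the question's data is ASCII only.

```perl
use strict;
use warnings;
use utf8;    # this source file contains non-ASCII string literals

# Hypothetical helper: like the [A-Z] pattern, but accepting any Unicode letter.
sub parse_letters {
    my ($line) = @_;
    # In list context a successful match returns ($1, $2); a failed one returns ().
    return $line =~ /^\s*(\p{L}+)\s+(\p{L}+)/;
}

binmode(STDOUT, ':encoding(UTF-8)');    # encode output so non-ASCII letters print cleanly
my ($code1, $code2) = parse_letters("  BÖRSE\tÄBC  ");
print "$code1 $code2\n";
```

Keep in mind that \p{L} also matches lowercase letters, so it is looser than [A-Z] even for plain ASCII input.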

Answer 5

With the "Multiline" option selected, this regex:

^\s*(?<word1>\S+)\s+(?<word2>\S+)\s*$

will give you N matches, each containing 2 named groups: word1 and word2.
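In Perl, the "Multiline" option corresponds to the /m modifier, and named groups are read from the %+ hash (Perl 5.10 or later). Here is a rough sketch that assumes the whole file is already in one string and that every line really has two words; the helper name word_pairs is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: collect [word1, word2] pairs from a multi-line string.
sub word_pairs {
    my ($text) = @_;
    my @pairs;
    # /m lets ^ and $ match at each line boundary; /g in a while loop
    # walks through the string one match at a time.
    while ($text =~ /^\s*(?<word1>\S+)\s+(?<word2>\S+)\s*$/mg) {
        push @pairs, [ $+{word1}, $+{word2} ];
    }
    return @pairs;
}

my $text = "LSE           ZTX    \n    SWX         ZURN  \nNYSE        CGI\n";
for my $pair (word_pairs($text)) {
    print "$pair->[0] $pair->[1]\n";
}
```

Because \s also matches a newline, a line holding only one word could get paired with the first word of the next line, so this is only safe when the two-words-per-line assumption holds.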

Answer 6

^\s*([A-Z]{3,4})\s+([A-Z]{3,4})$

Here is what it does:

^             // Matches the beginning of a string
\s*           // Matches a space/tab character zero or more times
([A-Z]{3,4})  // Matches any letter A-Z either 3 or 4 times and captures to $1
\s+           // Then matches at least one tab or space
([A-Z]{3,4})  // Matches any letter A-Z either 3 or 4 times and captures to $2
$             // Matches the end of a string
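One caveat: the question says lines may end with spaces or tabs, and with $ placed directly after the second group such lines will not match. The sketch below adds \s* before the $; that change and the helper name parse_codes are my additions, not part of the pattern as posted.

```perl
use strict;
use warnings;

# Hypothetical helper: the pattern above with \s* added before the final $.
sub parse_codes {
    my ($line) = @_;
    # In list context a successful match returns ($1, $2); a failed one returns ().
    return $line =~ /^\s*([A-Z]{3,4})\s+([A-Z]{3,4})\s*$/;
}

my $line = "LSE           ZTX                       ";    # trailing spaces, as in the sample

my $strict = $line =~ /^\s*([A-Z]{3,4})\s+([A-Z]{3,4})$/ ? "match" : "no match";
print "without trailing \\s*: $strict\n";                 # prints "no match"

my ($code1, $code2) = parse_codes($line);
print "with trailing \\s*: $code1 $code2\n";              # prints "LSE ZTX"
```

A single trailing newline is not a problem either way, since $ in Perl also matches just before a newline at the end of the string.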

Answer 7

You can use split here:

use strict;
use warnings;

while (<DATA>) {
    my ( $word1, $word2 ) = split;
    print "($word1, $word2)\n";
}

__DATA__
LSE         ZTX                       
    SWX         ZURN                    
LSE         ZYT
NYSE                            CGI

Output:

(LSE, ZTX)
(SWX, ZURN)
(LSE, ZYT)
(NYSE, CGI)

Answer 8

Assuming the whitespace at the start of the line is what you use to identify the codes you need, try the following:

Split your string at newlines, then try this regex:

^\s+(\w+\s+){2}$

This only matches lines that begin with some whitespace, followed by (word, some whitespace, word), and that end with some whitespace.

# ^           --> String start
# \s+         --> One or more whitespace characters
# (\w+\s+){2} --> A (word followed by some space)x2
# $           --> String end.
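Note that a quantified group such as (\w+\s+){2} only remembers its last repetition, so this pattern is useful for validating a line but not for extracting both codes. A quick sketch to show the effect (the helper name last_repetition is made up):

```perl
use strict;
use warnings;

# Hypothetical helper: return what group 1 holds after matching the validating pattern.
sub last_repetition {
    my ($line) = @_;
    return unless $line =~ /^\s+(\w+\s+){2}$/;
    # $1 holds only the final repetition, including its trailing whitespace.
    (my $word = $1) =~ s/\s+$//;
    return $word;
}

print last_repetition("    SWX         ZURN                    "), "\n";    # prints "ZURN"
```

SWX is matched by the first repetition but is overwritten in $1 by the second.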

However, if you want to capture the codes individually, try this:

$line =~ /^\s*(\w+)\s+(\w+)/;

# \s*   --> Zero or more whitespace,
# (\w+) --> Followed by a word (group #1),
# \s+   --> Followed by some whitespace,
# (\w+) --> Followed by a word (group #2),

Answer 9

This will match all of your codes:

/[A-Z]+/

