I want to extract part of a folder name of the form 123456_Letters_CAPITAL_name_extension.
The name can be LETTERS123, LETTERS_123_LETTERS, or LETTERS_LETTERS_123.
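Currently I am using unlist(strsplit(foldername, split="_"))[4:(length(unlist(strsplit(foldername, split="_")))-1)],
but I would also like to be able to extract the name when the _CAPITAL part is absent (the start index would then be 3 instead of 4, but I would like a general way of doing it).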
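130615_Screen_II_SN_KB_3_lxb/, 130615_Screen_II_AL343_lxb/, and 130615_Screen_II_HK_344_LM_lxb/ are representative examples of full folder names.
I have tried but could not come up with a regular expression that does this. Any ideas would be helpful.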
Answer 0 (score: 2)
How about this:
^\d+_[a-zA-Z]+_(?:[A-Z]+_)?([A-Z]+\w+)_[^_]+$
The name will be in capture group 1.
A Perl script to test it:
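use strict;
use warnings;
use feature 'say';   # say() is not enabled by default

my $re = qr~^\d+_[a-zA-Z]+_(?:[A-Z]+_)?([A-Z]+\w+)_[^_]+$~;
while (<DATA>) {
    chomp;
    say $1 if /$re/;   # print capture group 1 for each matching line
}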
__DATA__
130615_Screen_II_SN_KB_3_lxb/
130615_Screen_II_AL343_lxb/
130615_Screen_II_HK_344_LM_lxb/
130615_Screen_HK_344_LM_lxb/
Output:
SN_KB_3
AL343
HK_344_LM
HK_344_LM
Explanation:
The regular expression:
^\d+_[a-zA-Z]+_(?:[A-Z]+_)?([A-Z]+\w+)_[^_]+$
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  \d+                      digits (0-9) (1 or more times (matching
                           the most amount possible))
----------------------------------------------------------------------
  _                        '_'
----------------------------------------------------------------------
  [a-zA-Z]+                any character of: 'a' to 'z', 'A' to 'Z'
                           (1 or more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  _                        '_'
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [A-Z]+                   any character of: 'A' to 'Z' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    _                        '_'
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [A-Z]+                   any character of: 'A' to 'Z' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  _                        '_'
----------------------------------------------------------------------
  [^_]+                    any character except: '_' (1 or more times
                           (matching the most amount possible))
----------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
----------------------------------------------------------------------
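Since the question is about R, here is a minimal sketch of applying the same pattern there. It is an illustration rather than part of the original answer: the variable names `folders`, `pattern`, and `name_part` are made up for this example, and it assumes base R's `sub()` with `perl = TRUE` so that `\d`, `\w`, and `(?:...)` are handled by PCRE.

```r
# Folder names taken from the question.
folders <- c("130615_Screen_II_SN_KB_3_lxb/",
             "130615_Screen_II_AL343_lxb/",
             "130615_Screen_II_HK_344_LM_lxb/")

# Same pattern as in the answer; backslashes are doubled inside an R string.
pattern <- "^\\d+_[a-zA-Z]+_(?:[A-Z]+_)?([A-Z]+\\w+)_[^_]+$"

# sub() replaces the whole match with capture group 1.
# Strings that do not match the pattern are returned unchanged.
name_part <- sub(pattern, "\\1", folders, perl = TRUE)
print(name_part)
## [1] "SN_KB_3"   "AL343"     "HK_344_LM"
```

One caveat worth checking against real data: because the optional `(?:[A-Z]+_)?` group is tried first, a folder without the CAPITAL part whose name starts with two letter-only tokens (for example 130615_Screen_SN_KB_3_lxb/) would yield KB_3 rather than SN_KB_3.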
Answer 1 (score: 1)
Let's solve the problem step by step. The vector x below covers all the possible cases described. (OP, please confirm.)
x <- c("123456_Letters_CAPITAL_LETTERS123_extension/", "123456_Letters_CAPITAL_LETTERS_123_LETTERS_extension/", "123456_Letters_CAPITAL_LETTERS_LETTERS_123_extension/",
"123456_Letters_LETTERS123_extension/", "123456_Letters_LETTERS_123_LETTERS_extension/", "123456_Letters_LETTERS_LETTERS_123_extension/")
# First, strip out the parts that can be removed unambiguously:
# the leading "digits_Letters_" prefix and the trailing "_extension/".
y <- gsub("[0-9]+_[A-Z]+[a-z]*_(.*)_[a-z]+/", "\\1", x)
y
## [1] "CAPITAL_LETTERS123" "CAPITAL_LETTERS_123_LETTERS" "CAPITAL_LETTERS_LETTERS_123" "LETTERS123"
## [5] "LETTERS_123_LETTERS" "LETTERS_LETTERS_123"
# Now we can see that if the result has 1 or 3 underscores,
# the leading CAPITAL_ part is present and must be stripped as well.
ifelse(sapply(gregexpr("_", y), FUN = function(X) length(X[X != -1])) %in% c(1, 3), gsub("[A-Z]+_(.*)", "\\1", y), y)
## [1] "LETTERS123" "LETTERS_123_LETTERS" "LETTERS_LETTERS_123" "LETTERS123" "LETTERS_123_LETTERS"
## [6] "LETTERS_LETTERS_123"
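The two steps can be combined into a single helper, sketched here and run against the folder names from the question. The function name `extract_name` is hypothetical, and the sketch keeps the same assumption as the code above: the name itself contains either 0 or 2 underscores, so a count of 1 or 3 signals a leading CAPITAL_ part.

```r
# Hypothetical helper combining both steps.
extract_name <- function(x) {
  # Step 1: drop the "digits_Letters_" prefix and the "_extension/" suffix.
  y <- sub("^[0-9]+_[A-Z]+[a-z]*_(.*)_[a-z]+/$", "\\1", x)
  # Step 2: count the underscores left in each element.
  n_underscores <- lengths(regmatches(y, gregexpr("_", y, fixed = TRUE)))
  # With 1 or 3 underscores a CAPITAL_ prefix is assumed to be present; remove it.
  ifelse(n_underscores %in% c(1, 3), sub("^[A-Z]+_", "", y), y)
}

folders <- c("130615_Screen_II_SN_KB_3_lxb/",
             "130615_Screen_II_AL343_lxb/",
             "130615_Screen_II_HK_344_LM_lxb/",
             "130615_Screen_HK_344_LM_lxb/")
result <- extract_name(folders)
print(result)
## [1] "SN_KB_3"   "AL343"     "HK_344_LM" "HK_344_LM"
```

If names with a different number of underscores can occur, the count-based rule no longer distinguishes the two cases and would need to be revisited.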