Question

I can't understand what the regular expressions in the following lines of code mean.

author = "10_1 A Kumar; Ahmed Hemani ; Johnny &Ouml;berg<"

# after some experiment, it looks like this line captures whatever is in
# front of the underscore.
authodid =  sub("_.*","",author)

# this line extracts the number after the underscore, but I don't know 
# how this is achieved
paperno <- sub(".*_(\\w*)\\s.*", "\\1", author)

# this line extracts the string after the numbers
# I also have no idea how this is achieved through the code
coauthor <- gsub("<","",sub("^.*?\\s","", author))

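I read online that the first argument is the pattern, the second is the replacement, and the third is the object to operate on. I also read some posts on SO and learned that \\w means a word and \\s is a space.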

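However, some things are still unclear. If \\w means a word, does it stand for the next word? If not, how should I interpret it? I learned that ^ matches the start of the string, but what about the period after the ^?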

More importantly, how should I interpret _.*? What about .*_ and ^.*?\\s? How should I read them?

Thanks!

Answer 1

OK. That's a lot of questions. First things first.

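sub("_.*","",author) looks for an _ and everything after it. So in your case _.* matches _1 A Kumar; Ahmed Hemani ; Johnny Öberg<. The function sub replaces that match with '' (so, in effect, it deletes it), and you end up with 10.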
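If you want to check this yourself, here is a minimal sketch using the string from the question exactly as posted. The variable names matched and authorid are only for illustration; regmatches() and regexpr() are base R functions that return the text a pattern matched.

```r
author <- "10_1 A Kumar; Ahmed Hemani ; Johnny &Ouml;berg<"

# Show exactly which part of the string "_.*" matches:
# the first underscore and everything after it
matched <- regmatches(author, regexpr("_.*", author))
print(matched)   # "_1 A Kumar; Ahmed Hemani ; Johnny &Ouml;berg<"

# sub() replaces that matched part with the empty string,
# so only the text in front of the underscore is left
authorid <- sub("_.*", "", author)
print(authorid)  # "10"
```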

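sub(".*_(\\w*)\\s.*", "\\1", author) is trickier (for no good reason). Strictly speaking, it doesn't extract anything; it replaces. If you change the code to sub(".*_(\\w*)\\s.*", "222", author), the result is 222 (instead of 1). So whatever you put in the second argument is what you get back. Why is that? Because ".*_(\\w*)\\s.*" matches the whole string: .*_ matches 10_; (\\w*) matches 1; and \\s.* matches the space and everything after it (that is, the rest of the string). The replacement "\\1" is a backreference to the text captured by the first pair of parentheses, so the whole string is replaced by 1.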
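A short sketch of that behaviour, again with the string from the question. The names pattern, fixed and paperno are my own choices for the example.

```r
author <- "10_1 A Kumar; Ahmed Hemani ; Johnny &Ouml;berg<"
pattern <- ".*_(\\w*)\\s.*"

# The pattern matches the entire string, so a literal replacement
# becomes the whole result
fixed <- sub(pattern, "222", author)
print(fixed)    # "222"

# "\\1" is a backreference: it stands for whatever the first
# parenthesized group, (\\w*), captured
paperno <- sub(pattern, "\\1", author)
print(paperno)  # "1"
```

Because the match covers the whole string and the replacement is only the captured group, the call behaves like an extraction even though sub() is really a replace function.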

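gsub("<","",sub("^.*?\\s","", author)) contains two function calls. The first is sub("^.*?\\s","", author). It matches everything from the start of the string up to and including the first space: ^ anchors the match at the start, . matches any character, and the ? after * makes the match lazy, so it stops at the first space instead of the last one. So ^.*?\\s matches 10_1 plus the space after it, and sub removes that. You end up with A Kumar; Ahmed Hemani ; Johnny Öberg<. The second call, gsub, then removes every '<' from that result.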
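To see what the ? changes, this sketch compares the lazy pattern with a greedy version of it. The greedy pattern and the names lazy, greedy and coauthor are only there for comparison; they are not part of the original code.

```r
author <- "10_1 A Kumar; Ahmed Hemani ; Johnny &Ouml;berg<"

# Lazy: .*? matches as little as possible, so the match ends at the FIRST space
lazy <- sub("^.*?\\s", "", author)
print(lazy)      # "A Kumar; Ahmed Hemani ; Johnny &Ouml;berg<"

# Greedy: .* matches as much as possible, so the match ends at the LAST space
greedy <- sub("^.*\\s", "", author)
print(greedy)    # "&Ouml;berg<"

# gsub() then removes every "<" from the lazy result
coauthor <- gsub("<", "", lazy)
print(coauthor)  # "A Kumar; Ahmed Hemani ; Johnny &Ouml;berg"
```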

I hope this helps.
