Question

I have an R data frame with movie names like this:

Shawshank Redemption, The
Godfather II, The
Band of Brothers

I would like to display these names as:

The Shawshank Redemption
The Godfather II
Band of Brothers

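Can anyone help me figure out how to check each row of the data frame to see whether there is a "The" after the comma (as above) and, if so, move it to the front of the title?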

Answer 1

You can use gsub:

df$movies2 = gsub("^([\\w\\s]+),*\\s*([Tt]he*($|(?=\\s\\(\\d{4}\\))))", "\\2 \\1", df$movies, perl = TRUE)

Result:

> df
                            movies                         movies2
1 Shawshank Redemption, The (1994) The Shawshank Redemption (1994)
2                Godfather II, The                The Godfather II
3                 Band of Brothers                Band of Brothers
4               Dora, The Explorer              Dora, The Explorer
5             Kill Bill Vol. 2 The            Kill Bill Vol. 2 The
6                  ,The Highlander                 ,The Highlander
7                   Happening, the                   the Happening

Data:

df = data.frame(movies = c("Shawshank Redemption, The (1994)", "Godfather II, The", "Band of Brothers", "Dora, The Explorer", "Kill Bill Vol. 2 The", ",The Highlander", "Happening, the"), stringsAsFactors = FALSE)

Notes:

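The goal of the whole regex is to capture the first part (everything before the ,) and the second part (the "The" after the ,, but only when it sits at the end of the string or right before a (year)) in separate capture groups, which I can then swap using \\1 and \\2.

^([\\w\\s]+) matches one or more word characters or whitespace characters, starting from the beginning of the string

,*\\s* matches zero or more commas followed by zero or more whitespace characters

[Tt]he* matches "The" or "the". Strictly speaking, the * applies only to the e, so this part matches T or t, then h, then zero or more e

Note that it is followed by ($|(?=\\s\\(\\d{4}\\))): $ matches the end of the string, and (?=\\s\\(\\d{4}\\)) is a positive lookahead that checks whether the preceding pattern is followed by \\s\\(\\d{4}\\)

\\s\\(\\d{4}\\) matches a whitespace character followed by (4 digits), including the parentheses. The double backslashes are needed because a backslash must itself be escaped inside an R string

So ([Tt]he*($|(?=\\s\\(\\d{4}\\)))) matches "The" or "the" either at the end of the string or followed by (4 digits)

Everything in parentheses is a capture group, so \\2 \\1 swaps the first capture group ([\\w\\s]+) with the second ([Tt]he*($|(?=\\s\\(\\d{4}\\))))

If a string has no "The" in that position, the pattern does not match at all, and gsub returns the original string unchanged.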
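One caveat is worth illustrating. Because [Tt]he* also accepts a bare "th" and the comma is optional, the pattern above can rearrange a title that merely ends in "th". The following is a minimal sketch of a stricter variant, not a replacement for the answer's regex. The function name move_article and the requirement that the article always follows a comma are assumptions made for illustration.

```r
# Pattern from the answer above, unchanged
pattern <- "^([\\w\\s]+),*\\s*([Tt]he*($|(?=\\s\\(\\d{4}\\))))"

# A title that only ends in "th" is still rearranged
macbeth <- gsub(pattern, "\\2 \\1", "Macbeth", perl = TRUE)
macbeth
# "th Macbe"

# Hypothetical stricter variant (assumption: the article always follows a comma).
# Group 1: everything up to the last comma, ending in a non-space character.
# Group 2: "The" or "the", which must be followed by the end of the string
#          or by a space and a 4-digit year in parentheses (lookahead only).
move_article <- function(titles) {
  sub("^(.*\\S),\\s*([Tt]he)(?=$|\\s\\(\\d{4}\\))", "\\2 \\1",
      titles, perl = TRUE)
}

moved <- move_article(c("Shawshank Redemption, The (1994)",
                        "Happening, the",
                        "Macbeth",
                        "Dora, The Explorer"))
moved
# "The Shawshank Redemption (1994)", "the Happening", "Macbeth", "Dora, The Explorer"
```

Requiring the comma also lets the first group accept any character, so a title containing punctuation is handled as long as the ", The" comes last.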

Answer 2

This seems to work for me:

#create a vector of movies
x=c("Shawshank Redemption, The", "Godfather II, The", "Band of Brothers")

#use grep to find those with ", The" at the end
the.end=grep(", The$",x)

#trim movie titles to remove ", The"
trimmed=strtrim(x[the.end],nchar(x[the.end])-5)

#add "The " to the beginning of the trimmed titles
final=paste("The",trimmed)

#replace the trimmed elements of the movie vector
x[the.end]<-final

#take a look
x

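Note that this will not remove ", The" from anywhere in a name except the end... I assume that is the behavior you want. It will also miss any "The" that has no comma before it, as well as a lowercase "the". To see what I mean, try this as your initial movie vector: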

#create a vector of movies
x=c("Shawshank Redemption, The", "Godfather II, The", "Band of Brothers",
    "Dora, The Explorer", "Kill Bill Vol. 2 The", ",The Highlander",
    "Happening, the")
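To make that comparison easier to rerun, the steps above can be wrapped in a small function and applied to the extended vector. This is only a sketch: the name move_the and the early return when nothing matches are additions for illustration, not part of the original steps.

```r
# Hypothetical wrapper around the grep/strtrim/paste steps shown above
move_the <- function(x) {
  # indices of titles that end in ", The"
  the.end <- grep(", The$", x)
  # nothing to move: return the input unchanged
  if (length(the.end) == 0) return(x)
  # drop the trailing ", The" (5 characters)
  trimmed <- strtrim(x[the.end], nchar(x[the.end]) - 5)
  # put "The " in front of the trimmed titles
  x[the.end] <- paste("The", trimmed)
  x
}

x <- c("Shawshank Redemption, The", "Godfather II, The", "Band of Brothers",
       "Dora, The Explorer", "Kill Bill Vol. 2 The", ",The Highlander",
       "Happening, the")

result <- move_the(x)
result
# Only the first two titles change; "Happening, the" and
# "Kill Bill Vol. 2 The" are left as they are.
```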

Moving text within a sentence in R
