根据两个字符列之间的差异创建R data.frame列

时间:2016-05-29 01:12:41

标签: r string dataframe

我有一个data.frame,df,其中我有2列,一列是歌曲的标题,另一列是合并的标题和艺术家。我希望创建一个单独的艺术家领域。 前三行显示在这里

title                               titleArtist
I'll Never Smile Again  I'll Never Smile Again TOMMY DORSEY & HIS ORCHESTRA / FRANK SINATRA & PIED PIPERS
Imagination         Imagination GLENN MILLER & HIS ORCHESTRA / RAY EBERLE
The Breeze And I    The Breeze And I JIMMY DORSEY & HIS ORCHESTRA / BOB EBERLY

此代码

对此数据集没有任何问题
library(stringr)
library(dplyr)

 df %>% 
 head(3) %>% 
 mutate(artist=str_to_title(str_trim(str_replace(titleArtist,title,"")))) %>% 
 select(artist,title)

 artist                                                         title
1 Tommy Dorsey & His Orchestra / Frank Sinatra & Pied Pipers I'll Never Smile Again
2                  Jimmy Dorsey & His Orchestra / Bob Eberly       The Breeze And I
 3                  Glenn Miller & His Orchestra / Ray Eberle            Imagination

但是当我将它应用于数千行时,我得到了错误

Error: Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)

#or for part of the mutation

df$artist <-str_replace(df$titleArtist,df$title,"")

Error in stri_replace_first_regex(string, pattern, replacement, opts_regex =    attr(pattern,  : 
 Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)

我已从列中删除所有括号,代码似乎在我收到错误之前工作了一段时间

Error: Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)

是否是另一个可能导致问题的特殊角色,或者可能是其他角色?

TIA

2 个答案:

答案 0 :(得分:2)

您的一般问题是str_replace将您的artist值视为正则表达式,因此除括号外的特殊字符会导致很多潜在错误。 stringi包装和简化的stringr库允许更细粒度的控件,包括将参数视为固定字符串而不是正则表达式。我没有原始数据,但是当我在以下位置抛出一些导致错误的字符时,这是有效的。

library(dplyr)
library(stringi)


df = data_frame(title = c("I'll Never Smile Again (",  "Imagination.*", "The Breeze And I(?>="),
           titleArtist = c("I'll Never Smile Again ( TOMMY DORSEY & HIS ORCHESTRA / FRANK SINATRA & PIED PIPERS",
                            "Imagination.* GLENN MILLER & HIS ORCHESTRA / RAY EBERLE",
                            "The Breeze And I(?>= JIMMY DORSEY & HIS ORCHESTRA / BOB EBERLY"))

df %>%
  mutate(artist=stri_trans_totitle(stri_trim(stri_replace_first_fixed(titleArtist,title,"")))) %>% 
  select(artist,title)

结果:

Source: local data frame [3 x 2]

artist                     title
(chr)                     (chr)
1 Tommy Dorsey & His Orchestra / Frank Sinatra & Pied Pipers I'll Never Smile Again (
2                  Glenn Miller & His Orchestra / Ray Eberle             Imagination.*
3                  Jimmy Dorsey & His Orchestra / Bob Eberly      The Breeze And I(?>=

答案 1 :(得分:0)

 df <- data.frame(ID=11:13, T_A=c('a/b','b/c','x/y'))  # T_A Title/Artist 
   ID T_A
 1 11 a/b
 2 12 b/c
 3 13 x/y

 # Title Artist are separated by /
 > within(df, T_A<-data.frame(do.call('rbind', strsplit(as.character(T_A), '/', fixed=TRUE))))
  ID T_A.X1 T_A.X2
 1 11      a      b
 2 12      b      c
 3 13      x      y