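R: Split a data frame by new lines in a column

Asked: 2015-04-29 04:24:35

Tags: regex r substring

I am trying to split the strings in a column on the newline character "\n". Here is the data frame, test_data: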

 test_data <- data.frame(ID=c('john@xxx.com', 'sally@xxx.com'),
                  Changes=c('3 max cost changes
  productxyz > pb100  > a : Max cost decreased from $0.98 to $0.83
  productxyz > pb2  > a : Max cost decreased from $1.07 to $0.91
  productxyz > pb2  > b : Max cost decreased from $0.65 to $0.55', 
                            '2 max cost changes
  productabc > pb1000  > d : Max cost decreased from $1.07 to $0.91
  productabc > pb1000  > x : Max cost decreased from $1.44 to $1.22'), stringsAsFactors=FALSE)

My goal is to extract the prices into columns and get a result set like this:

ID              Prev_Price    New_Price
john@xxx.com     $0.98            $0.83
john@xxx.com     $1.07            $0.91
john@xxx.com     $0.65            $0.55
sally@xxx.com    $1.07            $0.91
sally@xxx.com    $1.44            $1.22

I tried using the tidyr package, but the result is full of NAs.

vars <- c("Prev_Price","New_Price")
separate(test_data, Changes, into = vars, sep = "[A-Za-z]+from", extra= "drop")

Any help is much appreciated.

Thanks!

2 Answers:

Answer 0 (score: 3)

Try

df1$ID <- df1$ID[df1$ID!=''][cumsum(df1$ID!='')]
library(stringi)
setNames(data.frame(df1$ID, do.call(rbind,stri_extract_all(df1$Changes, 
       regex='\\$\\d*'))), c('ID', 'Prev_Price', 'New_Price'))
 #   ID Prev_Price New_Price
 #1  A        $20       $10
 #2  A        $11       $10
 #3  B        $13       $12
 #4  B        $15       $12
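The first line of the snippet above fills in blank IDs before extracting prices. A minimal sketch of how that idiom works, assuming (as in the example data) that a blank ID means "same as the row above" and that the first ID is never blank; the vector `ids` is a hypothetical stand-in for `df1$ID`:

```r
# Hypothetical input: blank entries inherit the ID from the row above.
ids <- c("A", "", "B", "")

is_new <- ids != ""            # TRUE where a new ID starts
group  <- cumsum(is_new)       # running count of non-blank IDs seen so far
filled <- ids[is_new][group]   # look up the most recent non-blank ID per row

print(filled)
```

If the first element were blank, `group` would start at 0 and that row would be dropped silently, so this only suits data that begins with a real ID.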

Or

library(tidyr)
extract(df1, Changes, into=c('Prev_Price', 'New_Price'), 
          '[^$]*(\\$\\d*)[^$]*(\\$\\d*)')
#   ID Prev_Price New_Price
#1  A        $20       $10
#2  A        $11       $10
#3  B        $13       $12
#4  B        $15       $12
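To see what the two capture groups in that pattern pick up, here is a small base R sketch using `regexec` and `regmatches` instead of tidyr (a swapped-in technique for illustration only). The input strings are hypothetical rows. It also suggests why the later update switches to `[0-9.]+`: `\\d*` stops at the decimal point.

```r
pattern <- "[^$]*(\\$\\d*)[^$]*(\\$\\d*)"

# Whole-dollar amounts: each group captures one price.
x <- "down from $20 to $10"
groups <- regmatches(x, regexec(pattern, x))[[1]][-1]  # drop the full match
print(groups)

# Decimal amounts: \\d* stops before the ".", so the cents are lost.
y <- "Max cost decreased from $0.98 to $0.83"
truncated <- regmatches(y, regexec(pattern, y))[[1]][-1]
print(truncated)

# Allowing dots inside the number keeps the full price.
pattern_dec <- "[^$]*(\\$[0-9.]+)[^$]*(\\$[0-9.]+)"
full <- regmatches(y, regexec(pattern_dec, y))[[1]][-1]
print(full)
```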

Or

library(data.table) # v1.9.5+
setDT(df1)[, c('Prev_Price', 'New_Price') := tstrsplit(Changes, '[A-Za-z ]+')[-1]][]
#   ID              Changes Prev_Price New_Price
#1:  A down from $20 to $10        $20       $10
#2:  A down from $11 to $10        $11       $10
#3:  B down from $13 to $12        $13       $12
#4:  B down from $15 to $12        $15       $12

Note: the "Changes" column can be removed.

Or using only base R methods

data.frame(ID=df1$ID, read.table(text=gsub('[^$]*(\\$\\d+)', ' \\1 ', df1$Changes),
           col.names=c('Prev_Price', 'New_Price'), stringsAsFactors=FALSE))
#  ID Prev_Price New_Price
#1  A        $20       $10
#2  A        $11       $10
#3  B        $13       $12
#4  B        $15       $12

Update

If the elements are in the same cell, one option is to use the devel version of data.table, i.e. v1.9.5+. It can be installed from here.

Here, we use the same code to split the 'Changes' column (tstrsplit(Changes, ..)), then convert the output to long format with melt, specifying measure.vars as a list, order by 'ID' if needed, and remove the unwanted column ('variable').

melt(setDT(df2)[, paste0('V',1:4) := tstrsplit(Changes, '[A-Za-z ]+')[-1]][,-2, with=FALSE],
     id.var='ID', measure=list(c('V1', 'V3'), c('V2', 'V4')),
     value.name=c('Prev_Price', 'New_Price'))[order(ID)][, variable:=NULL]
#   ID Prev_Price New_Price
#1:  A        $20       $10
#2:  A        $11       $10
#3:  B        $13       $12
#4:  B        $15       $12

Or we can use gsub as before, and then convert to long format using reshape from base R

d1 <- data.frame(ID=df2$ID, read.table(text=gsub('[^$]*(\\$\\d+)', ' \\1 ', df2$Changes)))
colnames(d1)[-1] <- paste0(c('Prev_Price.', 'New_Price.'), rep(1:2,each=2))
reshape(d1, idvar='ID', varying=2:ncol(d1), sep=".", direction='long')
#    ID time Prev_Price New_Price
#A.1  A    1        $20       $10
#B.1  B    1        $13       $12
#A.2  A    2        $11       $10
#B.2  B    2        $15       $12

Update2

For the new dataset ("df3"), we can use stri_extract_all_regex to extract the $ followed by numbers, including decimals ('\\$[0-9.]+'), from the "Changes" column. We then use Map to combine the first column with the list output of stri_extract_all_regex, after converting each list element to a matrix (because alternating elements need to go in different columns), and finally bind the pieces together by row (do.call(rbind, ...)).

library(stringi)
res <- do.call(rbind,
       Map(function(x,y) data.frame(x,matrix(y, ncol=2, byrow=TRUE, 
           dimnames=list(NULL, c("Prev_Price", "New_Price")))),
        df3$ID, stri_extract_all_regex(df3$Changes, '\\$[0-9.]+')))
row.names(res) <- NULL
res
#              x Prev_Price New_Price
#1  john@xxx.com      $0.98     $0.83
#2  john@xxx.com      $1.07     $0.91
#3  john@xxx.com      $0.65     $0.55
#4 sally@xxx.com      $1.07     $0.91
#5 sally@xxx.com      $1.44     $1.22
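For readers without stringi, roughly the same result can be sketched in base R with `gregexpr` and `regmatches`. This is an alternative under stated assumptions, not the answerer's method: it assumes every change line contains exactly two prices in "previous, new" order, and the `df3` below is a shortened, hypothetical stand-in built from values in the question.

```r
# Hypothetical stand-in for df3: one multi-line string per ID.
df3 <- data.frame(
  ID = c("john@xxx.com", "sally@xxx.com"),
  Changes = c(
    "2 max cost changes\n  a : Max cost decreased from $0.98 to $0.83\n  b : Max cost decreased from $1.07 to $0.91",
    "1 max cost changes\n  d : Max cost decreased from $1.44 to $1.22"
  ),
  stringsAsFactors = FALSE
)

# All "$" tokens followed by digits/dots, one character vector per row.
prices <- regmatches(df3$Changes, gregexpr("\\$[0-9.]+", df3$Changes))

# Consecutive tokens form (previous, new) pairs; repeat each ID once per pair.
n_pairs <- lengths(prices) %/% 2
m <- matrix(unlist(prices), ncol = 2, byrow = TRUE)

res <- data.frame(
  ID = rep(df3$ID, times = n_pairs),
  Prev_Price = m[, 1],
  New_Price = m[, 2],
  stringsAsFactors = FALSE
)
print(res)
```

Because the pattern finds every price in the cell, there is no need to split on "\n" first; the newline only separates the pairs.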

Answer 1 (score: 1)

df <- data.frame(ID=c('A','','B',''),
                 Changes=c('down from $20 to $10','down from $11 to $10',
                           'down from $13 to $12','down from $15 to $12'),
                 stringsAsFactors=FALSE)
# split each string on whitespace; tokens 3 and 5 are the two prices
with(list(ss=strsplit(df$Changes,'\\s+')),
     transform(df, ID=ID[ID!=''][cumsum(ID!='')],
               Prev_Price=sapply(ss, function(v) v[3]),
               New_Price=sapply(ss, function(v) v[5]),
               Changes=NULL))
##   ID Prev_Price New_Price
## 1  A        $20       $10
## 2  A        $11       $10
## 3  B        $13       $12
## 4  B        $15       $12

Another approach:

with(df, cbind(ID=ID[ID!=''][cumsum(ID!='')],
               setNames(as.data.frame(do.call(rbind, strsplit(Changes,'\\s+'))[,c(3,5)]),
                        c('Prev_Price','New_Price'))))
## same result
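Both variants rely on the prices being the third and fifth whitespace-separated tokens. A short sketch of that step on one hypothetical row, with an optional conversion to numeric added as an assumption about what a reader might want next:

```r
x <- "down from $20 to $10"   # hypothetical row value

# Split on runs of whitespace; positions are fixed by the sentence layout.
tokens <- strsplit(x, "\\s+")[[1]]
price_tokens <- tokens[c(3, 5)]
print(price_tokens)

# Optional: strip the leading "$" and convert to numbers.
price_values <- as.numeric(sub("^\\$", "", price_tokens))
print(price_values)
```

This positional approach breaks if the wording of the line changes (for example, the longer "Max cost decreased from ..." lines in the question), where a regex on the "$" amounts is more robust.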