Manipulating strings in R

Time: 2013-01-21 02:24:59

Tags: string r stata

I have the following Stata code, which I am trying to translate into R.

dataframe

    y1  y2  y3  y4  y5  y6  y11 y12 y13 y14 y15 y16
    5   0   0   0   0   0   0   0   0   0   0   0
    5   0   0   0   0   0   0   0   0   0   0   0
    5   0   0   0   0   0   0   0   0   0   0   0
    5   0   0   0   0   0   0   0   0   0   0   0
    5   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   1   2   1   2   0   0
    0   0   0   0   0   0   1   1   1   2   0   0
    0   0   0   0   0   0   1   8   1   2   0   0
    0   0   0   0   0   0   1   1   1   2   0   0
    0   0   0   0   0   0   1   1   1   2   0   0
    1   1   0   0   0   0   0   0   0   0   0   0
    1   1   0   0   0   0   0   0   0   0   0   0
    1   1   0   0   0   0   0   0   0   0   0   0
    1   1   0   0   0   0   0   0   0   0   0   0
    2   2   5   1   1   2   2   2   1   1   2   1

local z1 "y1 y12 y3 y4 y5 y6"
local z2 "y11 y12 y13 y14 y15 y16"
local i = 1
local n : word count `z1'
gen k=.

while `i'<=`n' {
    local z1`i' : word `i' of `z1'
    local z2`i' : word `i' of `z2'
    replace k=max(0,`z1`i'')*(`z2`i''==2|`z2`i''==1)
    local i=`i'+1
}

Expected output:

k
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
2

I tried the following R code as an equivalent:

      dataframe$z1<- "y1 y12 y3 y4 y5 y6"
      dataframe$z2<- "y11 y12 y13 y14 y15 y16"
      i<-  1
      n<-sapply(gregexpr("\\W+", z1), length) + 1
      dataframe$k<-NA

    for (j in i:n){
  .... #I wanted to refer to each word of z1 
  ...#e.g.,dataframe$z1[i]<-which gives word i of z1 
  .. #I wanted to refer to each word of z2
  ... #e.g.,dataframe$z2[i]<-which gives word i of z2 

   dataframe$k<-with(dataframe, pmax(0,z1[j])*ifelse(z2[j] %in% c(1,2),1,0))

}

The dotted lines mark the places where I could not find the equivalent R code. I would be grateful for any help with this.

    # Updated Stata code and data (number of observations reduced to 10)

y1  y2  y3  y4  y5  y6  y11 y12 y13 y14 y15 y16
5   0   0   0   0   0   0   0   0   0   0   0
5   0   0   0   0   0   0   0   0   0   0   0
5   0   0   0   0   0   0   0   0   0   0   0
5   0   0   0   0   0   0   0   0   0   0   0
5   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0

y111    y112    y113    y114    y115    y116    y1111   y1112   y1113   y1114   y1115   y1116
1   0   0   0   0   0   81000   0   0   0   0   0
1   0   0   0   0   0   86000   0   0   0   0   0
1   0   0   0   0   0   96000   0   0   0   0   0
1   0   0   0   0   0   84000   0   0   0   0   0
1   0   0   0   0   0   76000   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0

    local z1 "y1 y2 y3 y4 y5 y6"
    local z2 "y11 y12 y13 y14 y15 y16"
    local z3 "y111 y112 y113 y114 y115 y116"
    local z4 "y1111 y1112 y1113 y1114 y1115 y1116"
    local i = 1
    local n : word count `z1'
    gen k=.
    gen r=0
    gen s=0
    gen t=0
    while `i'<=`n' {
        local z1`i' : word `i' of `z1'
        local z2`i' : word `i' of `z2'
        local z3`i' : word `i' of `z3'
        local z4`i' : word `i' of `z4'

        replace k=max(0,`z4`i'')*(`z1`i''==5|`z1`i''==10|`z2`i''==2|`z2`i''==1|`z3`i''==1)
        replace r=r+k if `i'<=3
        replace s=s+k if `i'>3
        replace t=t+k
        local i=`i'+1
    }

#Expected output

t       r       s   k
81000   81000   0   0
86000   86000   0   0
96000   96000   0   0
84000   84000   0   0
76000   76000   0   0
0       0       0   0
0       0       0   0
0       0       0   0
0       0       0   0
0       0       0   0

4 answers:

Answer 0 (score: 2)

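The Stata code as posted does not make sense. Given the data, the code loops over the variables y1, ..., y6 and the variables y11, ..., y16. It initializes a new variable k, but whatever the earlier variables contain, the result is always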

max(0, y6) * (y16 == 2|y16 == 1)

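which should be more transparent to R users than most of the code. Stata's max() function returns the larger of its arguments and works row-wise, that is, observation by observation.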

I doubt that this is what was intended, but I won't try to guess what was.
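One caveat for R readers: R's max() collapses all of its arguments to a single number, so the element-wise counterpart of Stata's max() on variables is pmax(). A minimal sketch, in which y6 and y16 are made-up toy vectors rather than the question's data:

```r
# Toy stand-ins for two columns; the values are illustrative only.
y6  <- c(2, -1, 0, 3)
y16 <- c(1, 2, 0, 5)

# max() reduces everything to one number.
overall <- max(0, y6)

# pmax() compares element by element, like Stata's max() on variables.
k <- pmax(0, y6) * (y16 == 2 | y16 == 1)

print(overall)
print(k)
```

Here k keeps one value per row, which is what the Stata expression produces for each observation.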

Answer 1 (score: 2)

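Nick makes a great point: your max call does not reference the previous k, so the loop collapses to a check of the sixth pair of columns. Here is the R equivalent, assuming you really want the row maximum. I first stored the data in a text file.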

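# Read the whitespace-delimited data; the first row holds the column names.
data_all <- read.table("data.txt", header = TRUE)
data_one <- data_all[, 1:6]   # y1 ... y6
data_two <- data_all[, 7:12]  # y11 ... y16
# Keep x wherever the paired y equals 1 or 2; otherwise return 0.
my_fun_one <- function(x, y) {
  x * ((y == 1) | (y == 2))
}
# Apply the function column pair by column pair; the result is a matrix.
data_three <- mapply(FUN = my_fun_one, x = data_one, y = data_two)
# Row maximum, floored at 0.
my_fun_two <- function(x) {
  max(x, 0)
}
k <- apply(data_three, 1, FUN = my_fun_two)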

This yields

> k
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5

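Update: here is a solution to your full, updated question. It uses more or less the same building blocks. Once you are comfortable with the basics of R, I think you will get the most mileage out of apply, lapply, and mapply.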

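# Each file holds one of the two updated tables, with a header row.
data_one <- read.table("data_one.txt", header = TRUE)
data_two <- read.table("data_two.txt", header = TRUE)
z1 <- data_one[, 1:6]   # y1 ... y6
z2 <- data_one[, 7:12]  # y11 ... y16
z3 <- data_two[, 1:6]   # y111 ... y116
z4 <- data_two[, 7:12]  # y1111 ... y1116
# max(0, z) times the indicator from the Stata replace statement.
my_fun <- function(w, x, y, z) {
  z * (z > 0) * ((w %in% c(5, 10)) | (x %in% c(1, 2)) | (y == 1))
}
z5 <- mapply(FUN = my_fun, w = z1, x = z2, y = z3, z = z4)
r <- rowSums(z5[, 1:3])  # passes 1 to 3
s <- rowSums(z5[, 4:6])  # passes 4 to 6
t <- rowSums(z5)         # all passes
k <- z5[, ncol(z5)]      # value left by the final pass
data_three <- data.frame(t, r, s, k)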

This yields:

> data_three
       t     r s k
1  81000 81000 0 0
2  86000 86000 0 0
3  96000 96000 0 0
4  84000 84000 0 0
5  76000 76000 0 0
6      0     0 0 0
7      0     0 0 0
8      0     0 0 0
9      0     0 0 0
10     0     0 0 0
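To see the mapply building block in isolation, here is a small sketch on made-up data. The data frames a and b and their values are hypothetical; the point is only that mapply() pairs up the columns of its arguments and simplifies the results to a matrix.

```r
# Two toy data frames with matching shapes (illustrative values).
a <- data.frame(u = c(1, 2), v = c(3, 4))
b <- data.frame(p = c(10, 20), q = c(30, 40))

# mapply() passes column 1 of each, then column 2 of each, to the function.
m <- mapply(FUN = function(x, y) x + y, x = a, y = b)

print(m)           # a 2 x 2 matrix, one column per column pair
print(rowSums(m))  # row totals across the column pairs
```

Because the result is an ordinary matrix, rowSums() and column indexing apply to it directly, which is how r, s, t, and k are derived in the solution.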

Answer 2 (score: 2)

Here is a shorter version of the original Stata code. It takes as given the Stata variables (columns, vectors) y1 ... y6 and y11 ... y16.

gen k = .

forval i = 1/6 {
    replace k = max(0, y`i') * (y1`i' == 2|y1`i' == 1)
} 

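The forval loop runs over 1, 2, 3, 4, 5, 6. Macro substitution means that on the first pass through the loop the RHS is max(0, y1) * (y11 == 2|y11 == 1), and on the last pass it is max(0, y6) * (y16 == 2|y16 == 1). The result of the loop is therefore inevitably the result of the last calculation.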

(Edit) I confirm that the local statements are not needed.

(Second edit) I am also assuming that y12 in the original local z1 "y1 y12 y3 y4 y5 y6" is a typo for y2.
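For the R side of the question, a literal translation of this loop might look like the sketch below. It is only an illustration on a made-up three-row data frame with two column pairs instead of six, and it assumes the y12 typo has been corrected to y2. strsplit() stands in for Stata's word lists, and [[ ]] looks up a column by name.

```r
# Toy data frame; the values are illustrative, not the question's data.
dataframe <- data.frame(
  y1  = c(5, 0, 2), y2  = c(0, 0, 2),
  y11 = c(0, 1, 2), y12 = c(0, 2, 2)
)

# Counterpart of Stata's local word lists: split a string into words.
z1 <- strsplit("y1 y2", " ", fixed = TRUE)[[1]]
z2 <- strsplit("y11 y12", " ", fixed = TRUE)[[1]]

dataframe$k <- NA_real_
for (i in seq_along(z1)) {
  a <- dataframe[[z1[i]]]  # column named by word i of z1
  b <- dataframe[[z2[i]]]  # column named by word i of z2
  # As in the Stata loop, k is overwritten on every pass.
  dataframe$k <- pmax(0, a) * (b %in% c(1, 2))
}
print(dataframe$k)
```

As with the Stata version, only the last pass survives in k, so the loop is equivalent to a single expression on the final column pair.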

Answer 3 (score: 1)

The Stata code can be simplified, as already signalled, to

gen k = .
gen r = 0
gen s = 0
gen t = 0
quietly forval i = 1/6 {
    replace k = max(0, y111`i')*(y`i'==5|y`i'==10|y1`i'==2|y1`i'==1|y11`i'==1)
    replace r = r+k if `i'<=3
    replace s = s+k if `i'>3
    replace t = t+k
}

The revised code does make clear why overwriting k is fine: each new value of k is used right away, before the next pass replaces it.
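The same overwrite-then-accumulate pattern can be sketched in R. This is a reduced illustration, not a full translation: it uses a made-up two-row data frame, only two passes, only the y == 5 or 10 condition, and the hypothetical helper vectors codes and amounts to hold the column names.

```r
# Toy data: two passes instead of six; values are illustrative.
df <- data.frame(
  y1    = c(5, 0),     y2    = c(0, 0),   # codes checked against 5 or 10
  y1111 = c(81000, 0), y1112 = c(0, 700)  # amounts
)
codes   <- c("y1", "y2")
amounts <- c("y1111", "y1112")

df$t <- 0
for (i in seq_along(codes)) {
  # k is overwritten on each pass ...
  k <- pmax(0, df[[amounts[i]]]) * (df[[codes[i]]] %in% c(5, 10))
  # ... but it is added to the running total before the next overwrite.
  df$t <- df$t + k
}
print(df$t)
print(k)
```

After the loop, t holds the sum over all passes while k holds only the value from the final pass, which matches the roles of t and k in the expected output.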