Recoding from a separate code table

Time: 2016-11-02 17:26:29

Tags: r match recode

I have a dataset like the following:


dat1 <- read.table(header=TRUE, text="
ID Age Align Weat
8645 15-24 A 1
6228 15-24 B 1
5830 15-24 A 3
1844 25-34 B 1
4461 35-44 B 2
2119 35-44 C 2
2115 45-54 A 1
")
dat1
    ID   Age Align Weat
1 8645 15-24     A    1
2 6228 15-24     B    1
3 5830 15-24     A    3
4 1844 25-34     B    1
5 4461 35-44     B    2
6 2119 35-44     C    2
7 2115 45-54     A    1

The attributes of the columns Age, Align, and Weat are described in a code data frame:

dat2 <- read.table(header=TRUE, text="
                   Code  Desc  Column
                   15-24    Young  Age
                   25-34    Young  Age
                   35-44    Middle  Age
                   45-54    Middle  Age
                   A    Straight  Align
                   B    Curve  Align
                   C    Hill  Align
                   1    Clear  Weat
                   2    Cloudy  Weat
                   3    Rain  Weat
                   ")
dat2
    Code     Desc Column
1  15-24    Young    Age
2  25-34    Young    Age
3  35-44   Middle    Age
4  45-54   Middle    Age
5      A Straight  Align
6      B    Curve  Align
7      C     Hill  Align
8      1    Clear   Weat
9      2   Cloudy   Weat
10     3     Rain   Weat

I would like to match dat1 against the code data frame so that each code in my dataset is replaced by its description.

I am currently doing this with code that is not efficient for a large dataset with 500 columns and code tables for those columns.

2 answers:

Answer 0 (score: 1)

Try a simple for loop:

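varnames <- unique(dat2$Column)
dat3 <- dat1
for (i in varnames) {
    # columns to keep: everything except the one being recoded
    startvars <- names(dat3)[!names(dat3) %in% i]
    # join on the code, then keep the description in place of the code
    dat3 <- merge(dat3, subset(dat2, Column == i),
                  by.x = i, by.y = "Code")[, c(startvars, "Desc")]
    # rename the description column back to the original column name
    colnames(dat3)[names(dat3) %in% "Desc"] <- i
}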

Result:

    ID    Age    Align   Weat
1 8645  Young Straight  Clear
2 2115 Middle Straight  Clear
3 6228  Young    Curve  Clear
4 1844  Young    Curve  Clear
5 4461 Middle    Curve Cloudy
6 2119 Middle     Hill Cloudy
7 5830  Young Straight   Rain

This is obviously not super efficient, and a data.table solution with some dcast might be in order, but I'll leave that for someone else to think up.

PS: The first dataset had to be reformatted slightly by adding stringsAsFactors = FALSE, colClasses = rep("character", 4) to the read.table call.
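As a rough illustration of that note, the first dataset could be read with every column as character, so that its codes compare cleanly against the character Code column of dat2. This is only a sketch of what the answer describes, using the question's data:

```r
# Read every column as character so that codes such as 1, 2, 3 in Weat
# are treated as strings, like the Code column in dat2.
dat1 <- read.table(header = TRUE, stringsAsFactors = FALSE,
                   colClasses = rep("character", 4), text = "
ID Age Align Weat
8645 15-24 A 1
6228 15-24 B 1
5830 15-24 A 3
1844 25-34 B 1
4461 35-44 B 2
2119 35-44 C 2
2115 45-54 A 1
")
# Inspect the column types; all four should be character
str(dat1)
```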

Answer 1 (score: 1)

You can use a for loop over the variables in dat1:

# 'intersect' is needed to recode only those columns which have description
for (each_column in intersect(colnames(dat1), dat2$Column)){
    curr_dict = dat2$Column %in% each_column
    code = dat2$Code[curr_dict]
    descr = dat2$Desc[curr_dict]
    dat1[[each_column]] = descr[match(dat1[[each_column]], code)]
}
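
The same match-based idea can be wrapped in a small helper. The sketch below is one possible variant and not part of either answer: the function name recode_from_table and the choice to keep the original code when no description is found are assumptions made for illustration.

```r
# Hypothetical helper (the name is illustrative): recode every column of
# `data` that has entries in `codes`, using match() for the lookup.
recode_from_table <- function(data, codes) {
    for (col in intersect(colnames(data), codes$Column)) {
        dict <- codes[codes$Column == col, ]
        # compare as character so a numeric code such as 1 matches "1"
        idx <- match(as.character(data[[col]]), as.character(dict$Code))
        desc <- as.character(dict$Desc)[idx]
        # assumption: keep the original code where no description exists
        data[[col]] <- ifelse(is.na(desc), as.character(data[[col]]), desc)
    }
    data
}

dat1 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
ID Age Align Weat
8645 15-24 A 1
6228 15-24 B 1
5830 15-24 A 3
1844 25-34 B 1
4461 35-44 B 2
2119 35-44 C 2
2115 45-54 A 1
")
dat2 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Code Desc Column
15-24 Young Age
25-34 Young Age
35-44 Middle Age
45-54 Middle Age
A Straight Align
B Curve Align
C Hill Align
1 Clear Weat
2 Cloudy Weat
3 Rain Weat
")
recoded <- recode_from_table(dat1, dat2)
recoded
```

Unlike the merge approach, match keeps the rows of dat1 in their original order, and columns without an entry in the code table (here ID) are left untouched.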