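Cleaning data in R by replacing a large number of values with a reduced set of values

Time: 2015-11-10 12:56:55

Tags: r data-cleansing

I am working with a dataset in which one field has many possible values, and I want to clean those values down to a reduced set. For example, an application is either approved or denied, but the outcome is recorded in the database using a variety of text strings. How can I clean the field so that I get tidy output?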

the_status <- c('2: approved (newer)',
                '5: approved (extended)',
                '3: denied (not appealed)',
                '14: denied (not appealed/withdrawn)',
                '20: approved',
                '21: denied',
                '24: not approved within 21 days',
                '28: not approved in 21 days')

data.frame(candidate_id = 1:8,
           status = the_status)

What I want:

data.frame(candidate_id = 1:8,
           status = c('approved', 'approved', 'denied',
                      'denied', 'approved', 'denied',
                      'denied', 'denied'))

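Note: the real dataset has about 100,000 rows, and the status field has about 30 distinct strings, which I would like to reduce to about 4 values.

4 answers:

Answer 0: (score: 3)

I would do it like this:

  1. Get the list of unique statuses with unique(the_status)
  2. Code them by hand: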

    code <- data.frame(orig_status=unique(the_status),
                       new_status=c("approved","denied",...)) 
    # You have to do this step manually
    
  3. Merge the datasets
  4. Example:

    set.seed(50)
    raw_data <- data.frame(orig_status=sample(the_status,replace=TRUE,100),
                           id=1:100)
    
    
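    # The labels below must be listed in the same order as
    # unique(raw_data$orig_status); print that first and check it, because
    # the sample drawn for a given seed can differ between R versions.
    code <- data.frame(orig_status=unique(raw_data$orig_status),
                       new_status=c('denied','denied',
                                    'approved','denied',
                                    'approved','approved',
                                    'denied','denied'))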
    
    code
    clean_data <- merge(raw_data,code)
    

Coding 30 unique values by hand is probably much faster than looking for a programmatic way.
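The hand-coded table can also be applied without merge(), which by default returns its rows sorted by the key column rather than in the original order. The following is a minimal sketch of that variation, assuming a named character vector as the lookup; the names lookup and new_status are illustrative, and the labels follow the desired output given in the question.

```r
the_status <- c('2: approved (newer)',
                '5: approved (extended)',
                '3: denied (not appealed)',
                '14: denied (not appealed/withdrawn)',
                '20: approved',
                '21: denied',
                '24: not approved within 21 days',
                '28: not approved in 21 days')

set.seed(50)
raw_data <- data.frame(id = 1:100,
                       orig_status = sample(the_status, 100, replace = TRUE),
                       stringsAsFactors = FALSE)

# Hand-coded lookup: names are the original strings, values the reduced set.
# The labels are listed in the same order as the_status.
lookup <- setNames(c('approved', 'approved', 'denied', 'denied',
                     'approved', 'denied', 'denied', 'denied'),
                   the_status)

# Index the lookup by the original strings; the row order is left untouched.
raw_data$new_status <- unname(lookup[raw_data$orig_status])

# Stop with an error if any original string is missing from the lookup.
stopifnot(!anyNA(raw_data$new_status))

table(raw_data$new_status)
```

Keeping orig_status as character (stringsAsFactors = FALSE) matters here: indexing a named vector with a factor would use the factor's integer codes instead of the strings.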

Answer 1: (score: 1)

We can change 'not approved' to 'denied' and then use sub to extract the status.

df1$status <-  sub('[^:]+\\:\\s*(\\S+).*', '\\1', 
                sub('not approved', 'denied', df1$status))
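To make the two substitutions easier to follow, here is a sketch that applies them one step at a time to the sample strings from the question. It assumes df1 is the question's data frame with status stored as character; step1 is a name introduced only for illustration, and the pattern is written without the escaped colon, which should match the same strings.

```r
the_status <- c('2: approved (newer)',
                '5: approved (extended)',
                '3: denied (not appealed)',
                '14: denied (not appealed/withdrawn)',
                '20: approved',
                '21: denied',
                '24: not approved within 21 days',
                '28: not approved in 21 days')

df1 <- data.frame(candidate_id = 1:8,
                  status = the_status,
                  stringsAsFactors = FALSE)

# Step 1: fold the "not approved" variants into "denied".
step1 <- sub('not approved', 'denied', df1$status, fixed = TRUE)

# Step 2: drop everything up to the colon and keep only the first word after it.
df1$status <- sub('^[^:]+:\\s*(\\S+).*$', '\\1', step1)

df1$status
```

On these eight strings the result matches the desired output in the question. With the roughly 30 strings of the real data, it is worth running unique(df1$status) afterwards to spot any string the pattern did not reduce as intended.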

Answer 2: (score: 1)

You can do this with merge():

d <- data.frame(candidate_id = 1:8, status = the_status)
red.tab <- data.frame(candidate_id = 1:8,
           status = c('approved', 'approved', 'denied',
                      'denied', 'approved', 'denied',
                      'denied', 'denied'))
merge(d, red.tab, by="candidate_id")
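Because both data frames have a column named status, merge() renames them in the result (status.x and status.y by default). The sketch below uses the suffixes argument to give them more descriptive names; the suffix strings and the name merged are illustrative choices, not part of the answer.

```r
the_status <- c('2: approved (newer)',
                '5: approved (extended)',
                '3: denied (not appealed)',
                '14: denied (not appealed/withdrawn)',
                '20: approved',
                '21: denied',
                '24: not approved within 21 days',
                '28: not approved in 21 days')

d <- data.frame(candidate_id = 1:8, status = the_status,
                stringsAsFactors = FALSE)
red.tab <- data.frame(candidate_id = 1:8,
                      status = c('approved', 'approved', 'denied',
                                 'denied', 'approved', 'denied',
                                 'denied', 'denied'),
                      stringsAsFactors = FALSE)

# Join on candidate_id and label the two status columns explicitly.
merged <- merge(d, red.tab, by = "candidate_id",
                suffixes = c("_raw", "_clean"))

names(merged)
```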

Answer 3: (score: 0)

Here is my solution.

x = sapply(the_status, function(t){ a = unlist(strsplit(t, ": ")); 
                                    b = unlist(strsplit(a[2], " \\("));
                                    c(a[1],b[1]) })

This splits each string piece by piece using sapply, strsplit, and unlist.

>t(x)
                                    [,1] [,2]                         
2: approved (newer)                 "2"  "approved"                   
5: approved (extended)              "5"  "approved"                   
3: denied (not appealed)            "3"  "denied"                     
14: denied (not appealed/withdrawn) "14" "denied"                     
20: approved                        "20" "approved"                   
21: denied                          "21" "denied"                     
24: not approved within 21 days     "24" "not approved within 21 days"
28: not approved in 21 days         "28" "not approved in 21 days"

t(x) returns a matrix.

df = data.frame(t(x))
rownames(df) = NULL
colnames(df) = c("candidate_id", "status")

This converts it to a data.frame and sets the column names.

df
  candidate_id                      status
1            2                    approved
2            5                    approved
3            3                      denied
4           14                      denied
5           20                    approved
6           21                      denied
7           24 not approved within 21 days
8           28     not approved in 21 days

This gives the result shown above.

df$candidate_id = 1:nrow(df)

If you do not want the original IDs, you can replace them as above, or use the row names:

df$candidate_id = rownames(df)
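The split leaves the two "not approved ..." strings unreduced. One possible way to finish the job is sketched below, under the assumption that every status not starting with "approved" should count as denied. That rule holds for the eight sample strings but would need checking against the full set of about 30. The parameter name s is used instead of t only to avoid shadowing the t() function.

```r
the_status <- c('2: approved (newer)',
                '5: approved (extended)',
                '3: denied (not appealed)',
                '14: denied (not appealed/withdrawn)',
                '20: approved',
                '21: denied',
                '24: not approved within 21 days',
                '28: not approved in 21 days')

# Same split as in the answer: the code before ": ", and the text after it
# up to an opening parenthesis.
x <- sapply(the_status, function(s) {
  a <- unlist(strsplit(s, ": "))
  b <- unlist(strsplit(a[2], " \\("))
  c(a[1], b[1])
})

df <- data.frame(t(x), stringsAsFactors = FALSE)
rownames(df) <- NULL
colnames(df) <- c("candidate_id", "status")

# Assumption: anything that does not start with "approved" is treated as denied.
df$status <- ifelse(grepl("^approved", df$status), "approved", "denied")

df$status
```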
