R中美元价值和百分比的数据清理

时间:2013-02-21 23:39:10

标签: r data-cleansing

我一直在寻找R中的一些软件包来帮助我将美元值转换为漂亮的数值。我似乎无法找到一个(例如在plyr包中)。我正在寻找的基本要素是简单地删除$符号以及分别为百万和数千翻译“M”和“K”。

要复制,我可以使用以下代码:

require(XML)
theurl <- "http://www.kickstarter.com/help/stats"
html <- htmlParse(theurl)

allProjects <- readHTMLTable(html)[[1]]
names(allProjects) <-  c("Category","LaunchedProjects","TotalDollars","SuccessfulDollars","UnsuccessfulDollars","LiveDollars","LiveProjects","SuccessRate")

数据如下所示:

> tail(allProjects)
      Category LaunchedProjects TotalDollars SuccessfulDollars UnsuccessfulDollars LiveDollars
8         Food            3,069     $16.79 M          $13.18 M             $2.78 M   $822.64 K
9      Theater            4,155     $13.45 M          $12.01 M             $1.22 M   $217.86 K
10      Comics            2,242     $12.88 M          $11.07 M           $941.31 K   $862.18 K
11     Fashion            2,799      $9.62 M           $7.59 M             $1.44 M   $585.98 K
12 Photography            2,794      $6.76 M           $5.48 M             $1.06 M   $220.75 K
13       Dance            1,185      $3.43 M           $3.13 M           $225.82 K     $71,322
   LiveProjects SuccessRate
8           189      39.27%
9           111      64.09%
10          134      46.11%
11          204      27.24%
12           83      36.81%
13           40      70.22%

我最终编写了自己的函数:

dollarToNumber <- function(vectorInput) {
  result <- c()
  for (dollarValue in vectorInput) {
    if (is.factor(dollarValue)) {  
      dollarValue = levels(dollarValue)
    }
    dollarValue <- gsub("(\\$|,)","",dollarValue)
    if(grepl(" K",dollarValue)) {
      dollarValue <- as.numeric(gsub(" K","",dollarValue)) * 1000
    } else if (grepl(" M",dollarValue)) {
      dollarValue <- as.numeric(gsub(" M","",dollarValue)) * 1000000
    }  
    if (!is.numeric(dollarValue)) {
      dollarValue <- as.numeric(dollarValue)
    }
    result <- append(result,dollarValue)
  }
    result
}

然后我用它来得到我想要的东西:

 allProjects <- transform(allProjects,
                          LaunchedProjects = as.numeric(gsub(",","",levels(LaunchedProjects))),
                          TotalDollars = dollarToNumber(TotalDollars),
                          SuccessfulDollars = dollarToNumber(SuccessfulDollars),
                          UnsuccessfulDollars = dollarToNumber(UnsuccessfulDollars),
                          LiveDollars = dollarToNumber(LiveDollars),
                          LiveProjects = as.numeric(LiveProjects),
                          SuccessRate = as.numeric(gsub("%","",SuccessRate))/100)

下面会给我这个结果:

> str(allProjects)
'data.frame':   13 obs. of  8 variables:
 $ Category           : Factor w/ 13 levels "Art","Comics",..: 6 8 4 9 12 11 1 7 13 2 ...
 $ LaunchedProjects   : num  10006 1185 1860 20025 2242 ...
 $ TotalDollars       : num  1.11e+08 9.68e+07 6.89e+07 6.66e+07 4.31e+07 ...
 $ SuccessfulDollars  : num  90990000 84960000 59020000 59390000 34910000 ...
 $ UnsuccessfulDollars: num  16640000 7900000 6830000 5480000 3700000 ...
 $ LiveDollars        : num  3090000 3970000 3010000 1750000 4470000 ...
 $ LiveProjects       : num  13 7 6 11 3 10 8 4 1 2 ...
 $ SuccessRate        : num  0.394 0.338 0.382 0.541 0.334 ...

我是R的新手,我觉得我写的代码是如此丑陋,当然有更好的方法来做到这一点而不重新发明轮子?我已经使用了apply,aaply,ddply函数但没有成功(我试图不使用for循环......)。最重要的是,在处理SuccessRate列时,我在R中找不到类似as.percentage函数的内容。我缺少什么?

任何指导都将不胜感激!

2 个答案:

答案 0 :(得分:3)

使R与您可能习惯使用的其他语言不同的一点是,以“矢量化”方式执行操作更好,一次操作整个矢量而不是循环遍历每个单独的值。因此,可以在没有dollarToNumber循环的情况下重写for函数:

dollarToNumber_vectorised <- function(vector) {
  # Want the vector as character rather than factor while
  # we're doing text processing operations
  vector <- as.character(vector)
  vector <- gsub("(\\$|,)","", vector)
  # Create a numeric vector to store the results in, this will give you
  # warning messages about NA values being introduced because the " K" values
  # can't be converted directly to numeric
  result <- as.numeric(vector)
  # Find all the "$N K" values, and modify the result at those positions
  k_positions <- grep(" K", vector)
  result[k_positions] <- as.numeric(gsub(" K","", vector[k_positions])) * 1000
  # Same for the "$ M" value
  m_positions <- grep(" M", vector)
  result[m_positions] <- as.numeric(gsub(" M","", vector[m_positions])) * 1000000
  return(result)
}

它仍然提供与原始功能相同的输出:

> dollarToNumber_vectorised(allProjects$LiveDollars)
 [1] 3100000 3970000 3020000 1760000 4510000  762650  510860  823370  218590  865940
[11]  587670  221110   71934
# Don't worry too much about this warning
Warning message:
In dollarToNumber_vectorised(allProjects$LiveDollars) :
  NAs introduced by coercion
> dollarToNumber(allProjects$LiveDollars)
 [1] 3100000 3970000 3020000 1760000 4510000  762650  510860  823370  218590  865940
[11]  587670  221110   71934

答案 1 :(得分:2)

使用parseeval的解决方案:

ToNumber <- function(X)
{
  A <- gsub("%","*1e-2",gsub("K","*1e+3",gsub("M","*1e+6",gsub("\\$|,","",as.character(X)),fixed=TRUE),fixed=TRUE),fixed=TRUE)
  B <- try(sapply(A,function(a){eval(parse(text=a))}),silent=TRUE)
  if (is.numeric(B)) return (as.numeric(B)) else return(X)
}

#----------------------------------------------------------------------
# Example:
X <-
  read.table( header=TRUE,
              text = 
   'Category LaunchedProjects TotalDollars SuccessfulDollars UnsuccessfulDollars LiveDollars  LiveProjects SuccessRate
        Food            3,069    "$16.79 M"         "$13.18 M"            "$2.78 M"  "$822.64 K" 189      39.27%
     Theater            4,155    "$13.45 M"         "$12.01 M"            "$1.22 M"  "$217.86 K" 111      64.09%
      Comics            2,242    "$12.88 M"         "$11.07 M"          "$941.31 K"  "$862.18 K" 134      46.11%
     Fashion            2,799     "$9.62 M"          "$7.59 M"            "$1.44 M"  "$585.98 K" 204      27.24%
 Photography            2,794     "$6.76 M"          "$5.48 M"            "$1.06 M"  "$220.75 K"  83      36.81%
       Dance            1,185     "$3.43 M"          "$3.13 M"          "$225.82 K"    "$71,322"  40      70.22%' )

numX <- as.data.frame(lapply(as.list(X),ToNumber))

options(width=1000)
print(numX,row.names=FALSE)

#    Category LaunchedProjects TotalDollars SuccessfulDollars UnsuccessfulDollars LiveDollars LiveProjects SuccessRate
#        Food             3069     16790000          13180000             2780000      822640          189      0.3927
#     Theater             4155     13450000          12010000             1220000      217860          111      0.6409
#      Comics             2242     12880000          11070000              941310      862180          134      0.4611
#     Fashion             2799      9620000           7590000             1440000      585980          204      0.2724
# Photography             2794      6760000           5480000             1060000      220750           83      0.3681
#       Dance             1185      3430000           3130000              225820       71322           40      0.7022