Cast not working on a data frame

Date: 2012-11-09 19:54:50

Tags: r reshape2

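I am working with a large data frame on which I want to perform a pivot-table style operation. I have been trying to use the reshape2 package, but for some reason my melted data frame is not being reshaped.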

I want to take a frame like this:

County      Industry   Type    Variable   Value 
LA          Plumbing     Tax       Rev       1000 
LA          Plumbing     No tax    Emp       100 
LA          Plumbing     Tax       Pay       500

and turn it into this (aggregated over Type):

        Plumbing       Tailors
County  Rev   Emp  Pay Rev   Emp  Pay
LA      1000  100  500 1000  50   65

I am running the following code:

dcast(m.data, county ~ variable + industry)

But it does not change the data frame at all. Where did I go wrong?

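Edit:

Here is more information about the problem. Before I get to the melted data frame, I do some fairly ugly cleanup to get the data where I want it. I know the code below is not ideal and really should be fixed, but basically I load several CSV files (with identical column names), combine them, recode some values, drop some columns, select a subset of the data, turn it into a melted frame, and then try to reshape it with dcast. I removed the code where I recode certain values, but that part seems to work fine. Here is some of the code: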

data1 <- read.table("census_data_r_1.csv",header=TRUE,sep=",",stringsAsFactors=FALSE) 
data2 <- read.table("census_data_r_2.csv",header=TRUE,sep=",", stringsAsFactors=FALSE)
fulldata <- rbind(data1,data2)
delete <- c("GEO.id","GEO.id2","NAICS.id","OPTAX.id","YEAR.id")
fulldata <- fulldata[, !(names(fulldata) %in% delete)]
colnames(fulldata) <- c("county","industry","tax_type","firms","revenue","payroll","num_employees","non_emp_firms","non_emp_firms_rev")
fulldata[c("firms","revenue","payroll","num_employees","non_emp_firms","non_emp_firms_rev")] <- recode.variables(fulldata[c("firms","revenue","payroll","num_employees","non_emp_firms","non_emp_firms_rev")],"'N' -> 'Nothing';'D' -> 'Withheld';'b' -> 20;'c' -> 100;'e' -> 250;'a' -> 10;'g' -> 1000;'f' -> 500;'Q' -> 'No Rev Collected';'h' -> 2500;'i' -> 5000;'j' -> 10000;'l' -> 50000;'k' -> 25000;'S' -> 'Bad Data';'m' -> 100000;")
fulldata.sub <- subset(fulldata, subset = (tax_type %in% c('Total', 'All establishments')) & (!(revenue %in% c('Nothing', 'Withheld','No Rev Collected'))) & (!(non_emp_firms %in% c('Nothing','Withheld'))))
m.data <- melt(fulldata.sub, id.vars = 1:3)
dcast(m.data, county ~ variable, sum)

Now I get the following error:

Error in structure(ordered, dim = ns) : dims [product 18300] do not match the length of object [0]

Output of dput(head(fulldata.sub,40)):

structure(list(county = c("Autauga County, Alabama", "Autauga County, Alabama", 
"Autauga County, Alabama", "Autauga County, Alabama", "Autauga County, Alabama", 
"Autauga County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Barbour County, Alabama", "Barbour County, Alabama"), industry = c("Rental and leasing services", 
"Professional, scientific, and technical services", "Professional, scientific, and technical services", 
"Accounting, tax preparation, bookkeeping, and payroll services", 
"Accounting, tax preparation, bookkeeping, and payroll services", 
"Architectural, engineering, and related services", "Real estate and rental and leasing", 
"Real estate", "Lessors of real estate", "Offices of real estate agents and brokers", 
"Offices of real estate agents and brokers", "Activities related to real estate", 
"Real estate property managers", "Offices of real estate appraisers", 
"Consumer goods rental", "Accounting, tax preparation, bookkeeping, and payroll services", 
"Accounting, tax preparation, bookkeeping, and payroll services", 
"Offices of certified public accountants", "Tax preparation services", 
"Architectural, engineering, and related services", "Architectural services", 
"Engineering services", "Specialized design services", "Computer systems design and related services", 
"Computer systems design and related services", "Management, scientific, and technical consulting services", 
"Advertising, public relations, and related services", "Veterinary services", 
"Administrative and support and waste management and remediation services", 
"Administrative and support services", "Employment services", 
"Business support services", "Investigation and security services", 
"Services to buildings and dwellings", "Exterminating and pest control services", 
"Janitorial services", "Landscaping services", "Waste management and remediation services", 
"Lessors of real estate", "Legal services"), tax_type = c("Total", 
"All establishments", "All establishments", "All establishments", 
"All establishments", "All establishments", "Total", "Total", 
"Total", "Total", "Total", "Total", "Total", "Total", "Total", 
"All establishments", "All establishments", "All establishments", 
"All establishments", "All establishments", "All establishments", 
"All establishments", "All establishments", "All establishments", 
"All establishments", "All establishments", "All establishments", 
"All establishments", "Total", "Total", "Total", "Total", "Total", 
"Total", "Total", "Total", "Total", "Total", "Total", "All establishments"
), firms = c("10", "61", "61", "14", "14", "10", "358", "312", 
"77", "161", "161", "74", "52", "16", "28", "79", "79", "36", 
"20", "77", "13", "37", "19", "27", "27", "63", "17", "26", "250", 
"238", "26", "14", "17", "157", "16", "29", "96", "12", "11", 
"19"), revenue = c("8433", "42285", "42285", "8581", "8581", 
"5571", "266692", "201777", "59742", "104768", "104768", "37267", 
"32141", "4615", "20691", "33203", "33203", "19805", "3160", 
"39318", "10494", "21167", "6833", "12391", "12391", "21496", 
"11097", "18388", "163661", "145935", "30746", "4048", "13849", 
"77076", "9934", "15832", "47411", "17726", "1585", "6439"), 
    payroll = c("1641", "15473", "15473", "3506", "3506", "2229", 
    "59476", "47937", "4053", "30180", "30180", "13704", "11902", 
    "1674", "4854", "17298", "17298", "9718", "1122", "15263", 
    "3688", "8649", "908", "4429", "4429", "7335", "2634", "6073", 
    "67526", "62354", "19529", "1002", "6824", "27688", "3181", 
    "8632", "14434", "5172", "265", "1431"), num_employees = c("56", 
    "386", "386", "127", "127", "41", "1987", "1643", "160", 
    "1030", "1030", "453", "406", "42", "217", "491", "491", 
    "217", "138", "356", "69", "204", "45", "111", "111", "165", 
    "101", "282", "2807", "2686", "806", "53", "399", "1241", 
    "110", "399", "675", "121", "23", "36"), non_emp_firms = c("8", 
    "330", "330", "49", "49", "35", "2358", "2289", "648", "840", 
    "840", "801", "186", "32", "19", "208", "208", "20", "40", 
    "203", "21", "74", "107", "99", "99", "356", "82", "10", 
    "1452", "1435", "25", "153", "61", "982", "12", "526", "350", 
    "17", "40", "16"), non_emp_firms_rev = c("882", "10111", 
    "10111", "493", "493", "1280", "164778", "160968", "55888", 
    "33321", "33321", "71759", "25870", "1504", "692", "2961", 
    "2961", "533", "466", "9220", "889", "5387", "4448", "3235", 
    "3235", "14395", "10337", "602", "35998", "33953", "708", 
    "3991", "806", "18726", "329", "6246", "9974", "2045", "1978", 
    "488")), .Names = c("county", "industry", "tax_type", "firms", 
"revenue", "payroll", "num_employees", "non_emp_firms", "non_emp_firms_rev"
), row.names = c(6L, 7L, 9L, 19L, 21L, 25L, 54L, 55L, 56L, 65L, 
66L, 70L, 71L, 74L, 77L, 99L, 101L, 103L, 105L, 109L, 111L, 115L, 
119L, 125L, 127L, 131L, 139L, 143L, 147L, 148L, 152L, 155L, 159L, 
162L, 163L, 165L, 167L, 169L, 174L, 180L), class = "data.frame")

Edit:

Another edit, including the output of str(fulldata.sub) and str(m.data):

  

'data.frame': 130098 obs. of  9 variables:
 $ county           : Factor w/ 3237 levels "Abbeville County, South Carolina",..: 121 121 121 121 121 121 121 121 131 131 ...
 $ industry         : Factor w/ 369 levels "Accounting, tax preparation, bookkeeping, and payroll services",..: 283 239 239 1 1 33 358 358 274 273 ...
 $ tax_type         : Factor w/ 4 levels "All establishments",..: 4 1 1 1 1 1 1 1 4 4 ...
 $ firms            : num  10 61 61 14 14 10 4 4 358 312 ...
 $ revenue          : num  31466 21347 21347 31717 31717 ...
 $ payroll          : num  5521 4863 4863 13729 13729 ...
 $ num_employees    : num  4664 3625 3625 571 571 ...
 $ non_emp_firms    : num  3122 1887 1887 2486 2486 ...
 $ non_emp_firms_rev: num  17550 96 96 12669 12669 ...
'data.frame': 780588 obs. of  5 variables:
 $ county  : Factor w/ 3237 levels "Abbeville County, South Carolina",..: 121 121 121 121 121 121 121 121 131 131 ...
 $ industry: Factor w/ 369 levels "Accounting, tax preparation, bookkeeping, and payroll services",..: 283 239 239 1 1 33 358 358 274 273 ...
 $ tax_type: Factor w/ 4 levels "All establishments",..: 4 1 1 1 1 1 1 1 4 4 ...
 $ variable: Factor w/ 6 levels "firms","revenue",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value   : num  10 61 61 14 14 10 4 4 358 312 ...

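1 answer:

Answer 0 (score: 2)

If you look at str(fulldata.sub), you will see that the numbers in columns 4 through 9 are treated as character strings. So when you melt(), the character strings get converted to factors. You are then trying to run sum() on a variable of type factor, which does not compute.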
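A quick way to confirm this diagnosis is to check the column classes directly. The sketch below uses a small made-up data frame (`toy`, with invented county names and values) as a stand-in for fulldata.sub; it only assumes base R.

```r
# Toy stand-in for fulldata.sub (hypothetical data): the numeric-looking
# columns are stored as character strings, as in the dput() output.
toy <- data.frame(
  county  = c("LA", "LA", "SF"),
  firms   = c("10", "61", "14"),
  revenue = c("8433", "42285", "8581"),
  stringsAsFactors = FALSE
)

# Report the class of every column.
col_classes <- sapply(toy, class)
print(col_classes)

# sum() rejects character input; capture the error instead of stopping.
sum_result <- tryCatch(sum(toy$revenue), error = function(e) "error")
print(sum_result)
```

If any measure column reports "character" or "factor", an aggregation function such as sum() cannot be applied to it until it is converted.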

You just need to convert to numeric, like this:

...    
fulldata.sub[4:9] <- sapply(fulldata.sub[4:9],as.numeric)
# Then run your melt/cast sequence
m.data <- melt(fulldata.sub, id.vars = 1:3)
dcast(m.data, county ~ variable, sum)
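For a self-contained check of the same sequence, here is a minimal sketch on made-up data. The data frame `toy` and its values are hypothetical, and the formula county ~ industry + variable is only one guess at the layout asked for in the question; lapply() is used here in place of sapply() so that each column is converted separately.

```r
library(reshape2)

# Hypothetical miniature of fulldata.sub with character measure columns.
toy <- data.frame(
  county   = c("LA", "LA", "SF"),
  industry = c("Plumbing", "Tailors", "Plumbing"),
  tax_type = c("Total", "Total", "Total"),
  revenue  = c("1000", "1000", "700"),
  payroll  = c("500", "65", "300"),
  stringsAsFactors = FALSE
)

# Convert the measure columns (4 and 5 here) from character to numeric.
toy[4:5] <- lapply(toy[4:5], as.numeric)

# Melt on the three id columns, then cast to one row per county,
# summing over tax_type within each industry/variable combination.
m.toy <- melt(toy, id.vars = 1:3)
wide <- dcast(m.toy, county ~ industry + variable, fun.aggregate = sum)
print(wide)
```

With numeric values in place, dcast() can apply sum() when several rows fall into the same cell.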

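Alternatively, fix your data import. This probably happens because the files contain strings such as "", "-", ",", "n/a", "na" or " ". With read.csv you can handle these by setting the na.strings=c("erroneous_string","other_erroneous_string",...) argument.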
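As a rough illustration of the import-side fix, the sketch below reads a small inline CSV rather than the real census files. The placeholder strings "n/a" and "-" are assumptions taken from the list above; which strings actually occur in the census files is not known here.

```r
# Hypothetical CSV content with two placeholder strings in the numeric columns.
csv_text <- "county,firms,revenue
LA,10,8433
SF,n/a,-
SD,14,8581"

# Declare the placeholders as missing values so the columns are read as numbers.
dat <- read.csv(text = csv_text,
                na.strings = c("n/a", "-"),
                stringsAsFactors = FALSE)
str(dat)

# Aggregations then work, as long as NA values are skipped explicitly.
total_revenue <- sum(dat$revenue, na.rm = TRUE)
print(total_revenue)
```

Any placeholder not listed in na.strings would still force the whole column to be read as character, so it is worth checking str() after the import.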