order() function doesn't seem to work - possibly due to variable type

Date: 2014-11-30 11:52:15

Tags: r

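I keep running into a problem where order() doesn't seem to work correctly. At this point I think it is a data type issue. I get similar results even when using ORDER BY in SQL. Please advise: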

# read data from file
data <- read.csv("data/the_data.csv",
                colClasses = "character")

# create a new data frame with rate converted to numeric
temp <- cbind(data$State, data$Hospital.Name,
    as.numeric(
      data$
      Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure
      ))

# add column names to the new data frame
colnames(temp) <- c("state","hospital","rate")

# remove any cases that include NA values
d <- data.frame(temp[complete.cases(temp),])

# reduce to cases that are restricted to Alabama
d <- d[d$state == "AL",]

# order the dataframe by rate, break any ties using
# the alphabetical order of the hospital name
d <- d[order(d$rate,d$hospital),]

Here is my output:

state                                          hospital rate
21    AL                       ANDALUSIA REGIONAL HOSPITAL 10.1
14    AL                     JACKSON HOSPITAL & CLINIC INC 10.2
81    AL                      BIRMINGHAM VA MEDICAL CENTER 10.4
42    AL                         FLORALA MEMORIAL HOSPITAL 10.4
...
30    AL                         MEDICAL CENTER ENTERPRISE 12.9
61    AL                            TRINITY MEDICAL CENTER 12.9
69    AL                            MONROE COUNTY HOSPITAL   13
31    AL                                ST VINCENTS BLOUNT   13
...
8     AL                    DEKALB REGIONAL MEDICAL CENTER 16.6
15    AL                GEORGE H. LANIER MEMORIAL HOSPITAL  8.8
79    AL                          EVERGREEN MEDICAL CENTER  9.1
80    AL                       BAPTIST MEDICAL CENTER EAST  9.6
38    AL                           LAWRENCE MEDICAL CENTER  9.9

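I tried the same sort in 'data.table', 'dplyr', and 'sqldf'. They all gave similar results. The ordering starts at about 10 and runs up to about 16, then 8.8 appears after 16.6 and the sequence starts over.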

Can you tell me why this happens?

Edit: more information about the data

dput(droplevels(head(d,20)))

The result is:

structure(list(state = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "AL", class = "factor"), 
hospital = structure(c(1L, 10L, 19L, 4L, 7L, 14L, 3L, 12L, 
15L, 20L, 5L, 8L, 11L, 13L, 6L, 18L, 17L, 9L, 2L, 16L), .Label = c("ANDALUSIA REGIONAL    HOSPITAL", 
"ATMORE COMMUNITY HOSPITAL", "BIRMINGHAM VA MEDICAL CENTER", 
"FLORALA MEMORIAL HOSPITAL", "GADSDEN REGIONAL MEDICAL CENTER", 
"GEORGIANA HOSPITAL", "GROVE HILL MEMORIAL HOSPITAL", "HALE COUNTY HOSPITAL", 
"JACK HUGHSTON MEMORIAL HOSPITAL", "JACKSON HOSPITAL & CLINIC INC", 
"MOBILE INFIRMARY", "PARKWAY MEDICAL CENTER", "RIVERVIEW REGIONAL MEDICAL CENTER", 
"SPRINGHILL MEDICAL CENTER", "ST VINCENT'S BIRMINGHAM", "ST VINCENT'S EAST", 
"ST VINCENT'S ST CLAIR", "WALKER BAPTIST MEDICAL CENTER", 
"WEDOWEE HOSPITAL", "WIREGRASS MEDICAL CENTER"), class = "factor"), 
rate = structure(c(1L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 5L, 5L, 
6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 8L), .Label = c("10.1", 
"10.2", "10.4", "10.5", "10.6", "10.7", "10.8", "10.9"), class = "factor")), .Names =     c("state", 
"hospital", "rate"), row.names = c(21L, 14L, 17L, 42L, 53L, 77L, 
81L, 34L, 36L, 40L, 24L, 55L, 66L, 28L, 29L, 51L, 74L, 87L, 88L, 
7L), class = "data.frame")

When I read the data with read.table, 'rate' is still a factor rather than numeric:

data <- read.table("data/outcome-of-care-measures.csv")
str(d)

Result:

    'data.frame':   90 obs. of  3 variables:
 $ state   : Factor w/ 54 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ hospital: Factor w/ 3775 levels "ABBEVILLE AREA MEDICAL CENTER",..: 74 1435 3640 971 1150 3033 292 2418 3212 3742 ...
 $ rate    : Factor w/ 105 levels "10","10.1","10.2",..: 2 3 5 5 5 5 5 6 7 7 ...

Following this Stack Overflow post, I tried this:

data <- read.csv("data/outcome-of-care-measures.csv", colClasses = "character")
f <- data$Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure
summary(f)

Length     Class      Mode 
 4706 character character

f <- as.numeric(levels(f))[f]
summary(f)

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 NA      NA      NA     NaN      NA      NA    4706 

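The conclusion I have reached is that a factor variable cannot be converted to numeric, and therefore cannot be ordered. If you think otherwise, please let me know. I'm partial to General Ackbar: "It's a trap!"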

1 Answer:

Answer 0 (score: 2)

You can convert d$rate to numeric:

d$rate <- as.numeric(as.character(d$rate))
d1 <- d[order(d$rate, d$hospital),]
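The detour through as.character() matters because of how factors are stored. The sketch below uses a handful of made-up rate values (not the real file) to show the three behaviours involved: text sorts character by character, as.numeric() on a factor returns its internal integer codes, and as.numeric(as.character()) recovers the actual numbers.

```r
# Toy values standing in for the 'rate' column (made up for illustration)
rate_chr <- c("10.1", "16.6", "8.8", "9.1", "13")

# Strings are compared character by character, so "8.8" sorts after "16.6"
sort(rate_chr)

# A factor stores integer codes that point at its (text) levels
rate_fac <- factor(rate_chr)

# as.numeric() on a factor returns those codes, not the values
codes <- as.numeric(rate_fac)

# Converting to character first recovers the real numbers
rate_num <- as.numeric(as.character(rate_fac))
sort(rate_num)
```

So a factor can be converted to numeric; it just needs the as.character() step first.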

I suspect (untested) that using colClasses="character" in read.csv is what causes this. You could use colClasses=c('character', 'character', 'numeric') instead.
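Another likely contributor, based on general base R behaviour rather than on a test against the original file, is the cbind() call in the question. cbind() on plain vectors builds a matrix, and a matrix holds a single type, so the freshly converted numbers are turned back into text. The sketch below uses a made-up three-row data frame; the column name Rate and the "Not Available" entry are assumptions for illustration only.

```r
# Made-up stand-in for the CSV, with every column read as character
data <- data.frame(State = c("AL", "AL", "AL"),
                   Hospital.Name = c("A HOSPITAL", "B HOSPITAL", "C HOSPITAL"),
                   Rate = c("10.1", "8.8", "Not Available"),  # hypothetical column
                   stringsAsFactors = FALSE)

# cbind() of vectors gives a character matrix: the numeric column becomes text again
m <- cbind(data$State, data$Hospital.Name,
           suppressWarnings(as.numeric(data$Rate)))
typeof(m)

# data.frame() keeps each column's own type, so rate stays numeric
d <- data.frame(state = data$State,
                hospital = data$Hospital.Name,
                rate = suppressWarnings(as.numeric(data$Rate)),
                stringsAsFactors = FALSE)

# Drop rows with NA, then order by rate with hospital name as tie-breaker
d <- d[complete.cases(d), ]
d <- d[order(d$rate, d$hospital), ]
```

Building the result with data.frame() instead of cbind() avoids the need to convert rate a second time.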

For example, if I read the example data with read.table:
 d <- read.table('the_data.csv', colClasses='character')
 str(d)
 #'data.frame': 13 obs. of  3 variables:
 #$ state   : chr  "AL" "AL" "AL" "AL" ...
 #$ hospital: chr  "ANDALUSIA REGIONAL HOSPITAL" "JACKSON HOSPITAL & CLINIC INC" "BIRMINGHAM VA MEDICAL CENTER" "FLORALA MEMORIAL HOSPITAL" ...
 #$ rate    : chr  "10.1" "10.2" "10.4" "10.4" ...

Even without specifying colClasses, the data is read correctly (if you don't want factor columns, you can pass stringsAsFactors=FALSE to read.table):
 d <- read.table('the_data.csv')
 str(d)
 #'data.frame': 13 obs. of  3 variables:
 #$ state   : Factor w/ 1 level "AL": 1 1 1 1 1 1 1 1 1 1 ...
 #$ hospital: Factor w/ 13 levels "ANDALUSIA REGIONAL HOSPITAL",..: 1 8 3 6 10 13 11 12 4 7 ...
 #$ rate    : num  10.1 10.2 10.4 10.4 12.9 12.9 13 13 16.6 8.8 ...

 d[order(d$rate, d$hospital),]$rate
 #[1]  8.8  9.1  9.6  9.9 10.1 10.2 10.4 10.4 12.9 12.9 13.0 13.0 16.6
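As a small self-contained illustration of the stringsAsFactors=FALSE option, the sketch below reads a made-up three-row CSV from an inline string (the text argument of read.csv) instead of a file. The hospital names and values are invented.

```r
# Inline CSV used in place of a file (made-up rows)
csv_text <- "state,hospital,rate
AL,A HOSPITAL,10.1
AL,B HOSPITAL,8.8
AL,C HOSPITAL,13"

# Text columns stay character; rate is detected as numeric
d <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(d)

# Ordering now follows numeric value, not text
sorted <- d[order(d$rate, d$hospital), ]
sorted$rate
```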

Update

Using the dput dataset:

 d$rate <- as.numeric(as.character(d$rate))
 str(d)
 #'data.frame': 20 obs. of  3 variables:
 #$ state   : Factor w/ 1 level "AL": 1 1 1 1 1 1 1 1 1 1 ...
 #$ hospital: Factor w/ 20 levels "ANDALUSIA REGIONAL    HOSPITAL",..: 1 10 19 4 7 14 3 12 15 20 ...
 #$ rate    : num  10.1 10.2 10.4 10.4 10.4 10.4 10.4 10.5 10.6 10.6 ...

 d[order(d$rate, d$hospital),]$rate
 #[1] 10.1 10.2 10.4 10.4 10.4 10.4 10.4 10.5 10.6 10.6 10.7 10.7 10.7 10.8 10.8
 #[16] 10.8 10.8 10.8 10.8 10.9
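As for the all-NA summary in the question: as.numeric(levels(f))[f] is an idiom for factors, but there f was a character vector (read with colClasses = "character"). A character vector has no levels, so the expression indexes an empty vector and every result is NA. A minimal sketch with made-up values:

```r
# Made-up character vector, as produced by colClasses = "character"
f <- c("10.1", "8.8", "13")

# A character vector has no levels
is.null(levels(f))

# as.numeric(NULL) is an empty vector; indexing it by name yields NA for each element
bad <- as.numeric(levels(f))[f]

# For a character vector, a direct conversion is enough
good <- as.numeric(f)
sort(good)
```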