我一直遇到这个问题,其中order()似乎工作不正常。此时,我认为这是由于数据类型的问题。即使在SQL中使用ORDER BY,也会出现类似的结果。请指教:
# read data from file
data <- read.csv("data/the_data.csv",
colClasses = "character")
# create a new data frame with rate converted to numeric
temp <- cbind(data$State, data$Hospital.Name,
as.numeric(
data$
Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure
))
# add column names to the new data frame
colnames(temp) <- c("state","hospital","rate")
# remove any cases that include NA values
d <- data.frame(temp[complete.cases(temp),])
# reduce to cases that are restricted to Alabama
d <- d[d$state == "AL",]
# order the dataframe by rate, break any ties using
# the alphabetical order of the hospital name
d <- d[order(d$rate,d$hospital),]
这是我的输出:
state hospital rate
21 AL ANDALUSIA REGIONAL HOSPITAL 10.1
14 AL JACKSON HOSPITAL & CLINIC INC 10.2
81 AL BIRMINGHAM VA MEDICAL CENTER 10.4
42 AL FLORALA MEMORIAL HOSPITAL 10.4
...
30 AL MEDICAL CENTER ENTERPRISE 12.9
61 AL TRINITY MEDICAL CENTER 12.9
69 AL MONROE COUNTY HOSPITAL 13
31 AL ST VINCENTS BLOUNT 13
...
8 AL DEKALB REGIONAL MEDICAL CENTER 16.6
15 AL GEORGE H. LANIER MEMORIAL HOSPITAL 8.8
79 AL EVERGREEN MEDICAL CENTER 9.1
80 AL BAPTIST MEDICAL CENTER EAST 9.6
38 AL LAWRENCE MEDICAL CENTER 9.9
我在'data.table','dplyr'和'sqldf'中尝试了相同的排序请求。他们都取得了类似的结果。排序从大约10开始,一直到大约16,然后它决定8.8小于16.6并重新开始。
你能告诉我为什么会这样吗?
编辑:提供有关数据的更多信息
dput(droplevels(head(d,20))
结果如下:
structure(list(state = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "AL", class = "factor"),
hospital = structure(c(1L, 10L, 19L, 4L, 7L, 14L, 3L, 12L,
15L, 20L, 5L, 8L, 11L, 13L, 6L, 18L, 17L, 9L, 2L, 16L), .Label = c("ANDALUSIA REGIONAL HOSPITAL",
"ATMORE COMMUNITY HOSPITAL", "BIRMINGHAM VA MEDICAL CENTER",
"FLORALA MEMORIAL HOSPITAL", "GADSDEN REGIONAL MEDICAL CENTER",
"GEORGIANA HOSPITAL", "GROVE HILL MEMORIAL HOSPITAL", "HALE COUNTY HOSPITAL",
"JACK HUGHSTON MEMORIAL HOSPITAL", "JACKSON HOSPITAL & CLINIC INC",
"MOBILE INFIRMARY", "PARKWAY MEDICAL CENTER", "RIVERVIEW REGIONAL MEDICAL CENTER",
"SPRINGHILL MEDICAL CENTER", "ST VINCENT'S BIRMINGHAM", "ST VINCENT'S EAST",
"ST VINCENT'S ST CLAIR", "WALKER BAPTIST MEDICAL CENTER",
"WEDOWEE HOSPITAL", "WIREGRASS MEDICAL CENTER"), class = "factor"),
rate = structure(c(1L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 5L, 5L,
6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 8L), .Label = c("10.1",
"10.2", "10.4", "10.5", "10.6", "10.7", "10.8", "10.9"), class = "factor")), .Names = c("state",
"hospital", "rate"), row.names = c(21L, 14L, 17L, 42L, 53L, 77L,
81L, 34L, 36L, 40L, 24L, 55L, 66L, 28L, 29L, 51L, 74L, 87L, 88L,
7L), class = "data.frame")
当我使用data.table读取数据时,'rate'仍然是一个因素而不是数字:
data <- read.table("data/outcome-of-care-measures.csv")
str(d)
结果:
'data.frame': 90 obs. of 3 variables:
$ state : Factor w/ 54 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
$ hospital: Factor w/ 3775 levels "ABBEVILLE AREA MEDICAL CENTER",..: 74 1435 3640 971 1150 3033 292 2418 3212 3742 ...
$ rate : Factor w/ 105 levels "10","10.1","10.2",..: 2 3 5 5 5 5 5 6 7 7 ...
参考this stackoverflow post。试过这个:
data <- read.csv("data/outcome-of-care-measures.csv", colClasses = "character")
f <- data$Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure
summary(f)
Length Class Mode
4706 character character
f <- as.numeric(levels(f))[f]
summary(f)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
NA NA NA NaN NA NA 4706
我达到的结论是因子变量无法转换为数字。因此,它无法订购。如果您不这么认为,请告诉我。我偏爱Ackbar将军,“这是一个陷阱!”
答案 0 :(得分:2)
您可以将d$rate
转换为numeric
列
d$rate <- as.numeric(as.character(d$rate)
d1 <- d[order(d$rate, d$hospital),]
我怀疑在colClasses=character
(未经测试)中使用read.csv
会导致这种情况发生。您可以使用colClasses=c('character', 'character', 'numeric')
例如,如果我使用example data
read.table
d <- read.table('the_data.csv', colClasses='character')
str(d)
#'data.frame': 13 obs. of 3 variables:
#$ state : chr "AL" "AL" "AL" "AL" ...
#$ hospital: chr "ANDALUSIA REGIONAL HOSPITAL" "JACKSON HOSPITAL & CLINIC INC" "BIRMINGHAM VA MEDICAL CENTER" "FLORALA MEMORIAL HOSPITAL" ...
# $ rate : chr "10.1" "10.2" "10.4" "10.4" ...
即使没有指定colClasses
,也可正确读取。如果您不想要factor
列,可以在stringsAsFactors=FALSE
read.table
d <- read.table('the_data.csv')
str(d)
#'data.frame': 13 obs. of 3 variables:
#$ state : Factor w/ 1 level "AL": 1 1 1 1 1 1 1 1 1 1 ...
#$ hospital: Factor w/ 13 levels "ANDALUSIA REGIONAL HOSPITAL",..: 1 8 3 6 10 13 11 12 4 7 ...
#$ rate : num 10.1 10.2 10.4 10.4 12.9 12.9 13 13 16.6 8.8 ...
d[order(d$rate, d$hospital),]$rate
#[1] 8.8 9.1 9.6 9.9 10.1 10.2 10.4 10.4 12.9 12.9 13.0 13.0 16.6
使用dput
数据集
d$rate <- as.numeric(as.character(d$rate))
str(d)
#'data.frame': 20 obs. of 3 variables:
#$ state : Factor w/ 1 level "AL": 1 1 1 1 1 1 1 1 1 1 ...
#$ hospital: Factor w/ 20 levels "ANDALUSIA REGIONAL HOSPITAL",..: 1 10 19 4 7 14 3 12 15 20 ...
#$ rate : num 10.1 10.2 10.4 10.4 10.4 10.4 10.4 10.5 10.6 10.6 ...
d[order(d$rate, d$hospital),]$rate
#[1] 10.1 10.2 10.4 10.4 10.4 10.4 10.4 10.5 10.6 10.6 10.7 10.7 10.7 10.8 10.8
#[16] 10.8 10.8 10.8 10.8 10.9