Below is a snippet of data from a CSV file. It lists the cities John travelled to and the number of hours he stayed in each.
sno City hours stayed
1 London 5
2 London 4
3 Dubai 2
4 Mumbai 8
5 Sydney 16
6 Sydney 16
7 Dubai 2
8 London 8
9 London 9
10 Paris 17
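For reproducibility, the snippet above can be entered directly as a data frame. This is only a sketch: the object name `df1` and the column name `hours_stayed` are assumptions chosen to match the first answer (the base R answer uses `df` and `hoursstayed` instead).

```r
# Recreate the sample data shown above.
# Assumption: the third column is called hours_stayed.
df1 <- data.frame(
  sno = 1:10,
  City = c("London", "London", "Dubai", "Mumbai", "Sydney",
           "Sydney", "Dubai", "London", "London", "Paris"),
  hours_stayed = c(5, 4, 2, 8, 16, 16, 2, 8, 9, 17),
  stringsAsFactors = FALSE
)

# Quick structural check of the result
str(df1)
```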
I need help computing the following:
Answer 0 (score: 1)
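We can use dplyr to compute a summarise output grouped by "City", and then extract from that output the "City" with the maximum value in each column. There are probably many ways to do this, but dplyr is among the easiest to understand. In addition, for large datasets, using dplyr/data.table is efficient.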
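library(dplyr)
# Per-city summary: visit count, total, maximum and mean hours
res <- df1 %>%
  group_by(City) %>%
  summarise(n = n(),
            totalHours = sum(hours_stayed),
            maxHours = max(hours_stayed),
            meanHours = mean(hours_stayed))
# For each summary column, pick the City holding the maximum.
# across() replaces the deprecated summarise_each()/funs() (dplyr >= 1.0).
res %>%
  summarise(across(-City, ~ City[which.max(.x)]))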
# n totalHours maxHours meanHours
# <chr> <chr> <chr> <chr>
#1 London Sydney Paris Paris
The average number of hours per city can be obtained from 'res' itself
res %>%
select(City, meanHours)
# City meanHours
# <chr> <dbl>
#1 Dubai 2.0
#2 London 6.5
#3 Mumbai 8.0
#4 Paris 17.0
#5 Sydney 16.0
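Notes:
1) If there is a tie for the maximum number of visits (or for any other measure), the first maximum is selected.
2) We can do everything in a single pipeline instead of calling functions again and again.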
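As a sketch of note 2, the two steps can be chained into one pipeline. The data frame below is re-entered from the question so the example stands alone; `top_cities` is a name introduced here for illustration, and `across()` assumes dplyr 1.0 or later.

```r
library(dplyr)

# Sample data from the question (column name hours_stayed is an assumption)
df1 <- data.frame(
  City = c("London", "London", "Dubai", "Mumbai", "Sydney",
           "Sydney", "Dubai", "London", "London", "Paris"),
  hours_stayed = c(5, 4, 2, 8, 16, 16, 2, 8, 9, 17),
  stringsAsFactors = FALSE
)

# Summarise per city, then pick the city with the maximum of each measure
top_cities <- df1 %>%
  group_by(City) %>%
  summarise(n = n(),
            totalHours = sum(hours_stayed),
            maxHours = max(hours_stayed),
            meanHours = mean(hours_stayed)) %>%
  summarise(across(-City, ~ City[which.max(.x)]))

top_cities
```

Because `which.max()` returns only the first maximum, a tie would silently resolve to whichever city sorts first.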
Another efficient option is data.table
library(data.table)
# Same per-city summary; note the closing parenthesis after sum(hours_stayed)
res2 <- setDT(df1)[, .(n = .N,
                       totalHours = sum(hours_stayed),
                       maxHours = max(hours_stayed),
                       meanHours = mean(hours_stayed)),
                   by = City]
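The data.table summary can be reduced to the top city per measure in the same spirit as the dplyr version. This is a hedged sketch rather than part of the original answer: `dt`, `measures` and `top` are illustrative names, and the data is re-entered so the block runs on its own.

```r
library(data.table)

# Sample data from the question (column name hours_stayed is an assumption)
dt <- data.table(
  City = c("London", "London", "Dubai", "Mumbai", "Sydney",
           "Sydney", "Dubai", "London", "London", "Paris"),
  hours_stayed = c(5, 4, 2, 8, 16, 16, 2, 8, 9, 17)
)

# Per-city summary
res2 <- dt[, .(n = .N,
               totalHours = sum(hours_stayed),
               maxHours = max(hours_stayed),
               meanHours = mean(hours_stayed)),
           by = City]

# For each measure column, return the City holding the (first) maximum
measures <- c("n", "totalHours", "maxHours", "meanHours")
top <- res2[, lapply(.SD, function(x) City[which.max(x)]), .SDcols = measures]

top
```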
Answer 1 (score: 1)
These questions can all be answered in one go; the solutions are quite simple and can be done in base R.
#Name of most visited city by john (by number of visits)
which.max(table(df$City))
#London
# 2
# (the 2 is London's position in the alphabetically sorted table, not the visit count)
#Name of City where he stayed for longest (cumulative stay) hour
aggdata = aggregate(hoursstayed ~ City, df, sum)
aggdata[which.max(aggdata$hoursstayed), ]
# City hoursstayed
#5 Sydney 32
#Name of city where he stayed for longest time in a single visit ,
# how many hours and which city
df[which.max(df$hoursstayed), ]
# sno City hoursstayed
#10 10 Paris 17
#average number of hours in each of the city (cumulative hours)
aggregate(hoursstayed ~ City, df, mean)
# City hoursstayed
#1 Dubai 2.0
#2 London 6.5
#3 Mumbai 8.0
#4 Paris 17.0
#5 Sydney 16.0
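One caveat with the `which.max()` calls above: it returns only the first maximum, so ties are dropped. A minimal sketch of a tie-aware alternative is to compare against `max()` directly. The vector `x` is a hypothetical example invented to show a tie; the data frame is re-entered from the question using this answer's column name `hoursstayed`.

```r
# Hypothetical vector with a tie for the maximum
x <- c(a = 3, b = 7, c = 7)
which.max(x)            # reports only "b", the first maximum
names(x)[x == max(x)]   # reports both "b" and "c"

# Sample data from the question
df <- data.frame(
  sno = 1:10,
  City = c("London", "London", "Dubai", "Mumbai", "Sydney",
           "Sydney", "Dubai", "London", "London", "Paris"),
  hoursstayed = c(5, 4, 2, 8, 16, 16, 2, 8, 9, 17),
  stringsAsFactors = FALSE
)

# All cities sharing the highest visit count
counts <- table(df$City)
most_visited <- names(counts)[as.vector(counts) == max(counts)]

# All visits sharing the longest single stay
longest <- df[df$hoursstayed == max(df$hoursstayed), ]
```

On the sample data there are no ties, so both approaches give the same cities.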
Answer 2 (score: 0)
library(dplyr)
df <- as_tibble(df)  # tbl_df() is deprecated in current dplyr
I. Name of the city John visited most (by number of visits)
df %>%
  select(City) %>%
  table() %>%
  sort(decreasing = TRUE)
# London Dubai Sydney Mumbai Paris
# 4 2 2 1 1
# 2nd alternative
df %>%
group_by(City) %>%
summarise(n=n()) %>%
arrange(desc(n))
# Source: local data frame [5 x 2]
# City n
# (fctr) (int)
# 1 London 4
# 2 Dubai 2
# 3 Sydney 2
# 4 Mumbai 1
# 5 Paris 1
II. Name of the city where he stayed longest (cumulative stay) in hours
df %>%
group_by(City) %>%
mutate(cumsum(hours_stayed)) %>%
arrange(City)
# Source: local data frame [10 x 4]
# Groups: City [5]
# sno City hours_stayed cumsum(hours_stayed)
# (int) (fctr) (int) (int)
# 1 3 Dubai 2 2
# 2 7 Dubai 2 4
# 3 1 London 5 5
# 4 2 London 4 9
# 5 8 London 8 17
# 6 9 London 9 26
# 7 4 Mumbai 8 8
# 8 10 Paris 17 17
# 9 5 Sydney 16 16
# 10 6 Sydney 16 32
df %>%
  group_by(City) %>%
  # sum(hours_stayed) is the per-city total; summing the running cumsum would overcount
  summarise(totalHours = sum(hours_stayed))
# Source: local data frame [5 x 2]
# City totalHours
# (fctr) (int)
# 1 Dubai 4
# 2 London 26
# 3 Mumbai 8
# 4 Paris 17
# 5 Sydney 32
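To answer question II with a single row, the per-city totals can be computed and then filtered to the largest one. This is a sketch, assuming dplyr 1.0 or later for `slice_max()`; `top_total` is an illustrative name and the data is re-entered from the question.

```r
library(dplyr)

# Sample data from the question (column name hours_stayed is an assumption)
df <- data.frame(
  City = c("London", "London", "Dubai", "Mumbai", "Sydney",
           "Sydney", "Dubai", "London", "London", "Paris"),
  hours_stayed = c(5, 4, 2, 8, 16, 16, 2, 8, 9, 17),
  stringsAsFactors = FALSE
)

# Total hours per city, then keep the city (or tied cities) with the largest total
top_total <- df %>%
  group_by(City) %>%
  summarise(totalHours = sum(hours_stayed)) %>%
  slice_max(totalHours, n = 1)

top_total
```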
III. Name of the city where he stayed longest in a single visit, how many hours, and which city
df %>%
group_by(City) %>%
summarise(max(hours_stayed))
# Source: local data frame [5 x 2]
# City max(hours_stayed)
# (fctr) (int)
# 1 Dubai 2
# 2 London 9
# 3 Mumbai 8
# 4 Paris 17
# 5 Sydney 16
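The table above gives the longest visit per city but leaves the reader to find the overall winner. As a sketch, `slice_max()` (dplyr 1.0 or later) returns the row of the longest single visit directly, keeping any ties by default. `longest_visit` is an illustrative name and the data is re-entered from the question.

```r
library(dplyr)

# Sample data from the question (column name hours_stayed is an assumption)
df <- data.frame(
  sno = 1:10,
  City = c("London", "London", "Dubai", "Mumbai", "Sydney",
           "Sydney", "Dubai", "London", "London", "Paris"),
  hours_stayed = c(5, 4, 2, 8, 16, 16, 2, 8, 9, 17),
  stringsAsFactors = FALSE
)

# Row(s) with the largest hours_stayed across all visits
longest_visit <- df %>%
  slice_max(hours_stayed, n = 1)

longest_visit
```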
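IV. Average number of hours in each city (cumulative hours)
df %>%
  group_by(City) %>%
  # mean() already returns one value per group, so the outer sum() was redundant
  summarise(meanHours = mean(hours_stayed))
# Source: local data frame [5 x 2]
# City meanHours
# (fctr) (dbl)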
# 1 Dubai 2.0
# 2 London 6.5
# 3 Mumbai 8.0
# 4 Paris 17.0
# 5 Sydney 16.0