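Performing operations on data frames and lists in R

Time: 2016-07-20 09:43:10

Tags: r list vector dataframe

Below is a snippet of data from a csv file. It lists the cities John traveled to and how long he stayed in each.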

sno  City    hours stayed
1    London  5
2    London  4
3    Dubai   2
4    Mumbai  8
5    Sydney  16
6    Sydney  16
7    Dubai   2
8    London  8
9    London  9
10   Paris   17

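I need help computing the following:

  1. The name of the city John visited most (by number of visits);
  2. The name of the city where he stayed longest in total (cumulative hours);
  3. The city where he stayed longest in a single visit, and for how many hours;
  4. The average number of hours per visit in each city.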
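
To try the answers below, the table can be rebuilt as a data frame. This is only a sketch: the object name `df1` and the column name `hours_stayed` are assumptions chosen to match the first answer, since the csv header reads "hours stayed".

```r
# Rebuild the sample data from the question.
# Assumed names: `df1` for the data frame, `hours_stayed` for the third column.
df1 <- data.frame(
  sno = 1:10,
  City = c("London", "London", "Dubai", "Mumbai", "Sydney",
           "Sydney", "Dubai", "London", "London", "Paris"),
  hours_stayed = c(5, 4, 2, 8, 16, 16, 2, 8, 9, 17),
  stringsAsFactors = FALSE
)

# Inspect the structure: 10 rows, 3 columns.
str(df1)
```

Answers that refer to `df` or `hoursstayed` would need the object or column renamed accordingly.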

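3 answers:

Answer 0 (score: 1)

We can use dplyr to compute a summary grouped by 'City', then pick the 'City' holding the maximum value in each summary column. There are many ways to do this, but dplyr is among the easiest to read. dplyr and data.table are also efficient on large datasets.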

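library(dplyr)

# Per-city summary: visit count, total hours, longest single stay, mean stay.
res <- df1 %>% 
          group_by(City) %>% 
          summarise(n = n(),
                    totalHours = sum(hours_stayed),
                    maxHours = max(hours_stayed), 
                    meanHours = mean(hours_stayed))

# For each summary column, return the City that holds the maximum value.
# across() needs dplyr >= 1.0; it stands in for the deprecated summarise_each(funs(...)).
res %>%   
          summarise(across(-City, ~ City[which.max(.x)]))
#       n totalHours maxHours meanHours
#   <chr>      <chr>    <chr>     <chr>
#1 London     Sydney    Paris     Paris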

The average number of hours per city can be read directly from 'res':

res %>% 
    select(City, meanHours)
#    City meanHours
#   <chr>     <dbl>
#1  Dubai       2.0
#2 London       6.5
#3 Mumbai       8.0
#4  Paris      17.0
#5 Sydney      16.0

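Notes:

1) If several cities tie for the maximum (number of visits or any other measure), only the first one is returned.

2) Everything can be done in a single pipeline instead of calling the functions again and again.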
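
The tie behaviour in note 1 can be illustrated with a small base R sketch. The visit counts here are hypothetical (the question's data has no tie for first place) and are only meant to show the difference between `which.max()` and a comparison against `max()`.

```r
# Hypothetical visit counts with a tie for first place (not the question's data).
visits <- c(Dubai = 2, London = 4, Mumbai = 1, Paris = 1, Sydney = 4)

# which.max() returns only the first position that holds the maximum.
first_only <- names(visits)[which.max(visits)]

# Comparing against max() keeps every tied city.
all_tied <- names(visits)[visits == max(visits)]

print(first_only)
print(all_tied)
```

If ties matter for the analysis, the second form is the safer choice.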

Another efficient option is data.table:

library(data.table)
# Same per-city summary; setDT() converts df1 to a data.table by reference.
res2 <- setDT(df1)[, .(n = .N, totalHours = sum(hours_stayed),
                        maxHours = max(hours_stayed),
                        meanHours = mean(hours_stayed))
                 ,  by =  City]

Answer 1 (score: 1)

These questions are straightforward, and the solutions are simple enough to do in base R.

#Name of most visited city by John (by number of visits)

which.max(table(df$City))
#London 
#     2 
# (2 is London's position in the alphabetically sorted table, not the visit count)

#Name of City where he stayed for longest (cumulative stay) hour
aggdata = aggregate(hoursstayed ~ City, df, sum)
aggdata[which.max(aggdata$hoursstayed), ]

#    City hoursstayed
#5 Sydney          32

#Name of city where he stayed for longest time in a single visit ,
# how many hours and which city

df[which.max(df$hoursstayed), ]

#   sno  City hoursstayed
#10  10  Paris          17

#average number of hours in each of the city (cumulative hours)

aggregate(hoursstayed ~ City, df, mean)

#   City hoursstayed
#1  Dubai         2.0
#2 London         6.5
#3 Mumbai         8.0
#4  Paris        17.0
#5 Sydney        16.0
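
Since `which.max()` on a table reports a position rather than a count, a small sketch for pulling out the city name and its visit count separately may help. The data frame is rebuilt here so the snippet runs on its own; the names `tab`, `top_city` and `top_count` are illustrative.

```r
# Sample data from the question, using this answer's column name `hoursstayed`.
df <- data.frame(
  City = c("London", "London", "Dubai", "Mumbai", "Sydney",
           "Sydney", "Dubai", "London", "London", "Paris"),
  hoursstayed = c(5, 4, 2, 8, 16, 16, 2, 8, 9, 17)
)

# Visits per city.
tab <- table(df$City)

# Name of the most visited city (first one in case of ties).
top_city <- names(tab)[which.max(tab)]

# Number of visits to that city.
top_count <- as.integer(max(tab))

print(top_city)
print(top_count)
```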

Answer 2 (score: 0)

 library(dplyr)
 # as_tibble() replaces the deprecated tbl_df()
 df <- as_tibble(df)

I. Name of the city John visited most (by number of visits)

 df %>%
 select(City) %>% 
 table() %>% 
 sort(decreasing=T)

 #  London  Dubai Sydney Mumbai  Paris 
 #       4      2      2      1      1

 # 2nd alternative
 df %>%
 group_by(City) %>%
 summarise(n=n()) %>%
 arrange(desc(n))

 # Source: local data frame [5 x 2]

 #     City     n
 #   (fctr) (int)
 # 1 London     4
 # 2  Dubai     2
 # 3 Sydney     2
 # 4 Mumbai     1
 # 5  Paris     1

II. Name of the city where he stayed longest (cumulative hours)

 df %>%
 group_by(City) %>%
 mutate(cumsum(hours_stayed)) %>% 
 arrange(City)

 # Source: local data frame [10 x 4]
 # Groups: City [5]

 #     sno   City hours_stayed cumsum(hours_stayed)
 #    (int) (fctr)        (int)                (int)
 # 1      3  Dubai            2                    2
 # 2      7  Dubai            2                    4
 # 3      1 London            5                    5
 # 4      2 London            4                    9
 # 5      8 London            8                   17
 # 6      9 London            9                   26
 # 7      4 Mumbai            8                    8
 # 8     10  Paris           17                   17
 # 9      5 Sydney           16                   16
 # 10     6 Sydney           16                   32



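 # The last running total in each city equals that city's total,
 # so sum() gives it directly; sorting puts the longest stay first
 df %>%
 group_by(City) %>%
 summarise(total_hours = sum(hours_stayed)) %>%
 arrange(desc(total_hours))

 # Source: local data frame [5 x 2]

 #     City total_hours
 #   (fctr)       (int)
 # 1 Sydney          32
 # 2 London          26
 # 3  Paris          17
 # 4 Mumbai           8
 # 5  Dubai           4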
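
As a sanity check on running totals versus grand totals, the London hours can be worked through by hand. This sketch uses only base R: `cumsum()` returns the running total, whose last element equals `sum()`, whereas adding up the running totals themselves produces a larger number that is not a total stay.

```r
# Hours of the four London visits, in the order they appear in the data.
london <- c(5, 4, 8, 9)

# Running total after each visit.
running <- cumsum(london)

# Total hours in London; identical to the last running total.
total <- sum(london)

# Sum of the running totals: not a meaningful "cumulative stay".
sum_of_running <- sum(running)

print(running)
print(total)
print(sum_of_running)
```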

III. The city where he stayed longest in a single visit, and for how many hours

 df %>%
 group_by(City) %>%
 summarise(max(hours_stayed))

 # Source: local data frame [5 x 2]

 #     City max(hours_stayed)
 #   (fctr)             (int)
 # 1  Dubai                 2
 # 2 London                 9
 # 3 Mumbai                 8
 # 4  Paris                17
 # 5 Sydney                16
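
The per-city maxima still need to be reduced to a single city to answer the question. One possible way, sketched in base R with `tapply()` instead of dplyr (the names `per_city_max`, `top`, `longest_city` and `longest_hours` are illustrative):

```r
# Sample data from the question.
df <- data.frame(
  City = c("London", "London", "Dubai", "Mumbai", "Sydney",
           "Sydney", "Dubai", "London", "London", "Paris"),
  hours_stayed = c(5, 4, 2, 8, 16, 16, 2, 8, 9, 17)
)

# Longest single stay per city.
per_city_max <- tapply(df$hours_stayed, df$City, max)

# Position of the overall longest stay (first one in case of ties).
top <- which.max(per_city_max)

longest_city <- names(per_city_max)[top]
longest_hours <- per_city_max[[top]]

print(longest_city)
print(longest_hours)
```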

IV. Average number of hours per visit in each city

 # mean() already returns one value per group, so no outer sum() is needed
 df %>%
 group_by(City) %>%
 summarise(mean(hours_stayed))

 # Source: local data frame [5 x 2]

 #     City mean(hours_stayed)
 #   (fctr)              (dbl)
 # 1  Dubai                2.0
 # 2 London                6.5
 # 3 Mumbai                8.0
 # 4  Paris               17.0
 # 5 Sydney               16.0