How can I efficiently analyze data in Excel?

Time: 2014-02-09 07:46:57

Tags: r excel analysis

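I have data in Excel where the columns represent categories and each row holds a single user's data for those categories. The rows are not sorted in any way. Here is a sample of the data: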

user  food      date      ........
a     pizza     1/1/2013
b     fries     1/3/2013
c     sandwich  5/2/2013
a     sandwich  2/3/2010

For each user, I want to find the probability of each kind of food they have. So I would like output like this:

a  pizza     20%
   sandwich  50%
   fries     30%

b  pizza     10%
   noodle    20%

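What is the most efficient way to do this? Right now I filter by user in Excel, use R to find the frequency of each food, and then type all the foods into the Excel sheet by hand.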

1 answer:

Answer 0: (score: 2)

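If you already know some R, I would suggest you bite the bullet and do this kind of work entirely in R. Excel is a tool that has its uses, but for serious data analysis R is better and worth the investment.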
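Moving the work into R starts with getting the spreadsheet loaded. The following is a minimal sketch, assuming the sheet has been saved from Excel as a CSV file with the columns `user`, `food` and `date` shown in the question. The inline text stands in for that file, `users.csv` is a hypothetical file name, and the month/day/year date format is a guess, since the question does not state it.

```r
# Stand-in for a CSV exported from Excel. With a real file one would call
# read.csv("users.csv") instead ("users.csv" is a hypothetical name).
csv_text <- "user,food,date
a,pizza,1/1/2013
b,fries,1/3/2013
c,sandwich,5/2/2013
a,sandwich,2/3/2010"

users <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Assumption: dates are month/day/year; adjust `format` if they are day/month/year.
users$date <- as.Date(users$date, format = "%m/%d/%Y")

# Inspect column names and types before doing any analysis.
str(users)
```

Once the data is in a data frame like `users`, no manual filtering in Excel is needed.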

This is how I would do it in R:

# Create some sample data
foods = c('pizza', 'sandwich', 'tuna', 'noodles', 'fries')
persons = letters[1:10] # `letters` is a built-in constant containing the lowercase letters of the alphabet
df = data.frame(food = sample(foods, 1000, replace = TRUE),
                person = sample(persons, 1000, replace = TRUE))

# Get frequencies
table_df = table(df)
# Divide by total food eaten by each person
# In both `apply` and `sweep`, the `2` refers to performing the operation per column
prob_df = apply(table_df, 2, 
             function(food_per_person) {
                  (food_per_person / sum(food_per_person)) * 100
             })
# An alternative to using `apply` is to use `sweep`; multiply by 100 to get percentages as above:
prob_df = sweep(table_df, 2, margin.table(table_df, 2), FUN = "/") * 100
prob_df
# All close to 20%, as expected
        person
                  a        b        c        d        e        f        g
  fries    21.34831 22.88136 17.17172 19.04762 19.81132 18.34862 16.03774
  noodles  19.10112 19.49153 19.19192 23.80952 18.86792 22.01835 19.81132
  pizza    13.48315 18.64407 16.16162 19.04762 16.03774 13.76147 23.58491
  sandwich 24.71910 21.18644 22.22222 13.09524 23.58491 30.27523 18.86792
  tuna     21.34831 17.79661 25.25253 25.00000 21.69811 15.59633 21.69811
          person
                  h        i        j
  fries    23.14815 18.75000 11.76471
  noodles  17.59259 26.04167 24.70588
  pizza    19.44444 19.79167 18.82353
  sandwich 23.14815 14.58333 24.70588
  tuna     16.66667 20.83333 20.00000

Check the result, i.e. that the percentages for each person add up to 100%:

colSums(prob_df)
  a   b   c   d   e   f   g   h   i   j 
100 100 100 100 100 100 100 100 100 100
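The same idea can be applied to data shaped like the sample in the question and reshaped into the one-row-per-user-and-food layout the asker wanted. This is only a sketch: it swaps in base R's `prop.table` in place of `apply`/`sweep`, and the data frame `users` and the column name `percent` are illustrative names, not part of the original answer.

```r
# The four sample rows from the question (only user and food are needed here).
users <- data.frame(
  user = c("a", "b", "c", "a"),
  food = c("pizza", "fries", "sandwich", "sandwich"),
  stringsAsFactors = FALSE
)

# Cross-tabulate users (rows) against foods (columns).
counts <- table(user = users$user, food = users$food)

# margin = 1 divides each row by its row total, giving per-user shares.
pct <- prop.table(counts, margin = 1) * 100

# Reshape to one row per user/food pair and drop foods a user never had.
long <- as.data.frame(pct, responseName = "percent", stringsAsFactors = FALSE)
long <- long[long$percent > 0, ]
long <- long[order(long$user, -long$percent), ]
long
```

With only these four rows, user `a` comes out at 50% pizza and 50% sandwich; on the full sheet the percentages reflect every row for each user. The resulting data frame could then be written back out with `write.csv` if the result is needed in Excel.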