Disaggregate one row of data to multiple rows

Date: 2015-06-15 15:00:34

Tags: r excel statistics dataset google-adwords

Good afternoon!

I am having some trouble with my dataset. I am using a Google AdWords export for data analysis, and I want to fit a logit regression model to the data to determine whether an experiment I have conducted affects conversion.

The problem is that the data is aggregated, and to perform logit regression the dependent variable needs to be binary. So instead of having one data point with (e.g.) 10 impressions, 5 clicks and 2 conversions, I want 10 data points, of which 5 were clicked on, and of those 2 converted.

So I want to go from a data frame that looks like this (very simplified):

| Keyword      | Impressions | Clicks     | Conversions |
| SampleName   |      10     |      5     |     2       |

to this:

| Keyword      | Clicked     | Converted   |
| SampleName   |      1      |      1      |
| SampleName   |      1      |      1      |
| SampleName   |      1      |      0      |
| SampleName   |      1      |      0      |
| SampleName   |      1      |      0      |
| SampleName   |      0      |      0      |
| SampleName   |      0      |      0      |
| SampleName   |      0      |      0      |
| SampleName   |      0      |      0      |
| SampleName   |      0      |      0      |

How would I be able to do this for a very large dataset? I have looked everywhere, but I can't seem to find the solution. I would prefer to use R to do this, but I also have Excel and Stata installed.

Thanks in advance!

Edit: Here is some code (extended with an extra row and column) for the data frame. I am quite new to R and this platform. This probably isn't the cleanest way to code this, but here it goes:

Key <- c("Sample1", "Sample2")
Imp <- c(10, 6)
Cli <- c(5, 3)
Con <- c(2, 1)
CPC <- c(0.26, 0.15)
df1 <- data.frame(Key, Imp, Cli, Con, CPC)
colnames(df1) <- c("Keyword", "Impressions", "Clicks", "Conversions", "CostPerClick")

Also, I am now running into the problem that values like the average cost per click need to be repeated for each clicked row, since a price is paid for every click, and set to 0 for rows without a click. So in the end, I need a data frame that looks like this:

| Keyword   | Clicked     | Converted   |     CPC     |
| Sample1   |      1      |      1      |     0.26    |
| Sample1   |      1      |      1      |     0.26    |
| Sample1   |      1      |      0      |     0.26    |
| Sample1   |      1      |      0      |     0.26    |
| Sample1   |      1      |      0      |     0.26    |
| Sample1   |      0      |      0      |     0.00    |
| Sample1   |      0      |      0      |     0.00    |
| Sample1   |      0      |      0      |     0.00    |
| Sample1   |      0      |      0      |     0.00    |
| Sample1   |      0      |      0      |     0.00    |
| Sample2   |      1      |      1      |     0.15    |
| Sample2   |      1      |      0      |     0.15    |
| Sample2   |      1      |      0      |     0.15    |
| Sample2   |      0      |      0      |     0.00    |
| Sample2   |      0      |      0      |     0.00    |
| Sample2   |      0      |      0      |     0.00    |

Edit 2 (SOLVED)

akrun's solution seems to be the right one when tested on the sample dataset, but when I try it on my actual dataset, it gives the following error:

> result <- setDT(df1)[, list(Clicked=rep(c(1,0), c(Clicks, Impressions-Clicks)), 
+  Converted=rep(c(1,0), c(Conversions, Impressions-Conversions)), 
+  CPC=rep(c(CostPerClick, 0), c(Clicks,Impressions-Clicks))), Keyword]
Error in rep(c(1, 0), c(Clicks, Impressions - Clicks)) : 
  invalid 'times' argument

The keywords don't contain any duplicates and the data does not have any NAs:

> length(unique(df1$Keyword))
[1] 186145
> nrow(df1)
[1] 186145
> nrow(df1[complete.cases(df1),]) == nrow(df1)
[1] TRUE

A summary of the data:

> summary(df1)
   Keyword           Impressions          Clicks        Conversions       CostPerClick  
 Length:186145      Min.   :   1.00   Min.   : 1.000   Min.   :0.00000   Min.   :0.010  
 Class :character   1st Qu.:   7.00   1st Qu.: 1.000   1st Qu.:0.00000   1st Qu.:0.130  
 Mode  :character   Median :  16.00   Median : 1.000   Median :0.00000   Median :0.210  
                    Mean   :  32.93   Mean   : 2.167   Mean   :0.03368   Mean   :0.246  
                    3rd Qu.:  39.00   3rd Qu.: 2.000   3rd Qu.:0.00000   3rd Qu.:0.320  
                    Max.   :1521.00   Max.   :91.000   Max.   :4.00000   Max.   :3.680 
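The post does not say what resolved the error, so the following is only a diagnostic sketch under an assumption: `rep()` raises `invalid 'times' argument` when a count is negative, which would happen in any row where `Clicks` or `Conversions` exceeds `Impressions`. The `toy` data frame below is made up for illustration and is not taken from the real export.

```r
# Hypothetical toy data: the second row has more clicks than impressions.
toy <- data.frame(
  Keyword     = c("Sample1", "Sample2"),
  Impressions = c(10, 2),
  Clicks      = c(5, 3),
  Conversions = c(2, 0)
)

# Flag rows that would hand rep() a negative count.
bad <- with(toy, Clicks > Impressions | Conversions > Impressions)
toy[bad, ]

# Reproduce the failure on the flagged row and capture the message.
msg <- tryCatch(
  rep(c(1, 0), c(3, 2 - 3)),
  error = function(e) conditionMessage(e)
)
msg
```

Running the same `with(...)` check on the full data frame would show whether such rows exist before attempting the expansion.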

1 answer:

Answer 0 (score: 3)

Try

library(data.table)
# For each keyword, emit one row per impression: the 1s come first, then the 0s.
setDT(df1)[, list(Clicked=rep(c(1,0), c(Clicks, Impressions-Clicks)),
 Converted=rep(c(1,0), c(Conversions, Impressions-Conversions))) , Keyword]
#       Keyword Clicked Converted
# 1: SampleName       1         1
# 2: SampleName       1         1
# 3: SampleName       1         0
# 4: SampleName       1         0
# 5: SampleName       1         0
# 6: SampleName       0         0
# 7: SampleName       0         0
# 8: SampleName       0         0
# 9: SampleName       0         0
#10: SampleName       0         0

Update

Using the updated dataset in the OP's post:

# CPC repeats CostPerClick for the clicked rows and is 0 for the remaining impressions.
setDT(df1)[, list(Clicked=rep(c(1,0), c(Clicks, Impressions-Clicks)), 
 Converted=rep(c(1,0), c(Conversions, Impressions-Conversions)), 
 CPC=rep(c(CostPerClick, 0), c(Clicks,Impressions-Clicks))), Keyword]
#    Keyword Clicked Converted  CPC
# 1: Sample1       1         1 0.26
# 2: Sample1       1         1 0.26
# 3: Sample1       1         0 0.26
# 4: Sample1       1         0 0.26
# 5: Sample1       1         0 0.26
# 6: Sample1       0         0 0.00
# 7: Sample1       0         0 0.00
# 8: Sample1       0         0 0.00
# 9: Sample1       0         0 0.00
#10: Sample1       0         0 0.00
#11: Sample2       1         1 0.15
#12: Sample2       1         0 0.15
#13: Sample2       1         0 0.15
#14: Sample2       0         0 0.00
#15: Sample2       0         0 0.00
#16: Sample2       0         0 0.00
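For comparison, here is a minimal base R sketch of the same expansion that does not group by `Keyword`. It is not the answer's method. It repeats each row index `Impressions` times and marks the first `Clicks` (and first `Conversions`) positions within each row as 1. The names `idx`, `pos` and `expanded` are illustrative only.

```r
# The two-row example from the question.
df1 <- data.frame(
  Keyword      = c("Sample1", "Sample2"),
  Impressions  = c(10, 6),
  Clicks       = c(5, 3),
  Conversions  = c(2, 1),
  CostPerClick = c(0.26, 0.15)
)

# One output row per impression: repeat each row index Impressions times.
idx <- rep(seq_len(nrow(df1)), df1$Impressions)

# Position of each impression within its source row (1, 2, ..., Impressions).
pos <- sequence(df1$Impressions)

expanded <- data.frame(
  Keyword   = df1$Keyword[idx],
  Clicked   = as.integer(pos <= df1$Clicks[idx]),
  Converted = as.integer(pos <= df1$Conversions[idx])
)

# A price is paid only for clicked rows; the others get 0.
expanded$CPC <- ifelse(expanded$Clicked == 1, df1$CostPerClick[idx], 0)
```

Because the comparison is against positions rather than counts passed to `rep()`, a row with `Clicks > Impressions` would not raise an error here. All of its impressions would simply be marked as clicked, so such rows are still worth checking beforehand.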

data

 df1 <- structure(list(Keyword = "SampleName", Impressions = 10L, 
 Clicks = 5L, 
 Conversions = 2L), .Names = c("Keyword", "Impressions", "Clicks", 
 "Conversions"), class = "data.frame", row.names = c(NA, -1L))
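As a side note on the modelling goal stated in the question: for a binomial family, `glm()` also accepts the response as a two-column matrix of successes and failures, so a logit model can be fitted to aggregated counts without expanding them. The sketch below assumes conversions are modelled per click. The `Group` column and all counts are hypothetical, since the question does not show how the experiment is encoded.

```r
# Hypothetical aggregated data: one row per experiment arm.
agg <- data.frame(
  Group       = c("control", "experiment"),
  Clicks      = c(200, 220),
  Conversions = c(10, 22)
)

# Response = cbind(successes, failures); each row counts as Clicks trials.
fit <- glm(
  cbind(Conversions, Clicks - Conversions) ~ Group,
  family = binomial,
  data   = agg
)

# The Group coefficient is the log odds ratio of converting, experiment vs control.
coef(fit)
```

The coefficient estimates should match a fit on the equivalent expanded 0/1 rows, although the reported deviance differs between the two data layouts.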