Goodafternoon!
I am having some trouble with my dataset. I am using a Google AdWords export for data analysis and I want to fit a logit regression model to the data to determine whether an experiment I have conducted impacts the conversion.
The problem is that the data is aggregated and to be able to perform logit regression, the dependent variable needs to be binary. So Instead of having a data point with (e.g.) 10 impressions, 5 Clicks and 2 conversions, I want 10 datapoints of which 5 are clicked on of which 2 have converted.
So I want to go from a data frame that looks like this (very simplified)
| Keyword | Impressions | Clicks | Conversions |
| SampleName | 10 | 5 | 2 |
to this:
| Keyword | Clicked | Converted |
| SampleName | 1 | 1 |
| SampleName | 1 | 1 |
| SampleName | 1 | 0 |
| SampleName | 1 | 0 |
| SampleName | 1 | 0 |
| SampleName | 0 | 0 |
| SampleName | 0 | 0 |
| SampleName | 0 | 0 |
| SampleName | 0 | 0 |
| SampleName | 0 | 0 |
How would I be able to do this for a very large dataset? I have looked everywhere, but I can't seem to find the solution. I would prefer to use R to do this, but I also have Excel and Stata installed.
Thanks in advance!
Edit Here is some code (extended with an extra row and column) for the data frame. I am quite new to R and this platform. This probably isn't the most clean way to code this, but here it goes:
Key <- c("Sample1", "Sample2")
Imp <- c(10, 6)
Cli <- c(5, 3)
Con <- c(2, 1)
CPC <- c(0.26, 0.15)
df1 <- data.frame(Key, Imp, Cli, Con, CPC)
colnames(df1) <- c("Keyword", "Impressions", "Clicks", "Conversions", "CostPerClick")
Also, I am now running into the problem that things like average costs per click need to be repeated for clicks, since for each click a price is paid. So in the end, I need a dataframe that looks like this:
| Keyword | Clicked | Converted | CPC |
| Sample1 | 1 | 1 | 0.26 |
| Sample1 | 1 | 1 | 0.26 |
| Sample1 | 1 | 0 | 0.26 |
| Sample1 | 1 | 0 | 0.26 |
| Sample1 | 1 | 0 | 0.26 |
| Sample1 | 0 | 0 | 0.00 |
| Sample1 | 0 | 0 | 0.00 |
| Sample1 | 0 | 0 | 0.00 |
| Sample1 | 0 | 0 | 0.00 |
| Sample1 | 0 | 0 | 0.00 |
| Sample2 | 1 | 1 | 0.15 |
| Sample2 | 1 | 0 | 0.15 |
| Sample2 | 1 | 0 | 0.15 |
| Sample2 | 0 | 0 | 0.00 |
| Sample2 | 0 | 0 | 0.00 |
| Sample2 | 0 | 0 | 0.00 |
Edit 2 (SOLVED)
akrun's solution seems to be right one when tested on the sample dataset, but if I try to test in on my actual dataset, it is giving the following error:
> result <- setDT(df1)[, list(Clicked=rep(c(1,0), c(Clicks, Impressions-Clicks)),
+ Converted=rep(c(1,0), c(Conversions, Impressions-Conversions)),
+ CPC=rep(c(CostPerClick, 0), c(Clicks,Impressions-Clicks))), Keyword]
Error in rep(c(1, 0), c(Clicks, Impressions - Clicks)) :
invalid 'times' argument
The keywords don't contain any duplicates and the data does not have NA's:
> length(unique(df1$Keyword))
[1] 186145
> nrow(df1)
[1] 186145
> nrow(df1[complete.cases(df1),]) == nrow(df1)
[1] TRUE
a summary of the data:
> summary(df1)
Keyword Impressions Clicks Conversions CostPerClick
Length:186145 Min. : 1.00 Min. : 1.000 Min. :0.00000 Min. :0.010
Class :character 1st Qu.: 7.00 1st Qu.: 1.000 1st Qu.:0.00000 1st Qu.:0.130
Mode :character Median : 16.00 Median : 1.000 Median :0.00000 Median :0.210
Mean : 32.93 Mean : 2.167 Mean :0.03368 Mean :0.246
3rd Qu.: 39.00 3rd Qu.: 2.000 3rd Qu.:0.00000 3rd Qu.:0.320
Max. :1521.00 Max. :91.000 Max. :4.00000 Max. :3.680
答案 0 :(得分:3)
Try
library(data.table)
setDT(df1)[, list(Clicked=rep(c(1,0), c(Clicks, Impressions-Clicks)),
Converted=rep(c(1,0), c(Conversions, Impressions-Conversions))) , Keyword]
# Keyword Clicked Converted
# 1: SampleName 1 1
# 2: SampleName 1 1
# 3: SampleName 1 0
# 4: SampleName 1 0
# 5: SampleName 1 0
# 6: SampleName 0 0
# 7: SampleName 0 0
# 8: SampleName 0 0
# 9: SampleName 0 0
#10: SampleName 0 0
Using the updated dataset in the OP's post
setDT(df1)[, list(Clicked=rep(c(1,0), c(Clicks, Impressions-Clicks)),
Converted=rep(c(1,0), c(Conversions, Impressions-Conversions)),
CPC=rep(c(CostPerClick, 0), c(Clicks,Impressions-Clicks))), Keyword]
# Keyword Clicked Converted CPC
# 1: Sample1 1 1 0.26
# 2: Sample1 1 1 0.26
# 3: Sample1 1 0 0.26
# 4: Sample1 1 0 0.26
# 5: Sample1 1 0 0.26
# 6: Sample1 0 0 0.00
# 7: Sample1 0 0 0.00
# 8: Sample1 0 0 0.00
# 9: Sample1 0 0 0.00
#10: Sample1 0 0 0.00
#11: Sample2 1 1 0.15
#12: Sample2 1 0 0.15
#13: Sample2 1 0 0.15
#14: Sample2 0 0 0.00
#15: Sample2 0 0 0.00
#16: Sample2 0 0 0.00
df1 <- structure(list(Keyword = "SampleName", Impressions = 10L,
Clicks = 5L,
Conversions = 2L), .Names = c("Keyword", "Impressions", "Clicks",
"Conversions"), class = "data.frame", row.names = c(NA, -1L))