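I have a data.frame with 110M rows, many of which are identical. I need to aggregate the table, collapsing identical rows into one and adding a frequency column to the resulting frame.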
library(plyr)
library(data.table)
DataFrame1 = structure(list(company_nm = c("Acme Markets (NAI)", "Acme Markets (NAI)",
"Acme Markets (NAI)", "Acme Markets (NAI)", "Acme Markets (NAI)",
"Acme Markets (NAI)", "Acme Markets (NAI)", "Acme Markets (NAI)",
"Acme Markets (NAI)", "Acme Markets (NAI)"), `tier 4` = c("Vitamins",
"Internal Analgesics", "Nutrition Bars", "Carbonated Soft Drinks",
"Bottled Water", "Bottled Water", "Bottled Water", "Bottled Water",
"Bottled Water", "Popcorn"), `tier 3` = c("Vitamin Supplements",
"Analgesics", "Nutrition", "Beverage", "Beverage", "Beverage",
"Beverage", "Beverage", "Beverage", "Snacks"), `tier 2` = c("Health Care",
"Health Care", "Health Care", "Dry Grocery", "Dry Grocery", "Dry Grocery",
"Dry Grocery", "Dry Grocery", "Dry Grocery", "Dry Grocery"),
`tier 1` = c("Health & Personal Care", "Health & Personal Care",
"Health & Personal Care", "Grocery", "Grocery", "Grocery",
"Grocery", "Grocery", "Grocery", "Grocery"), Market = c("Randolph, NJ",
"Yonkers, NY", "Newark, NJ", "Newark, NJ", "Lancaster, PA",
"Wilmington, DE", "Philadelphia, PA", "Lancaster, PA", "Wilmington, DE",
"Randolph, NJ"), RetMktAC = c("Acme Markets (NAI), Randolph, NJ",
"Acme Markets (NAI), Yonkers, NY", "Acme Markets (NAI), Newark, NJ",
"Acme Markets (NAI), Newark, NJ", "Acme Markets (NAI), Lancaster, PA",
"Acme Markets (NAI), Wilmington, DE", "Acme Markets (NAI), Philadelphia, PA",
"Acme Markets (NAI), Lancaster, PA", "Acme Markets (NAI), Wilmington, DE",
"Acme Markets (NAI), Randolph, NJ")), .Names = c("company_nm",
"tier 4", "tier 3", "tier 2", "tier 1", "Market", "RetMktAC"), row.names = c(NA,
-10L), class = c("data.table", "data.frame"))
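# Repeat each of the 10 sample rows 10,000 times to simulate heavily duplicated data (100,000 rows)
df = DataFrame1[rep(seq_len(nrow(DataFrame1)), each=10000),]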
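I have looked into ddply and data.table, but (a) the former is very slow compared to the latter, and (b) I could not figure out how to do this efficiently (i.e., without listing all the columns to aggregate on).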
The resulting data frame would look like: "all columns from original" "freq"
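To make the target shape concrete, here is a minimal sketch of one way it might be produced with data.table, grouping on every column via `names()` so that no column is typed out by hand. The small table `dt` and its columns `company` and `market` are hypothetical stand-ins, not the data above, and I have not checked how this scales to 110M rows.

```r
library(data.table)

# Hypothetical toy table with duplicated rows (not the original data)
dt <- data.table(
  company = c("A", "A", "A", "B", "B"),
  market  = c("X", "X", "Y", "X", "X")
)

# Group by all columns without listing them; .N is the row count per group
counts <- dt[, .N, by = names(dt)]

# Rename the count column to match the desired "freq" name
setnames(counts, "N", "freq")

print(counts)
```

Each distinct combination of column values should appear once in `counts`, and the `freq` values should sum to the number of rows in the input.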
Thanks!