I have a test data set of article titles (test$title) and their total social shares (test$total_shares). I can find the most frequently used trigrams using, say:
library(tau)
trigrams = textcnt(test$title, n = 3, method = "string")
trigrams = trigrams[order(trigrams, decreasing = TRUE)]
head(trigrams, 20)
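However, what I would like to do is rank the top trigrams by average shares rather than by number of occurrences.
I can use grepl to find the average shares for any particular trigram, e.g.
library(dplyr)
HowTo <- filter(test, grepl('how to create', title, ignore.case = TRUE))
and then use: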
summary(HowTo)
to see the average shares of the headlines that contain that trigram.
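The manual lookup above can be looped over a set of candidate trigrams. The following is a minimal sketch in base R, assuming a small made-up data frame `demo` and a hypothetical helper `mean_shares_for()`; neither comes from the original post.

```r
# Hypothetical toy data; titles and share counts are invented for illustration.
demo <- data.frame(
  title = c("How to create a course", "How to create a quiz", "Why MOOCs fail"),
  total_shares = c(100, 300, 50),
  stringsAsFactors = FALSE
)

# Mean shares of the titles containing a given phrase (case-insensitive,
# literal match rather than a regular expression).
mean_shares_for <- function(phrase, data) {
  hit <- grepl(tolower(phrase), tolower(data$title), fixed = TRUE)
  mean(data$total_shares[hit])
}

candidates <- c("how to create", "why moocs fail")
result <- sapply(candidates, mean_shares_for, data = demo)
print(result)
```

This still requires knowing the candidate trigrams in advance and rescans every title once per phrase, so it scales poorly when the list of trigrams is large.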
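But this is a time-consuming process. What I would like to do is compute the top trigrams in the data set by average shares. Thanks for your help.
Here is a sample data set: https://d380wq8lfryn3c.cloudfront.net/wp-content/uploads/2017/06/16175029/test4.csv
I tend to remove non-ASCII characters from the titles using
test$title <- sapply(test$title, function(row) iconv(row, from = "UTF-8", to = "ASCII", sub = ""))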
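As a side note on the cleaning step, `iconv()` accepts a whole character vector, so the `sapply()` wrapper is optional. A small sketch with invented titles (the vector `titles` is an assumption for illustration only):

```r
# iconv() is vectorized: it converts every element of the character vector.
# "\u00e9" is an accented e, written as an escape so the script stays ASCII.
titles <- c("caf\u00e9 culture", "plain ascii title")
cleaned <- iconv(titles, from = "UTF-8", to = "ASCII", sub = "")
print(cleaned)
```

With `sub = ""` the bytes that cannot be represented in ASCII are dropped, while titles that are already plain ASCII pass through unchanged.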
Answer 0 (score: 0)
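Yes, this is a bit tricky. I broke it down into manageable chunks and then strung them together, which means I may have missed some shortcuts, but at least it seems to work.
Oh, I forgot to mention: if you use textcnt() the way you did, trigrams will be formed that span the end of one title and the beginning of the next. I considered that undesirable and found a way around it.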
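The point about trigrams crossing title boundaries can be illustrated without tau. This is a sketch in base R under the assumption that pooling behaves like concatenating the titles into one running text, which simplifies what textcnt() does internally; `word_trigrams()` is a hypothetical helper, not part of any package.

```r
# Hypothetical helper: all word trigrams of a single string, base R only.
word_trigrams <- function(s) {
  w <- strsplit(s, " +")[[1]]
  w <- w[nzchar(w)]                      # drop empty tokens from leading spaces
  if (length(w) < 3) return(character(0))
  i <- seq_len(length(w) - 2)
  paste(w[i], w[i + 1], w[i + 2])
}

titles <- c("elearning solutions for business", "elearning elevated today")

# Treating all titles as one running text creates trigrams that straddle titles.
pooled <- word_trigrams(paste(titles, collapse = " "))
# Processing each title separately keeps every trigram inside its own title.
per_title <- unlist(lapply(titles, word_trigrams))

spanning <- setdiff(pooled, per_title)
print(spanning)
```

The two trigrams in `spanning` exist only because the end of the first title was glued to the start of the second, which is the effect the per-title approach avoids.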
library(tau)
library(magrittr)
test0 <- read.csv(paste0("https://d380wq8lfryn3c.cloudfront.net/",
                         "wp-content/uploads/2017/06/16175029/test4.csv"),
                  header=TRUE, stringsAsFactors=FALSE)
test0[7467,] #problematic line
test <- test0
# test <- head(test0, 20)
test$title <- iconv(test$title, from="UTF-8", to="ASCII", sub=" ")
test$title <- test$title %>%
    tolower %>%
    gsub("[,/]", " ", .) %>%    #replace , and / with space
    gsub("[^a-z ]", "", .) %>%  #keep only letters and spaces
    gsub(" +", " ", .) %>%      #shrink multiple spaces to one
    gsub("^ ", "", .) %>%       #remove leading spaces
    gsub(" $", "", .)           #remove trailing spaces
test[7467,] #problematic line resolved
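# one character vector of trigrams per title (lapply always returns a list)
trigram_list <- lapply(test$title,
    function(s) names(textcnt(s, n=3, method="string")))
# repeat each title's share count once per trigram; carrying the shares in
# names() and flattening with do.call(c, ...) would append an index to each
# name (e.g. "13646" becomes "136461") and corrupt the numbers
trigrams.df <- data.frame(
    trigrams=unlist(trigram_list, use.names=FALSE),
    shares=rep(test$total_shares, lengths(trigram_list)),
    stringsAsFactors=FALSE)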
# aggregate shares by trigram: the share counts of identical trigrams
# are summarized using some function (sum, mean, median etc.)
trigrams_share <- aggregate(shares ~ trigrams, data=trigrams.df, sum)
# more than one statistic can be calculated at once
trigrams_share <- aggregate(shares ~ trigrams, data=trigrams.df,
    FUN=function(x) c(mean=mean(x), sum=sum(x), nhead=length(x)))
trigrams_share <- do.call(data.frame, trigrams_share)
trigrams_share[[1]] <- as.character(trigrams_share[[1]])
# top five trigrams by average number of shares,
# among those found in three or more headlines
trigrams_share <- trigrams_share[order(
    trigrams_share[["shares.mean"]], decreasing=TRUE), ]
head(trigrams_share[trigrams_share[["shares.nhead"]] >= 3, ], 5)
# output as posted with the original answer:
#                           trigrams shares.mean shares.sum shares.nhead
# 37588                the secret to    42852.75     171411            4
# 43607                    will be a    24779.00     123895            5
# 44945        your career elearning    23012.00      92048            4
# 31454            raises million to    21378.67      64136            3
# 6419  classroom elearning industry    18812.38     150499            8
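The step that computes several statistics at once relies on a detail of `aggregate()` that is easy to miss: when the function returns a vector, the result holds a single matrix-valued column, and `do.call(data.frame, ...)` spreads it into ordinary columns. A small sketch on invented data (the data frame `toy` is an assumption for illustration):

```r
# Invented toy data: one row per (trigram, shares) pair.
toy <- data.frame(
  trigrams = c("a b c", "a b c", "x y z"),
  shares   = c(10, 30, 5),
  stringsAsFactors = FALSE
)

stats <- aggregate(shares ~ trigrams, data = toy,
                   FUN = function(x) c(mean = mean(x), sum = sum(x),
                                       nhead = length(x)))
# 'stats' has only two columns; 'shares' is a matrix with three columns.
print(ncol(stats))
print(is.matrix(stats$shares))

# Flatten the matrix column into separate, individually named columns.
flat <- do.call(data.frame, stats)
print(names(flat))
```

After flattening, columns such as `shares.mean` and `shares.nhead` can be used directly for sorting and filtering.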
In case the link goes dead:
# dput(head(test0, 20)):
test <- structure(list(
title = c("Top 3 Myths About BYOD In The Classroom - eLearning Industry",
"The Emotional Weight of Being Graded, for Better or Worse",
"Online learning startup Coursera raises $64M at an $800M valuation",
"LinkedIn doubles down on education with LinkedIn Learning, updates desktop site",
"Create Your eLearning Resume - eLearning Industry",
"The Disruption of Digital Learning: Ten Things We Have Learned",
"'Top universities to offer full degrees online in five years' - BBC News",
"Schools will teach 'soft skills' from 2017, but assessing them presents a challenge",
"Top 5 Lead-Generating Ideas for Your Content Marketing",
"'Top universities to offer full degrees online in five years' - BBC News",
"The long-distance learners of Aleppo - BBC News",
"eLearning Solutions for Business",
"6 Top eLearning Course Reviewer Tools And Selection Criteria - eLearning Industry",
"eLearning Elevated",
"When Teachers and Technology Let Students Be Masters of Their Own Learning",
"Aviation Technical English online elearning course",
"How the Pioneers of the MOOC Got It Wrong",
"Study challenges cost and price myths of online education",
"10 Easy Ways to Integrate Technology in Your Classroom",
"7 e-learning trends for educational institutions in 2017"
), total_shares = c(13646L, 12120L, 8328L, 5945L, 5853L, 5108L,
4944L, 3570L, 3104L, 2841L, 2463L, 2227L, 2218L, 2210L, 2200L,
2117L, 2039L, 1876L, 1861L, 1779L)), .Names = c("title", "total_shares"
), row.names = c(NA, 20L), class = "data.frame")