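I have a data frame containing a list of 40,802 gene names, and a second data frame with information on 14,000 articles. The article information contains Article, Abstract, Day, Month, and Year.
I have converted the date into a standard format and the abstract to character.
I want a plot with time on the X axis showing how often each gene name appears in the abstracts. For example: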
| Date | Gene Name | Frequency |
|------------|-----------|-----------|
| 2017-03-20 | GAPDH | 5 |
| 2017-03-21 | AKT | 6 |
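Basically, I want to know which gene names were published most frequently in the last 100 days, and to have a timeline showing how the frequency of those gene names evolves. Something like a trend.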
library(RISmed)
## Search query - can be anything relevant to protein expression.
## Multiple search terms not tested yet
search_topic <- 'protein expression'
## Run the query: reldate = number of days before today, retmax = maximum number of returned results
search_query <- EUtilsSummary(search_topic, retmax=15000, reldate = 100)
## Explore the outcome
summary(search_query)
## Get the IDs of all matching articles
QueryId(search_query)
## Get all the records associated with the IDs - THIS TAKES A LONG TIME
records<- EUtilsGet(search_query)
##Analyze the structure
str(records)
summary(records)
## Create a data frame with article/abstract/date (keep text as character, not factor)
pubmed_data <- data.frame('Title'=ArticleTitle(records),'Abstract'=AbstractText(records),
"Day"=DayPubmed(records), "Month" = MonthPubmed(records), "Year"=YearPubmed(records),
stringsAsFactors = FALSE)
##explore the data
head(pubmed_data,1)
##gene names
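genename <- read.csv("genename.csv", header = TRUE, stringsAsFactors = FALSE)
## Remove rows with NA titles (logical indexing also works when there are no NAs,
## whereas -which(...) would drop every row in that case)
pubmed <- pubmed_data[!is.na(pubmed_data$Title), ]
## Coerce the date to YYYY-MM-DD (include the year, otherwise as.Date assumes the current year)
pubmed$Date <- as.Date( paste( pubmed$Year , pubmed$Month , pubmed$Day , sep = "-" ) , format = "%Y-%m-%d" )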
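I have read a lot and cannot figure out how to find genename[1,1] within pubmed$Abstract and count how many times it appears over time.
I would like to make a plot where X is the last 100 days and each line is the frequency of one gene name, with the gene names as the legend, so that a trend can be observed.
I would really appreciate guidance on how to do this.
I have tried tm and a lot of other things, but I keep hitting a wall. Is my approach wrong?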
Answer 0 (score: 0)
# from: https://stackoverflow.com/questions/45485701/count-frequency-of-words-in-text-and-create-plot
# get some text
txt <- c("I have a list of data frame with 40802 gene names and I have data frame with 14000 article information.
The article information contains Article, Abstract, Day, Month, Year.I have transformed the date into normal format,
and the abstract as character. I want to have a plot of X in time, and the frequency of the gene names appears in the abstract.
Basically, I want to know the gene names most frequently published in the last 100 days and have a timeline to see the evolution of said genenames.
Something like a trend.")
# split the text into short chunks so each row mimics one abstract
txt <- strwrap(x = txt, width = 20)
# create an example data frame
pubmed_data <- data.frame(Title = abbreviate(names.arg = txt, minlength = 5, method = "left.kept", named = FALSE),
                          Abstract = txt, stringsAsFactors = FALSE)
pubmed_data
# tm package
library(tm)
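# term frequencies over all abstracts, lower-cased, without punctuation or numbers
wrds <- termFreq(doc = pubmed_data$Abstract,control = list(tolower=TRUE,removePunctuation=TRUE,removeNumbers=TRUE))
wrds <- sort(unclass(wrds),decreasing = TRUE)
wrds <- data.frame(tokens=names(wrds),n=as.integer(wrds))
# order the tokens by count so the bars are sorted in the plot
wrds$tokens <- reorder(wrds$tokens,wrds$n)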
library(ggplot2)
ggplot(data = wrds,aes(x = tokens,y = n,fill=n))+geom_bar(stat="identity")+scale_y_continuous(breaks = 1:max(wrds$n))+
coord_flip()
# tidy packages
library(tidytext)
library(dplyr)
# one token per row, keep tokens containing at least one non-digit, then count each token
wrds2 <- pubmed_data %>% select(-Title) %>% unnest_tokens(input = "Abstract",output = "tokens",to_lower = TRUE) %>%
filter(grepl(pattern="\\D+",x=.$tokens)) %>% group_by(tokens) %>%
count %>% ungroup %>% mutate(tokens=reorder(tokens,n))
# use wrds2 (not wrds) for the axis breaks so the plot matches the data being drawn
ggplot(data = wrds2,aes(x = tokens,y = n,fill=n))+geom_bar(stat="identity")+scale_y_continuous(breaks = 1:max(wrds2$n))+
coord_flip()
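
The code above counts every word, whereas the question asks for counts of specific gene names per date. A minimal base-R sketch of that step might look like the following. It assumes a `pubmed` data frame with `Date` and `Abstract` columns as in the question; the toy rows, the `genes` vector, and the helper `count_gene` are hypothetical stand-ins for illustration, not part of the original code. Matching is whole-word and case-sensitive, and gene names are assumed to contain no regular-expression metacharacters.

```r
# Toy stand-ins for the real data (hypothetical values, for illustration only)
pubmed <- data.frame(
  Date = as.Date(c("2017-03-20", "2017-03-20", "2017-03-21", "2017-03-21")),
  Abstract = c("GAPDH was used as a loading control; AKT was phosphorylated.",
               "Expression of GAPDH was stable.",
               "AKT signalling regulates growth. AKT is a kinase.",
               "No gene of interest is mentioned here."),
  stringsAsFactors = FALSE
)
genes <- c("GAPDH", "AKT", "TP53")  # in practice: the gene-name column of genename

# Hypothetical helper: whole-word, case-sensitive matches of one gene per abstract
count_gene <- function(gene, text) {
  pattern <- paste0("\\b", gene, "\\b")
  hits <- gregexpr(pattern, text, perl = TRUE)
  # gregexpr returns -1 when there is no match, so count positive positions only
  vapply(hits, function(h) sum(h > 0), integer(1))
}

# Long table with one row per (gene, date): mentions summed over that day's abstracts
trend <- do.call(rbind, lapply(genes, function(g) {
  per_day <- tapply(count_gene(g, pubmed$Abstract), pubmed$Date, sum)
  data.frame(Date = as.Date(names(per_day)), Gene = g,
             Frequency = as.integer(per_day), stringsAsFactors = FALSE)
}))

# Keep only the most-mentioned genes overall (here the top 2)
totals <- sort(tapply(trend$Frequency, trend$Gene, sum), decreasing = TRUE)
top_genes <- names(totals)[1:2]
trend_top <- trend[trend$Gene %in% top_genes, ]
trend_top
```

`trend_top` has the Date / Gene / Frequency shape shown in the question, so it could be passed to `ggplot(trend_top, aes(x = Date, y = Frequency, colour = Gene)) + geom_line()` to draw one line per gene. With roughly 40,000 gene names and 14,000 abstracts, this gene-by-gene loop may be slow; tokenising the abstracts once and filtering the tokens against the gene list is probably a faster route, but that has not been tried here.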