ngram results come out as single-term frequencies

Time: 2017-09-17 18:14:21

Tags: r tm

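I am trying to extract frequencies using ngrams (bigrams, trigrams, etc.), but the final result contains the frequencies of single terms. Why does this happen, and what do I need to fix in my code?

Here is the data I am using: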

df <- structure(list(text = c("the discipline of phenomenology is defined by its domain of study  its methods  and its main results ", 
"phenomenology studies structures of conscious experience as experienced from the first person point of view  along with relevant conditions of experience  the central structure of an experience is its intentionality  the way it is directed through its content or meaning toward a certain object in the world ", 
"we all experience various types of experience including perception  imagination  thought  emotion  desire  volition  and action  thus  the domain of phenomenology is the range of experiences including these types  among others   experience includes not only relatively passive experience as in vision or hearing  but also active experience as in walking or hammering a nail or kicking a ball   the range will be specific to each species of being that enjoys consciousness  our focus is on our own  human  experience  not all conscious beings will  or will be able to  practice phenomenology  as we do  ", 
"conscious experiences have a unique feature  we experience them  we live through them or perform them  other things in the world we may observe and engage  but we do not experience them  in the sense of living through or performing them  this experiential or first person feature — that of being experienced — is an essential part of the nature or structure of conscious experience  as we say  “i see   think   desire   do …” this feature is both a phenomenological and an ontological feature of each experience  it is part of what it is for the experience to be experienced  phenomenological  and part of what it is for the experience to be  ontological  ", 
"how shall we study conscious experience  we reflect on various types of experiences just as we experience them  that is to say  we proceed from the first person point of view  however  we do not normally characterize an experience at the time we are performing it  in many cases we do not have that capability  a state of intense anger or fear  for example  consumes all of one s psychic focus at the time  rather  we acquire a background of having lived through a given type of experience  and we look to our familiarity with that type of experience  hearing a song  seeing a sunset  thinking about love  intending to jump a hurdle  the practice of phenomenology assumes such familiarity with the type of experiences to be characterized  importantly  also  it is types of experience that phenomenology pursues  rather than a particular fleeting experience — unless its type is what interests us ", 
"classical phenomenologists practiced some three distinguishable methods     we describe a type of experience just as we find it in our own  past  experience  thus  husserl and merleau ponty spoke of pure description of lived experience     we interpret a type of experience by relating it to relevant features of context  in this vein  heidegger and his followers spoke of hermeneutics  the art of interpretation in context  especially social and linguistic context     we analyze the form of a type of experience  in the end  all the classical phenomenologists practiced analysis of experience  factoring out notable features for further elaboration ", 
"these traditional methods have been ramified in recent decades  expanding the methods available to phenomenology  thus     in a logico semantic model of phenomenology  we specify the truth conditions for a type of thinking  say  where i think that dogs chase cats  or the satisfaction conditions for a type of intention  say  where i intend or will to jump that hurdle      in the experimental paradigm of cognitive neuroscience  we design empirical experiments that tend to confirm or refute aspects of experience  say  where a brain scan shows electrochemical activity in a specific region of the brain thought to subserve a type of vision or emotion or motor control   this style of “neurophenomenology” assumes that conscious experience is grounded in neural activity in embodied action in appropriate surroundings — mixing pure phenomenology with biological and physical science in a way that was not wholly congenial to traditional phenomenologists ", 
"what makes an experience conscious is a certain awareness one has of the experience while living through or performing it  this form of inner awareness has been a topic of considerable debate  centuries after the issue arose with locke s notion of self consciousness on the heels of descartes  sense of consciousness  conscience  co knowledge   does this awareness of experience consist in a kind of inner observation of the experience  as if one were doing two things at once   brentano argued no   is it a higher order perception of one s mind s operation  or is it a higher order thought about one s mental activity   recent theorists have proposed both   or is it a different form of inherent structure   sartre took this line  drawing on brentano and husserl   these issues are beyond the scope of this article  but notice that these results of phenomenological analysis shape the characterization of the domain of study and the methodology appropriate to the domain  for awareness of experienc... <truncated>
"conscious experience is the starting point of phenomenology  but experience shades off into less overtly conscious phenomena  as husserl and others stressed  we are only vaguely aware of things in the margin or periphery of attention  and we are only implicitly aware of the wider horizon of things in the world around us  moreover  as heidegger stressed  in practical activities like walking along  or hammering a nail  or speaking our native tongue  we are not explicitly conscious of our habitual patterns of action  furthermore  as psychoanalysts have stressed  much of our intentional mental activity is not conscious at all  but may become conscious in the process of therapy or interrogation  as we come to realize how we feel or think about something  we should allow  then  that the domain of phenomenology — our own experience — spreads out from conscious experience into semi conscious and even unconscious mental activity  along with relevant background conditions implicitly invoked in ... <truncated>
)), .Names = "text", row.names = c(NA, -9L), class = "data.frame")

Here is the code I use to get the bigrams and trigrams:

library("tm")
library(slam)
df2 <- df$text
#use tm package
review_source <-  VectorSource(df2)
corpus <- Corpus(review_source)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Functions
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min=2, max=2))}

# Bigrams
options(mc.cores=1)
dtm.docs.2g <- DocumentTermMatrix(corpus, control=list(tokenize=BigramTokenizer))

# To get the bigram dist, we use the slam package for ops with simple triplet mat
sums.2g <- colapply_simple_triplet_matrix(dtm.docs.2g,FUN=sum)
sums.2g <- sort(sums.2g, decreasing=T)

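When I type sums.2g in the console, I expect frequencies for two-word phrases, but I only get frequencies for single terms. Here is a sample of the output I mean.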

  
    

sums.2g
   experience     conscious phenomenology          type
           44            15            13            10
       domain         first        person             …
            6             5             5             5

  

1 answer:

Answer 0 (score: 2)

You have

class(corpus)
# [1] "SimpleCorpus" "Corpus"

and ?DocumentTermMatrix explains, regarding its control argument:

  

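This is different for a SimpleCorpus. In this case all options are processed in a fixed order in one pass to improve performance. It always uses the Boost Tokenizer (via Rcpp) and takes no custom functions as option arguments.

So the custom BigramTokenizer is silently ignored and the terms stay single words. Use corpus <- VCorpus(review_source) instead.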
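As a minimal sketch of that fix, the pipeline below builds a VCorpus so that the custom tokenizer is honoured. Two things here are assumptions that do not come from the question: the two toy documents stand in for df$text, and the tokenizer is built on ngrams() and words() from the NLP package (attached by tm) rather than on RWeka, so the example runs without Java. The RWeka tokenizer from the question should work the same way once the corpus is a VCorpus.

```r
library(tm)    # attaches NLP, which provides ngrams() and words()
library(slam)

# Hypothetical toy documents standing in for df$text
docs <- c("conscious experience is the starting point",
          "we study conscious experience from the first person point of view")

# VCorpus (not Corpus): a SimpleCorpus would ignore the custom tokenizer
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Bigram tokenizer based on NLP::ngrams (swapped in for RWeka::NGramTokenizer)
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}

dtm.docs.2g <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))

# Column sums of the sparse matrix give the overall bigram frequencies
sums.2g <- colapply_simple_triplet_matrix(dtm.docs.2g, FUN = sum)
sums.2g <- sort(sums.2g, decreasing = TRUE)
head(sums.2g)
```

With a VCorpus, the names of sums.2g should be two-word terms. For trigrams, the same tokenizer with ngrams(words(x), 3) should do.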