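Ten highest column values in an R data frame

Time: 2013-12-23 16:24:55

Tags: r max tm keyword-search

I am currently working on a project that extracts keywords from blocks of text. Below is a sample of the first three entries from the initial list. (Apologies for the length.)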

descriptest<-c("Columbia University is one of the world's most important centers of research and at the same time a distinctive and distinguished learning environment for undergraduates and graduate students in many scholarly and professional fields. The University recognizes the importance of its location in New York City and seeks to link its research and teaching to the vast resources of a great metropolis. It seeks to attract a diverse and international faculty and student body, to support research and teaching on global issues, and to create academic relationships with many countries and regions. It expects all areas of the university to advance knowledge and learning at the highest level and to convey the products of its efforts to the world.", 
"", "UMass Amherst was born in 1863 as a land-grant agricultural college set on 310 rural acres with four faculty members, four wooden buildings, 56 students and a curriculum combining modern farming, science, technical courses, and liberal arts.\n\nOver time, the curriculum, facilities, and student body outgrew the institution's original mission. In 1892 the first female student enrolled and graduate degrees were authorized. By 1931, to reflect a broader curriculum, \"Mass Aggie\" had become Massachusetts State College. In 1947, \"Mass State\" became the University of Massachusetts at Amherst.\n\nImmediately after World War II, the university experienced rapid growth in facilities, programs and enrollment, with 4000 students in 1954. By 1964, undergraduate enrollment jumped to 10,500, as Baby Boomers came of age. The turbulent political environment also brought a \"sit-in\" to the newly constructed Whitmore Administration Building. By the end of the decade, the completion of Southwest Residential Complex, the Alumni Stadium and the establishment of many new academic departments gave UMass Amherst much of its modern stature.\n\nIn the 1970s continued growth gave rise to a shuttle bus service on campus as well as several important architectural additions: the Murray D. Lincoln Campus Center, with a hotel, office space, fine dining restaurant, campus store and passageway to a multi-level parking garage; the W.E.B. Du Bois Library, named \"tallest library in the world\" upon its completion in 1973; and the Fine Arts Center, with performance space for world-class music, dance and theater.\n\nThe next two decades saw the emergence of UMass Amherst as a major research facility with the construction of the Lederle Graduate Research Center and the Conte National Polymer Research Center. Other programs excelled as well. In 1996 UMass Basketball became Atlantic 10 Conference champs and went to the NCAA Final Four. Before the millennium, both the William D. 
Mullins Center, a multi-purpose sports and convocation facility, and the Paul Robsham Visitors Center bustled with activity, welcoming thousands of visitors to the campus each year.\n\nUMass Amherst entered the 21st century as the flagship campus of the state's five-campus University system, and enrollment of nearly 24,000 students and a national and international reputation for excellence.")

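I was hoping to do this with the tm package in R, because the DocumentTermMatrix gives a clean matrix representation when dealing with large data. I also use TfIdf weighting to rank keywords across the corpus against the keywords within each entry.

I am stuck: I can use max.col to get the top keyword, but my matrix has several maxima with the same value, and I do not want just the maximum. What I really want is the ten highest values per row. Here is the sample code: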

library(RWeka)
library(tm)
library(koRpus)
library(RKEA)
library(corpora)
library(wordcloud)
library(plyr)

changeindextoname <- function(indexnumber) {
  name <- colnames(z2[indexnumber])
  return(name)
}

removestuff <- function(d) {
  d <- tm_map(d, tolower)
  d <- tm_map(d, removePunctuation)
  d <- tm_map(d, removeNumbers)
  d <- tm_map(d, stripWhitespace)
  d <- tm_map(d, skipWords)
  d <- tm_map(d, removeWords, stopwords('english'))
}

descripcorpora <- Corpus(VectorSource(descriptest))
descripcorpora <- removestuff(descripcorpora)
ddtm <- DocumentTermMatrix(descripcorpora, control = list(weighting = weightTfIdf, stopwords = TRUE))
f2 <- as.data.frame(inspect(ddtm))
z2 <- f2
z3 <- max.col(z2)
dfwithmax <- cbind(z3, z2)
dfwithmax$word <- lapply(dfwithmax$z3, changeindextoname)
finaldf <- subset(dfwithmax, select = c("z3", "word", "learning", "tallest", "center", "seeks", "teaching"))

finaldf looks like this:

finaldf
  z3     word   learning     tallest     center      seeks   teaching
1 106 learning 0.04953008 0.000000000 0.00000000 0.04953008 0.04953008
2 183  tallest 0.00000000 0.000000000 0.00000000 0.00000000 0.00000000
3  35   center 0.00000000 0.007204375 0.04322625 0.00000000 0.00000000

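This approach seems to work, but it does not account for the fact that in row 1 "seeks", "learning", and "teaching" all have the same value.

In addition, max.col returns an index even when every column is zero (as in row 2). How can I get rid of that?

I am trying to avoid looping over columns or rows, because the matrix is very large and that takes a long time.

I would greatly appreciate any suggestions or ideas on how to write a function that can be applied to, or looped over, each column and add the results to a list, so that I can then apply the changeindextoname function and return the column names in a list.

Thanks in advance!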

1 Answer:

Answer 0: (score: 2)

For each document, the five highest values:

apply(as.matrix(ddtm),1,function(x) 
         colnames(as.matrix(ddtm))[order(x,decreasing=TRUE)[1:5]])

  Docs
       1            2            3        
  [1,] "teaching"   "york"       "center" 
  [2,] "seeks"      "year"       "umass"  
  [3,] "learning"   "worlds"     "campus" 
  [4,] "university" "worldclass" "amherst"
  [5,] "research"   "world"      "four"   
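
Note that the second document is empty, so all of its scores are zero and the terms listed for it are arbitrary. To extend this to the top ten and also deal with the ties and all-zero rows raised in the question, one option is a small helper that drops zero scores before taking names. What follows is only a minimal sketch: the toy matrix `m` stands in for `as.matrix(ddtm)` with made-up numbers, and `top_terms` and `max_terms` are hypothetical names, not part of tm. As far as I know, `max.col` breaks ties at random by default (`ties.method = "random"`), which would explain the index returned for the all-zero row.

```r
# Toy TF-IDF-like scores: 3 documents x 5 terms (made-up numbers for illustration).
m <- matrix(
  c(0.05, 0.00, 0.00, 0.05, 0.05,
    0.00, 0.00, 0.00, 0.00, 0.00,
    0.00, 0.01, 0.04, 0.00, 0.00),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs = c("1", "2", "3"),
                  Terms = c("learning", "tallest", "center", "seeks", "teaching"))
)

# Hypothetical helper: names of the n highest-scoring terms in one row.
# Only strictly positive scores are kept, so an all-zero row gives character(0).
top_terms <- function(x, n = 10) {
  idx <- order(x, decreasing = TRUE)  # indices sorted by score, highest first
  idx <- idx[x[idx] > 0]              # drop zero scores
  names(x)[head(idx, n)]
}

# Hypothetical helper: every term tied for the row maximum (empty if the max is 0).
max_terms <- function(x) {
  if (max(x) <= 0) return(character(0))
  names(x)[x == max(x)]
}

# One character vector per document; lengths may differ, hence a list.
top <- setNames(
  lapply(seq_len(nrow(m)), function(i) top_terms(m[i, ], n = 2)),
  rownames(m)
)
tied <- max_terms(m[1, ])
```

On the real data this would be called with `as.matrix(ddtm)` in place of `m` and `n = 10`. It still visits each row once, which is the same cost as the `apply` call above.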

Note that you did not provide the code for skipWords, so I used this:

skipWords <- function(x) removeWords(x, c(stopwords("english")))

See tm_reduce for a way to rewrite the removestuff function:

removestuff <- tm_reduce(x, list(tolower, removePunctuation, ...))
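
The idea behind folding several transformations into one can be illustrated in base R without tm. This is only a sketch of the general technique, not tm's implementation: `strip_punct`, `squeeze_ws`, and `compose_steps` are hypothetical names, the steps work on plain character vectors rather than a corpus, and the order in which `tm_reduce` applies its list should be checked in the tm documentation.

```r
# Hypothetical cleaning steps written as plain character-vector functions.
strip_punct <- function(x) gsub("[[:punct:]]", "", x)
squeeze_ws  <- function(x) trimws(gsub("\\s+", " ", x))

# Fold a list of functions over x, applying them left to right:
# the result of each step becomes the input of the next.
compose_steps <- function(x, steps) {
  Reduce(function(acc, f) f(acc), steps, x)
}

clean <- compose_steps(
  "  The University, in New   York!  ",
  list(tolower, strip_punct, squeeze_ws)
)
```

Keeping the steps in a list makes it easy to add, drop, or reorder a cleaning step without touching the function that applies them.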