List and describe all packages on CRAN from within R

Time: 2012-07-19 12:25:54

Tags: r

I can get a list of all available packages with this function:

ap <- available.packages()

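But how can I get the descriptions of these packages from within R, so that I end up with a data.frame with two columns: package and description?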

3 answers:

Answer 0: (score: 15)

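I actually think you want "Package" and "Title", because "Description" can run to several lines. So here is the former; if you really want "Description", just put "Description" in the final subset: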

R> ## from http://developer.r-project.org/CRAN/Scripts/depends.R and adapted
R>
R> require("tools")
R>
R> getPackagesWithTitle <- function() {
+     contrib.url(getOption("repos")["CRAN"], "source") 
+     description <- sprintf("%s/web/packages/packages.rds", 
+                            getOption("repos")["CRAN"])
+     con <- if(substring(description, 1L, 7L) == "file://") {
+         file(description, "rb")
+     } else {
+         url(description, "rb")
+     }
+     on.exit(close(con))
+     db <- readRDS(gzcon(con))
+     rownames(db) <- NULL
+
+     db[, c("Package", "Title")]
+ }
R>
R>
R> head(getPackagesWithTitle())               # I shortened one Title here...
     Package              Title
[1,] "abc"                "Tools for Approximate Bayesian Computation (ABC)"
[2,] "abcdeFBA"           "ABCDE_FBA: A-Biologist-Can-Do-Everything of Flux ..."
[3,] "abd"                "The Analysis of Biological Data"
[4,] "abind"              "Combine multi-dimensional arrays"
[5,] "abn"                "Data Modelling with Additive Bayesian Networks"
[6,] "AcceptanceSampling" "Creation and evaluation of Acceptance Sampling Plans"
R>
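
The final subsetting step above can be tried without a network connection. The following is only a sketch under stated assumptions: `package_table` is a hypothetical helper name, and `toy_db` is a made-up stand-in for the real database with the same kind of column names. It selects "Package" plus one text field and collapses multi-line text so each entry fits on one line.

```r
# Hypothetical helper (not part of any package): build a two-column
# data.frame from a package database that has a "Package" column.
package_table <- function(db, field = "Title") {
    stopifnot(all(c("Package", field) %in% colnames(db)))
    out <- data.frame(Package = as.character(db[, "Package"]),
                      Text = as.character(db[, field]),
                      stringsAsFactors = FALSE)
    # "Description" can span several lines; collapse runs of whitespace
    out$Text <- gsub("\\s+", " ", out$Text)
    names(out)[2] <- field
    out
}

# Toy stand-in for the real database (made-up rows, illustration only)
toy_db <- cbind(Package = c("pkgA", "pkgB"),
                Title = c("First toy package", "Second toy package"),
                Description = c("A longer text\nover two lines.", "Short."))

package_table(toy_db, "Description")
```

Because the helper indexes columns by name, it should accept either a character matrix or a data frame, as long as the named columns exist.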

Answer 1: (score: 7)

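Dirk has provided a terrific answer. I saw it only after finishing my own solution, and I debated for a while whether to post mine for fear of looking silly. I decided to post it anyway for two reasons:

  1. It may be informative to beginning scrapers like me.
  2. It took me a while to do, so why not :).

I approached this thinking I would need to do some web scraping, and I chose crantastic as the site to scrape. First I'll provide the code, and then two resources that have helped me: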

    library(RCurl)
    library(XML)
    
    URL <- "http://cran.r-project.org/web/checks/check_summary.html#summary_by_package"
    packs <- na.omit(XML::readHTMLTable(doc = URL, which = 2, header = TRUE, 
        strip.white = TRUE, as.is = FALSE, sep = ",", na.strings = c("999", 
            "NA", " "))[, 1])
    Trim <- function(x) {
        gsub("^\\s+|\\s+$", "", x)
    }
    packs <- unique(Trim(packs))
    u1 <- "http://crantastic.org/packages/"
    len.samps <- 10 #for demo purpose; use:
    #len.samps <- length(packs) # for all of them
    URL2 <- paste0(u1, packs[seq_len(len.samps)]) 
    scraper <- function(urls){ #function to grab description
        doc   <- htmlTreeParse(urls, useInternalNodes=TRUE)
        nodes <- getNodeSet(doc, "//p")[[3]]
        return(nodes)
    }
    info <- sapply(seq_along(URL2), function(i) try(scraper(URL2[i]), TRUE))
    info2 <- sapply(info, function(x) { #replace errors with NA
            if(class(x)[1] != "XMLInternalElementNode"){
                NA
            } else {
                Trim(gsub("\\s+", " ", xmlValue(x)))
            }
        }
    )
    pack_n_desc <- data.frame(package=packs[seq_len(len.samps)], 
        description=info2) #make a dataframe of it all
    

Resources:

  1. talkstats.com thread on web scraping (great beginner examples)
  2. w3schools.com site on html stuff (very helpful)
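
The "replace errors with NA" step in the scraper above can also be folded into the loop itself. This is a minimal sketch, not the answer's method: `safe_map_chr` and `fake_scraper` are hypothetical names, and `fake_scraper` simulates a failing page instead of touching the network.

```r
# Apply `f` to each element of `x`; return NA_character_ where `f` fails.
# `safe_map_chr` is a hypothetical name, not part of any package.
safe_map_chr <- function(x, f) {
    vapply(x, function(el) {
        tryCatch(as.character(f(el))[1],
                 error = function(e) NA_character_)
    }, character(1), USE.NAMES = FALSE)
}

# Stand-in for a scraper: fails on one input instead of hitting the network
fake_scraper <- function(pkg) {
    if (pkg == "missing") stop("page not found")
    paste("Description of", pkg)
}

safe_map_chr(c("abc", "missing", "abind"), fake_scraper)
```

Using `vapply` with `character(1)` guarantees a plain character vector, so the result can go straight into `data.frame()` without a second cleanup pass.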

Answer 2: (score: 1)

I wanted to try doing this with an HTML scraper (rvest), because the available.packages() in the OP does not contain the package Descriptions.

library('rvest')
url <- 'https://cloud.r-project.org/web/packages/available_packages_by_name.html'
webpage <- read_html(url)
data_html <- html_nodes(webpage,'tr td')
length(data_html)

P1 <- html_nodes(webpage,'td:nth-child(1)') %>% html_text(trim=TRUE)  # XML: The Package Name
P2 <- html_nodes(webpage,'td:nth-child(2)') %>% html_text(trim=TRUE)  # XML: The Description
P1 <- P1[lengths(P1) > 0 & P1 != ""]  # Remove NULL and empty ("") items
length(P1); length(P2);

mdf <- data.frame(P1, P2, row.names=NULL)
colnames(mdf) <- c("PackageName", "Description")

# This is the problem! It lists large sets column-by-column,
# instead of row-by-row. Try with the full list to see what happens.
print(mdf, right=FALSE, row.names=FALSE)

# PackageName Description                                                             
# A3          Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModels
# abbyyR      Access to Abbyy Optical Character Recognition (OCR) API                 
# abc         Tools for Approximate Bayesian Computation (ABC)                        
# abc.data    Data Only: Tools for Approximate Bayesian Computation (ABC)             
# ABC.RAP     Array Based CpG Region Analysis Pipeline                                
# ABCanalysis Computed ABC Analysis

# For small sets we can use either:
# mdf[1:6,] #or# head(mdf, 6)

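However, although this works well for small array/data frame listings (subsets), I ran into a display problem with the whole list, where the data is shown either column by column or unaligned. I would be happy to have this paged and properly formatted in a new window somehow. I tried using page, but couldn't get it to work well.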
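
One possible contributor to the misalignment is text containing embedded newlines (such as the "Predictive\nModels" entry above) or very long strings. The sketch below assumes that is the cause, which may not be the whole story. `one_line` is a hypothetical helper that collapses whitespace and truncates each entry to a fixed width before printing.

```r
# Hypothetical helper: make each text entry fit on one printed line.
one_line <- function(x, width = 60) {
    x <- trimws(gsub("\\s+", " ", x))        # embedded newlines -> spaces
    long <- !is.na(x) & nchar(x) > width     # skip NA entries
    x[long] <- paste0(substr(x[long], 1, width - 3), "...")
    x
}

# Two rows taken from the output shown above
mdf <- data.frame(
    PackageName = c("A3", "abc"),
    Description = c("Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModels",
                    "Tools for Approximate Bayesian Computation (ABC)"),
    stringsAsFactors = FALSE)

mdf$Description <- one_line(mdf$Description, width = 40)
print(mdf, right = FALSE, row.names = FALSE)
```

Truncation loses information, so the full text is better kept in a separate column if it is needed later.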


EDIT: The recommended method is not the one above, but Dirk's suggestion (from the comments below):

db <- tools::CRAN_package_db()
colnames(db)
mdf <- db[, c("Package", "Description")]  # select columns by name rather than by position
print(mdf, right=FALSE, row.names=FALSE)

However, this still suffers from the display problem mentioned above...
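
For the paging wish, one option is to write an aligned plain-text version to a file and open that file in R's pager. This is a sketch under assumptions: `write_aligned` is a hypothetical helper, and the toy rows reuse two packages listed earlier in this thread. Whether the pager opens a new window depends on the platform and front end.

```r
# Hypothetical helper: write a two-column table as aligned plain text.
write_aligned <- function(df, path) {
    pkg <- format(as.character(df[[1]]))            # pad names to equal width
    txt <- gsub("\\s+", " ", as.character(df[[2]])) # one line per entry
    writeLines(paste(pkg, txt, sep = "  "), path)
    invisible(path)
}

toy <- data.frame(Package = c("abc", "abind"),
                  Description = c("Tools for Approximate Bayesian Computation (ABC)",
                                  "Combine multi-dimensional arrays"),
                  stringsAsFactors = FALSE)

out <- write_aligned(toy, tempfile(fileext = ".txt"))
readLines(out)
# In an interactive session, file.show(out) displays the file in R's pager.
```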