How to programmatically get header information for datasets from the UCI data repository in R

Time: 2012-11-08 17:36:21

Tags: r dataset

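I am trying to collect publicly available datasets from the UCI repository for use in R. I know that many of them are already available through R packages such as mlbench, but I still need several datasets directly from the UCI repository.

Here is a trick I learned: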

url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data"
credit <- read.csv(url, header = FALSE)

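However, this does not retrieve the header (variable name) information. That information is in the accompanying *.names file. Any idea how to get the header information programmatically?

2 answers:

Answer 0 (score: 3)

I suspect you will have to use regular expressions to accomplish this. Here is an ugly but general solution that should work for a variety of *.names files, assuming they are formatted similarly to the one you posted.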

names.file.url <-'http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.names' 
names.file.lines <- readLines(names.file.url)

# get run lengths of consecutive lines containing a colon.
# then find the position of the subgrouping that has a run length 
# equal to the number of columns in credit and sum the run lengths up 
# to that value to get the index of the last line in the names block.
end.of.names <- with(rle(grepl(':', names.file.lines)), 
                       sum(lengths[1:match(ncol(credit), lengths)]))

# extract those lines
names.lines <- names.file.lines[(end.of.names - ncol(credit) + 1):end.of.names]

# extract the names from those lines
names <- regmatches(names.lines, regexpr('(\\w)+(?=:)', names.lines, perl=TRUE))

# [1] "A1"  "A2"  "A3"  "A4"  "A5"  "A6"  "A7"  "A8"  "A9"  "A10" "A11"
# [12] "A12" "A13" "A14" "A15" "A16"
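
The run-length step above is the least obvious part, so here is a minimal sketch of it on a small made-up excerpt. The sample lines and the value `n.cols <- 4` are assumptions for illustration only. They imitate the layout of a *.names file but are not the real contents of crx.names.

```r
# Made-up excerpt in the style of a UCI *.names file (an assumption,
# not the real crx.names)
sample.lines <- c(
  "7.  Attribute Information:",
  "",
  "    A1: b, a.",
  "    A2: continuous.",
  "    A3: continuous.",
  "    A4: +,-   (class attribute)",
  "",
  "8.  Missing Attribute Values:",
  "    A1:  12"
)
n.cols <- 4  # stands in for ncol(credit)

# TRUE for every line that contains a colon
has.colon <- grepl(":", sample.lines)

# run lengths of consecutive TRUE/FALSE values
runs <- rle(has.colon)

# first run whose length equals the number of columns
block <- match(n.cols, runs$lengths)

# summing the run lengths up to that run gives the last line of the block
end.of.names <- sum(runs$lengths[1:block])

names.lines <- sample.lines[(end.of.names - n.cols + 1):end.of.names]
col.names <- regmatches(names.lines,
                        regexpr("(\\w)+(?=:)", names.lines, perl = TRUE))
print(col.names)
```

One caveat worth knowing: `match()` only compares run lengths and does not check whether the run consists of colon lines, so an earlier run of colon-free lines with the same length would be selected instead.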

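Answer 1 (score: 1)

I am guessing that the Attribute Information section of the file you pointed to is where the names are. Here is a very, very dirty solution. I relied on the fact that there is a pattern: each name is followed by a ":", so we use scan to split the text on ":" and then pick the names out of the resulting vector: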

url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data"
credit <- read.csv(url, header = FALSE)
url.names <- "http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.names"
mess <- scan(url.names, what = "character", sep = ":")
# The names are located at every second position from 31 to 61 in the vector
mess.names <- mess[seq(31, 61, 2)]
names(credit) <- mess.names
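
The hard-coded positions 31 to 61 only fit this one file. As a rough alternative, the sketch below picks names by pattern instead of by position. The helper `extract.attribute.names`, the sample lines, and the toy data frame are all hypothetical and only imitate the layout of a *.names file. The sketch assumes each attribute sits on its own indented line of the form `name: description`, which may not hold for other UCI files.

```r
# Made-up excerpt in the style of a UCI *.names file (an assumption,
# not the real crx.names)
sample.lines <- c(
  "7.  Attribute Information:",
  "",
  "    A1:\tb, a.",
  "    A2:\tcontinuous.",
  "    A3:\tu, y, l, t.",
  "",
  "8.  Missing Attribute Values:",
  "    A1:  12"
)

# Hypothetical helper: return the first n.cols names found on lines that
# look like "<indent><name>: ..."
extract.attribute.names <- function(lines, n.cols) {
  pattern <- "^\\s+(\\w+):.*$"
  hits <- grepl(pattern, lines, perl = TRUE)
  candidates <- sub(pattern, "\\1", lines[hits], perl = TRUE)
  if (length(candidates) < n.cols) {
    stop("found fewer candidate names than columns")
  }
  candidates[seq_len(n.cols)]
}

# Toy data frame standing in for `credit`
toy <- data.frame(V1 = c("b", "a"), V2 = c(30.83, 58.67), V3 = c("u", "y"))
names(toy) <- extract.attribute.names(sample.lines, ncol(toy))
print(names(toy))
```

Taking only the first `n.cols` matches is what keeps the later "Missing Attribute Values" lines out of the result, so this relies on the attribute block coming first in the file.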