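Importing raw data into R

Asked: 2013-09-22 06:46:20

Tags: r import read.table

Can anyone help me import this data into R from a text or .dat file? It is space-delimited, but a city name should not be treated as two separate fields, as in NEW YORK.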

1 NEW YORK  7,262,700
2 LOS ANGELES  3,259,340
3 CHICAGO  3,009,530
4 HOUSTON  1,728,910
5 PHILADELPHIA  1,642,900
6 DETROIT  1,086,220
7 SAN DIEGO  1,015,190
8 DALLAS  1,003,520
9 SAN ANTONIO  914,350
10 PHOENIX  894,070

3 answers:

Answer 0 (score: 4)

For your particular data, where genuine spaces occur only between capital letters, consider using a regular expression:

# Replace a single space that sits between two capital letters with a hyphen
gsub("([A-Z]) ([A-Z]+)", "\\1-\\2", "1 NEW YORK  7,262,700")
# [1] "1 NEW-YORK  7,262,700"
gsub("([A-Z]) ([A-Z]+)", "\\1-\\2", "3 CHICAGO  3,009,530")
# [1] "3 CHICAGO  3,009,530"

The remaining spaces can then be interpreted as field separators.
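
As a minimal sketch of that last remark (not part of the original answer), the hyphenated lines can be passed to read.table through its text argument. The column names and the two sample lines are assumptions chosen for illustration.

```r
# Two sample lines taken from the question
x <- c("1 NEW YORK  7,262,700", "3 CHICAGO  3,009,530")

# Join the words of a city name so that each line has exactly three fields
hyphenated <- gsub("([A-Z]) ([A-Z]+)", "\\1-\\2", x)

# The default sep = "" splits on any run of whitespace
cities <- read.table(text = hyphenated, header = FALSE,
                     col.names = c("rank", "city", "population"),
                     stringsAsFactors = FALSE)

# Restore the spaces in the names and drop the thousands separators
cities$city <- gsub("-", " ", cities$city)
cities$population <- as.numeric(gsub(",", "", cities$population))
```

Note that this pattern joins only two-word names and would also replace any real hyphen in a name when the spaces are restored; neither case occurs in the data shown.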

Answer 1 (score: 4)

A variation on the theme... but first, some sample data:

cat("1 NEW YORK  7,262,700",
    "2 LOS ANGELES  3,259,340",
    "3 CHICAGO  3,009,530",
    "4 HOUSTON  1,728,910",
    "5 PHILADELPHIA  1,642,900",
    "6 DETROIT  1,086,220",
    "7 SAN DIEGO  1,015,190",
    "8 DALLAS  1,003,520",
    "9 SAN ANTONIO  914,350",
    "10 PHOENIX  894,070", sep = "\n", file = "test.txt")

Step 1: Read the data in with readLines.

x <- readLines("test.txt")

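Step 2: Work out a regular expression that can be used to insert delimiters. Here the pattern appears to be (reading from the end of the line) a group of digits and commas, preceded by whitespace, preceded by some words in all uppercase. We can capture those groups and insert tab delimiters (\t). The extra backslashes are there to escape them properly.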

gsub("([A-Z ]+)(\\s?[0-9,]+$)", "\\\t\\1\\\t\\2", x)
#  [1] "1\t NEW YORK  \t7,262,700"     "2\t LOS ANGELES  \t3,259,340" 
#  [3] "3\t CHICAGO  \t3,009,530"      "4\t HOUSTON  \t1,728,910"     
#  [5] "5\t PHILADELPHIA  \t1,642,900" "6\t DETROIT  \t1,086,220"     
#  [7] "7\t SAN DIEGO  \t1,015,190"    "8\t DALLAS  \t1,003,520"      
#  [9] "9\t SAN ANTONIO  \t914,350"    "10\t PHOENIX  \t894,070"  

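Step 3: We know that gsub is working, and we know that read.delim has a "text" argument that can be used instead of the "file" argument, so we can call read.delim directly on the result of gsub:
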
out <- read.delim(text = gsub("([A-Z ]+)(\\s?[0-9,]+$)", "\\\t\\1\\\t\\2", x), 
                  header = FALSE, strip.white = TRUE)
out
#    V1           V2        V3
# 1   1     NEW YORK 7,262,700
# 2   2  LOS ANGELES 3,259,340
# 3   3      CHICAGO 3,009,530
# 4   4      HOUSTON 1,728,910
# 5   5 PHILADELPHIA 1,642,900
# 6   6      DETROIT 1,086,220
# 7   7    SAN DIEGO 1,015,190
# 8   8       DALLAS 1,003,520
# 9   9  SAN ANTONIO   914,350
# 10 10      PHOENIX   894,070

A final step would probably be to convert the third column to numeric:

out$V3 <- as.numeric(gsub(",", "", out$V3))
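
To illustrate why the commas have to be removed first, here is a small sketch on two of the population strings (the variable names are made up for the example): as.numeric cannot parse a string containing thousands separators and returns NA with a warning.

```r
pop <- c("7,262,700", "914,350")

# Direct conversion fails: the commas are not valid in a number
direct <- suppressWarnings(as.numeric(pop))

# Stripping the commas first gives the expected values
cleaned <- as.numeric(gsub(",", "", pop))
```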

Answer 2 (score: 1)

Expanding on @Hugh's answer, I would try the following, although it is not particularly efficient.

# Read one line per element; quiet = TRUE suppresses the "Read N items" message
lines <- scan("cities.txt", sep = "\n", what = "character", quiet = TRUE)
# Join multi-word city names with a hyphen (gsub is vectorised, so no lapply is needed)
lines <- gsub("([a-zA-Z]) ([a-zA-Z]+)", "\\1-\\2", lines)

# Pre-allocate the result
citiesDF <- data.frame(num  = rep(0, length(lines)),
                       city = rep("", length(lines)),
                       population = rep(0, length(lines)),
                       stringsAsFactors = FALSE)

for (i in seq_along(lines)) {
  # Split on runs of spaces: number, hyphenated city, population
  splitted <- strsplit(lines[i], " +")[[1]]
  citiesDF[i, "num"] <- as.numeric(splitted[1])
  citiesDF[i, "city"] <- gsub("-", " ", splitted[2])  # restore the spaces
  citiesDF[i, "population"] <- as.numeric(gsub(",", "", splitted[3]))
}
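
A loop-free variant of the same idea might look like the sketch below. It is not from the original answer: it swaps the hyphen trick for a single pattern with three capture groups, extracted with regexec and regmatches, and it assumes every line has the shape "number, name, population". The sample lines and variable names are chosen for illustration.

```r
# Two sample lines taken from the question
raw <- c("1 NEW YORK  7,262,700", "10 PHOENIX  894,070")

# Groups: leading number, city name (letters and inner spaces), population
pattern <- "^([0-9]+) +([A-Za-z ]*[A-Za-z]) +([0-9,]+)$"

# Each list element holds the full match followed by the three groups;
# a line that does not match would yield character(0) and break the rbind
parts <- do.call(rbind, regmatches(raw, regexec(pattern, raw)))

citiesVec <- data.frame(num = as.numeric(parts[, 2]),
                        city = parts[, 3],
                        population = as.numeric(gsub(",", "", parts[, 4])),
                        stringsAsFactors = FALSE)
```

Because the city name is captured as a whole, this version needs no placeholder character and handles names with more than two words.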