First observation has no row number

Date: 2018-08-19 00:14:50

Tags: r

I am reading in a dataset that looks like this:

[screenshot of the dataset, image not preserved]

My code is as follows:

NatPark <- read.delim (paste0(dirdata,"NatPark_Plus.dat"),
                  header= TRUE, 
                  sep = "\t",
                  quote = "\"",
                  dec = ".",
                  fill = TRUE,
                  as.is = c("ParkName", "State"))

Then I get the following warnings:

  

Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
  incomplete final line found by readTableHeader on '/Volumes/Elements/STAT_611/611/DATA/DATA11/NatPark_Plus.dat'
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
  not all columns named in 'as.is' exist

So I changed "header = TRUE" to "header = FALSE", as follows:

   NatPark <- read.delim (paste0(dirdata,"NatPark_Plus.dat"),
                    header= FALSE, 
                      sep = "\t",
                      quote = "\"",
                      dec = ".",
                      fill = TRUE,
                      as.is = c("ParkName", "State"))

I get the same warning messages:

  

Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
  incomplete final line found by readTableHeader on '/Volumes/Elements/STAT_611/611/DATA/DATA11/NatPark_Plus.dat'
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
  not all columns named in 'as.is' exist

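This time all the row numbers show up, as shown below. However, I don't understand the output of str(NatPark). What is "V1"? And what is "4 1 5 2 3"? Any suggestions would be appreciated. Thanks!

[screenshot, image not preserved]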

2 Answers:

Answer 0: (score: 1)

I haven't worked much with .dat files, but if you can share a download link, I can help troubleshoot further. For now, I can offer these insights:

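  • V1 (and V2, V3, V4, ...) are the column names R assigns automatically when there is no header. Since there is only V1, R sees just one column with the current settings.

  • The "4 1 5 2 3" you see in the str output are the integer level codes of that factor variable (in this case, each whole line is treated as a single value). By default, R sorts factor levels alphabetically. An example from the iris dataset should help clarify: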

str(iris)
#> 'data.frame':    150 obs. of  5 variables:
#>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris$Species)
#> [1] setosa setosa setosa setosa setosa setosa
#> Levels: setosa versicolor virginica
levels(iris$Species)
#> [1] "setosa"     "versicolor" "virginica"

Created on 2018-08-18 by the reprex package (v0.2.0).

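You can see that the value setosa is coded as 1 because it is the first level, versicolor is 2, and virginica is 3. This should all be a moot point, though, since you don't want each whole line read in as a single variable.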
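The same effect can be reproduced on park data. The following is a minimal sketch, not the asker's actual file: the five lines are modelled on the rows shown later in this thread, and the two-space separators are an assumption. Because no line contains a tab, `sep = "\t"` produces a single column named V1. `stringsAsFactors = TRUE` is set explicitly because it was the default only before R 4.0.0.

```r
# Hypothetical lines modelled on the data in this thread;
# the real file's exact whitespace is not known.
lines <- c(
  "Yellowstone  ID/MT/WY  1872  4,065,493",
  "Everglades  FL  1934  1,398,800",
  "Yosemite  CA  1864  760,917",
  "Great Smoky Mountains  NC/TN  1926  520,269",
  "Wolf Trap Farm  VA  1966  130"
)

# No tab in any line, so sep = "\t" yields one column per line.
# stringsAsFactors = TRUE mimics the default of R versions before 4.0.0.
one_col <- read.delim(text = lines, header = FALSE, sep = "\t",
                      quote = "\"", stringsAsFactors = TRUE)

names(one_col)          # "V1": the automatically assigned column name
levels(one_col$V1)      # the five lines, sorted alphabetically
as.integer(one_col$V1)  # 4 1 5 2 3: position of each row's value among the levels
```

Yellowstone is the fourth level alphabetically and Everglades the first, which is where the leading "4 1" comes from.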

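Answer 1: (score: 1)

As for your main problem, I was able to put together a custom function that parses just your data. In the future, if you have the option of quoting the text in the source data, things will probably be much simpler. Anyway, hope this works for you! You just need to set the column names and then change some columns from character to numeric.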
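To illustrate the remark about quoting, here is a minimal sketch that assumes a hypothetical version of the file in which the park names are wrapped in double quotes. With `sep = ""`, `read.table` splits on any run of whitespace, and the quotes keep multi-word names together.

```r
# Hypothetical quoted rows; the file in the question is NOT quoted like this.
quoted <- c(
  '"Great Smoky Mountains" NC/TN 1926 520,269',
  '"Wolf Trap Farm"        VA    1966 130'
)

# sep = "" splits on any amount of whitespace; quote = "\"" protects
# the spaces inside the quoted park names.
parks <- read.table(text = quoted, header = FALSE, sep = "",
                    quote = "\"", stringsAsFactors = FALSE)

parks$V1     # "Great Smoky Mountains" "Wolf Trap Farm"
ncol(parks)  # 4
```

The file in the question is not quoted, so the custom parsing below is still needed for it.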

library(tidyverse)
library(stringr)

directory <- "/Users/jas/Desktop"
filename <- "NatPark_Plus.dat"
file <- file.path(directory, filename)

# tabs
data <- read.delim(file, header = FALSE, sep = "\t")
#> Warning in read.table(file = file, header = header, sep = sep, quote =
#> quote, : incomplete final line found by readTableHeader on '/Users/jas/
#> Desktop/NatPark_Plus.dat'

# We have 5 records, but the spacing amongst them is uneven and some words with spaces

text <- data$V1

# Parse text to make same number of columns - 4
# Creates a separate dataframe for each row
parse_text_to_df <- function(x) {
  # Find more than one spaces and replace with tab
  x <- gsub("[ ]{2,}", "\t", x)
  # replace remaining space with tab (cannot use comma since numbers have comma)
  x <- gsub(" ", "\t", x)
  # Should be only 3 tabs on each line - WORKS FOR THIS DATASET ONLY
  total_tabs <- stringr::str_count(x, "\t")
  # If a name contains spaces, turn the extra leading tabs back into spaces
  if (total_tabs[1] > 3) {
    num_tabs_to_remove <- total_tabs - 3
    # seq_len() loops exactly num_tabs_to_remove times;
    # range(n) returns c(n, n), which would always loop twice
    for (i in seq_len(num_tabs_to_remove)) {
      x <- sub("\t", " ", x)
    }
  }
  # Convert to an object that can be read back into a dataframe
  x <- readLines(textConnection(x))
  df <- read.delim(text = x, header = FALSE, sep = "\t") %>%
    mutate_all(as.character)
  return(df)
}

# Combine each of the 1 row dataframes into one dataframe (all character vectors)
df <- text %>% map_df(parse_text_to_df)
df
#>                      V1       V2   V3        V4
#> 1           Yellowstone ID/MT/WY 1872 4,065,493
#> 2            Everglades       FL 1934 1,398,800
#> 3              Yosemite       CA 1864   760,917
#> 4 Great Smoky Mountains    NC/TN 1926   520,269
#> 5        Wolf Trap Farm       VA 1966       130

Created on 2018-08-18 by the reprex package (v0.2.0).
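The two remaining steps mentioned in this answer, naming the columns and converting some of them to numeric, might be sketched as follows. The data frame is rebuilt by hand here so the example runs on its own. `ParkName` and `State` come from the `as.is` argument in the question; `Year` and `Acres` are assumed labels, since the thread never names those columns.

```r
# Stand-in for the all-character data frame produced by the parsing step
df <- data.frame(
  V1 = c("Yellowstone", "Everglades", "Yosemite",
         "Great Smoky Mountains", "Wolf Trap Farm"),
  V2 = c("ID/MT/WY", "FL", "CA", "NC/TN", "VA"),
  V3 = c("1872", "1934", "1864", "1926", "1966"),
  V4 = c("4,065,493", "1,398,800", "760,917", "520,269", "130"),
  stringsAsFactors = FALSE
)

# ParkName and State are taken from the question; Year and Acres are assumptions
names(df) <- c("ParkName", "State", "Year", "Acres")

df$Year <- as.integer(df$Year)
# Strip thousands separators (and any stray spaces) before converting
df$Acres <- as.numeric(gsub("[, ]", "", df$Acres))

str(df)
```

Removing the commas first matters: calling `as.numeric("4,065,493")` directly returns NA with a warning.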