I am reading in a dataset that looks like this:
My code is as follows:
NatPark <- read.delim(paste0(dirdata, "NatPark_Plus.dat"),
                      header = TRUE,
                      sep = "\t",
                      quote = "\"",
                      dec = ".",
                      fill = TRUE,
                      as.is = c("ParkName", "State"))
Then I get the following warnings:
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
  incomplete final line found by readTableHeader on '/Volumes/Elements/STAT_611/611/DATA/DATA11/NatPark_Plus.dat'
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
  not all columns named in 'as.is' exist
So I changed "header = TRUE" to "header = FALSE", as follows:
NatPark <- read.delim(paste0(dirdata, "NatPark_Plus.dat"),
                      header = FALSE,
                      sep = "\t",
                      quote = "\"",
                      dec = ".",
                      fill = TRUE,
                      as.is = c("ParkName", "State"))
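I get the same warning messages:
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
  incomplete final line found by readTableHeader on '/Volumes/Elements/STAT_611/611/DATA/DATA11/NatPark_Plus.dat'
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
  not all columns named in 'as.is' exist
This time all the rows show up, as shown below. However, I do not understand the output of str(NatPark). What is "V1"? And what is "4 1 5 2 3"? Any suggestions are appreciated. Thank you!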
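Answer 0: (score: 1)
I have not worked with .dat files much, but if you can share a download link, I can help with further troubleshooting. For now, I can offer these insights: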
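V1 (along with V2, V3, V4, ...) is the column name R assigns automatically when there is no header. Since you only have V1, R evidently thinks the file has just one column under the current settings. That is also why the second warning appears: there are no columns named ParkName or State for as.is to match.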
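To see this naming behaviour in isolation, here is a minimal sketch that uses two made-up lines instead of the real file. The sample text is an assumption, since the original file was not shared; it simply imitates fields separated by spaces rather than tabs, so sep = "\t" finds nothing to split on.

```r
# Two made-up lines that use spaces, not tabs, between fields
# (an assumption about what the real file looks like).
sample_lines <- c("Everglades   FL   1934", "Yosemite   CA   1864")

# With header = FALSE, R names the columns V1, V2, ... itself.
# No tab is present, so each whole line lands in the single column V1.
one_col <- read.delim(text = sample_lines, header = FALSE, sep = "\t",
                      stringsAsFactors = FALSE)
names(one_col)
ncol(one_col)

# The same lines with real tabs are split into V1, V2 and V3.
tabbed <- gsub(" {2,}", "\t", sample_lines)
three_col <- read.delim(text = tabbed, header = FALSE, sep = "\t",
                        stringsAsFactors = FALSE)
names(three_col)
```

If names(one_col) shows only V1 for your file as well, the separator is the thing to check first.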
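The "4 1 5 2 3" you see in the str output are the integer level codes, because V1 was read as a factor variable (in this case each whole line is treated as one value of a single variable). By default, R sorts the levels of a factor alphabetically and numbers them in that order. An example from the iris dataset should help clarify: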
str(iris)
#> 'data.frame': 150 obs. of 5 variables:
#> $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#> $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#> $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#> $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#> $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris$Species)
#> [1] setosa setosa setosa setosa setosa setosa
#> Levels: setosa versicolor virginica
levels(iris$Species)
#> [1] "setosa" "versicolor" "virginica"
Created on 2018-08-18 by the reprex package (v0.2.0).
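You can see that the value setosa is coded as 1 because it is the first level, versicolor is 2, and virginica is 3. However, this should all be a moot point, because you do not want to read each whole line in as a single variable.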
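The same logic reproduces the "4 1 5 2 3" from the question. The sketch below assumes that each line of the file starts with the park name (the names are taken from the parsed output in the second answer), so sorting the full lines gives the same order as sorting the names.

```r
# Park names in the order they appear in the file
# (taken from the parsed output in this thread).
parks <- c("Yellowstone", "Everglades", "Yosemite",
           "Great Smoky Mountains", "Wolf Trap Farm")

park_factor <- factor(parks)

# Levels are stored in sorted order: Everglades, Great Smoky Mountains, ...
levels(park_factor)

# Each value is stored as the position of its level in that sorted list
as.integer(park_factor)
#> [1] 4 1 5 2 3
```

Yellowstone is the fourth level alphabetically, Everglades the first, and so on, which is exactly the sequence str() printed.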
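Answer 1: (score: 1)
As for your main question, I was able to put together a custom function that parses just your data. In the future, if you have the option of quoting the text fields in the source data, things will probably be much simpler. In any case, I hope this works for you! All that remains is to set the column names and convert some columns from character to numeric.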
library(tidyverse)  # provides %>%, mutate_all() and map_df()
library(stringr)

directory <- "/Users/jas/Desktop"
filename <- "NatPark_Plus.dat"
file <- file.path(directory, filename)

# The file contains no real tabs, so each line is read into the single column V1.
# stringsAsFactors = FALSE keeps V1 as plain text instead of a factor.
data <- read.delim(file, header = FALSE, sep = "\t", stringsAsFactors = FALSE)
#> Warning in read.table(file = file, header = header, sep = sep, quote =
#> quote, : incomplete final line found by readTableHeader on '/Users/jas/
#> Desktop/NatPark_Plus.dat'

# We have 5 records, but the spacing between fields is uneven and some names contain spaces
text <- data$V1

# Parse one line of text into a one-row data frame with 4 columns
parse_text_to_df <- function(x) {
  # Replace runs of two or more spaces with a tab
  x <- gsub("[ ]{2,}", "\t", x)
  # Replace the remaining single spaces with a tab (a comma cannot be used since the numbers contain commas)
  x <- gsub(" ", "\t", x)
  # There should be exactly 3 tabs on each line - WORKS FOR THIS DATASET ONLY
  total_tabs <- stringr::str_count(x, "\t")
  # Names that contain spaces now have extra tabs; turn the leading extra tabs back into spaces
  if (total_tabs[1] > 3) {
    num_tabs_to_remove <- total_tabs[1] - 3
    # seq_len(n) loops exactly n times; range(n) would always loop twice
    for (i in seq_len(num_tabs_to_remove)) {
      x <- sub("\t", " ", x)
    }
  }
  # Read the tab-separated line back in as a data frame (all columns as character)
  df <- read.delim(text = x, header = FALSE, sep = "\t") %>%
    mutate_all(as.character)
  return(df)
}

# Combine the one-row data frames into a single data frame (all character vectors)
df <- text %>% map_df(parse_text_to_df)
df
#>                      V1       V2   V3        V4
#> 1           Yellowstone ID/MT/WY 1872 4,065,493
#> 2            Everglades       FL 1934 1,398,800
#> 3              Yosemite       CA 1864   760,917
#> 4 Great Smoky Mountains    NC/TN 1926   520,269
#> 5        Wolf Trap Farm       VA 1966       130
Created on 2018-08-18 by the reprex package (v0.2.0).
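The remaining step mentioned above (naming the columns and converting text to numbers) could look roughly like this. It is only a sketch: the data frame is rebuilt by hand from the printed output so the example runs on its own, ParkName and State come from the as.is argument in the question, and Year and Area are assumed names, since the thread never says what the last two columns are called.

```r
# Rebuild the parsed result printed above so this sketch is self-contained
df <- data.frame(
  V1 = c("Yellowstone", "Everglades", "Yosemite",
         "Great Smoky Mountains", "Wolf Trap Farm"),
  V2 = c("ID/MT/WY", "FL", "CA", "NC/TN", "VA"),
  V3 = c("1872", "1934", "1864", "1926", "1966"),
  V4 = c("4,065,493", "1,398,800", "760,917", "520,269", "130"),
  stringsAsFactors = FALSE
)

# ParkName and State are from the question; Year and Area are assumed names
names(df) <- c("ParkName", "State", "Year", "Area")

# The third column holds whole numbers stored as text
df$Year <- as.integer(df$Year)

# Remove the thousands separators before converting to numeric
df$Area <- as.numeric(gsub(",", "", df$Area, fixed = TRUE))

str(df)
```

Calling as.numeric() directly on a value such as "4,065,493" would give NA with a warning, which is why the commas are stripped first.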