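I cannot successfully load the data contained in the Qld + 20-34 + Age + Groups.zip file, which is located at
https://github.com/SuperSi2217/datasample
I opened the file in a text editor and removed the header and footer lines I did not need. I tried various read_csv and read.csv combinations to import it, but an extra column filled with NAs is always added at the end of the dataset. I also tried converting it to a text file and using read_delim and read.table, but I still get the same problem.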
df <- read_delim("C:/Qld 20-34 Age Groups Clean.txt", col_names = FALSE, quote = "\"", na = c("", "NA"), delim = ",")
Parsed with column specification:
cols(
X1 = col_character(),
X2 = col_character(),
X3 = col_integer(),
X4 = col_integer(),
X5 = col_integer(),
X6 = col_integer(),
X7 = col_character()
)
Warning: 1 parsing failure.
# A tibble: 1 x 5
      row   col  expected    actual
    <int> <chr>     <chr>     <chr>
1 1423530  <NA> 7 columns 6 columns
# ... with 1 more variables: file <chr>
df <- read_delim("C:/Qld 20-34 Age Groups Clean.txt", delim = ",", col_names = FALSE, quote = "\"", na = c("", "NA"))
Parsed with column specification:
cols(
X1 = col_character(),
X2 = col_character(),
X3 = col_integer(),
X4 = col_integer(),
X5 = col_integer(),
X6 = col_integer(),
X7 = col_character()
)
|========================================================| 100% 29 MB
df <- read_csv("C:/qldtest.csv", col_names = TRUE)
Parsed with column specification:
cols(
X1 = col_character(),
X2 = col_character(),
X6 = col_integer()
)
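The above imports the data, but with an extra column. When I try to work with the result, strange things happen (see below). To get the three columns I need, I use...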
df <- df %>%
select(X1, X2, X6)
Ultimately I need the data to look like...
X1 | X2 | X6
----------|----------------|------
Abbotsbury|4032,QLD |0
na |4033,QLD |0
na |4034,QLD |10
na |4035,QLD |0
Smith Town|4032,QLD |0
na |4033,QLD |220
na |4034,QLD |0
na |4035,QLD |0
Then I ran...
transform(df, X1 = na.locf(Suburb))
...to fill in the last known value in the first column, so that it becomes...
X1 | X2 | X6
----------|----------------|------
Abbotsbury|4032,QLD |0
Abbotsbury|4033,QLD |0
Abbotsbury|4034,QLD |10
Abbotsbury|4035,QLD |0
Smith Town|4032,QLD |0
Smith Town|4033,QLD |220
Smith Town|4034,QLD |0
Smith Town|4035,QLD |0
This works, but with the following warnings...
+ transform(df, X1 = na.locf(df))
Warning messages:
1: In is.na(object) :
is.na() applied to non-(list or vector) of type 'NULL'
2: In is.na(object[1L]) :
is.na() applied to non-(list or vector) of type 'NULL'
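That said, the data looks correct.
However, when I run the following to select only the records where column X6 is > 0, R appears to add another four columns, yet the number of variables shown in the Global Environment still says 3??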
df1 <- df %>%
filter(X6 > 0)
...which looks like
X1.X1.X1 |X1.X1.X2|X1.X1.X6|X1.X2 |X1.X6|X2 |X6
----------|--------|--------|--------|-----|--------|--
Abbotsbury|4613,QLD|3 |4613,QLD|3 |4613,QLD|3
What am I doing wrong? Any help is appreciated.
The first few lines of the file look like the attached image.
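Answer 0: (score: 1)
If you open the file in a text editor such as Sublime, you will see that every line ends with a comma.
That trailing comma is read as one more (empty) field, which is why there is an extra column.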
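A minimal sketch of the effect, using base R's read.csv on a small in-memory sample rather than the real file. The sample rows are made up for illustration; only the trailing comma mirrors the actual data.

```r
# Two made-up rows, each ending in a trailing comma (as in the real file).
sample_lines <- c(
  "Abbeywood,\"4000, QLD\",0,0,0,0,",
  ",\"4005, QLD\",0,0,0,0,"
)

# The trailing comma is read as a seventh, empty field.
demo <- read.csv(text = sample_lines, header = FALSE, na.strings = c("", "NA"))

ncol(demo)            # 7 columns instead of the expected 6
all(is.na(demo$V7))   # the extra column holds only NA

# Dropping the empty column leaves the six real ones.
demo$V7 <- NULL
```

Base R names the columns V1 to V7 where readr would use X1 to X7; the behaviour is otherwise the same.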
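I don't think you need the information above the data, so I suggest reading the file with skip = 11. Since there is a disclaimer below the data, you can use n_max to exclude it by limiting the number of rows that are read.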
library(readr)
file <- "C:/Qld 20-34 Age Groups Clean.txt"
df <- read_delim(file, col_names = FALSE, quote = "\"", na = c("", "NA"),
delim = ",", skip = 11, n_max = 1423540)
df$X7 <- NULL
head(df, n = 5)
# A tibble: 5 x 6
X1 X2 X3 X4 X5 X6
<chr> <chr> <int> <int> <int> <int>
1 Abbeywood 4000, QLD 0 0 0 0
2 <NA> 4005, QLD 0 0 0 0
3 <NA> 4006, QLD 0 0 0 0
4 <NA> 4007, QLD 0 0 0 0
5 <NA> 4008, QLD 0 0 0 0
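The combined effect of skipping leading lines and capping the number of rows can be checked on a small in-memory sample. This sketch uses base R's read.csv, where nrows plays the role of readr's n_max; the sample lines and the counts (2 skipped, 3 read) are made up and are not the values needed for the real file.

```r
# Made-up stand-in for the real file: 2 title lines, 3 data rows, 1 disclaimer line.
sample_lines <- c(
  "Some report title",
  "Counting: persons",
  "Abbeywood,\"4000, QLD\",0,0,0,0,",
  ",\"4005, QLD\",0,0,0,0,",
  ",\"4006, QLD\",0,0,0,0,",
  "Disclaimer text that is not data"
)

# skip drops the leading title lines; nrows stops reading before the disclaimer.
demo <- read.csv(text = sample_lines, header = FALSE, skip = 2, nrows = 3,
                 na.strings = c("", "NA"))
demo$V7 <- NULL   # remove the empty column produced by the trailing comma

dim(demo)         # rows and columns that were kept
```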
To replace each NA with the most recent non-NA value, you can use
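library(dplyr)  # provides %>% and mutate()
library(zoo)    # provides na.locf()
df <- df %>%
  mutate(X1 = na.locf(X1, na.rm = FALSE))  # na.rm = FALSE keeps the column length unchanged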
head(df, n = 5)
# A tibble: 5 x 6
X1 X2 X3 X4 X5 X6
<chr> <chr> <int> <int> <int> <int>
1 Abbeywood 4000, QLD 0 0 0 0
2 Abbeywood 4005, QLD 0 0 0 0
3 Abbeywood 4006, QLD 0 0 0 0
4 Abbeywood 4007, QLD 0 0 0 0
5 Abbeywood 4008, QLD 0 0 0 0
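For reference, the fill-down that na.locf performs can be sketched in base R without zoo. The helper name fill_down and the sample vector are made up for illustration; this is not zoo's implementation, only an equivalent for simple vectors.

```r
# Hypothetical helper: carry the last non-NA value forward (base R only).
fill_down <- function(x) {
  # Index of the most recent non-NA element at each position (0 if none yet).
  idx <- cummax(seq_along(x) * (!is.na(x)))
  idx[idx == 0] <- NA   # leading NAs have nothing to carry forward
  x[idx]
}

suburb <- c("Abbotsbury", NA, NA, NA, "Smith Town", NA, NA, NA)
fill_down(suburb)
```

Unlike na.locf with its default na.rm = TRUE, this helper keeps any leading NAs, so the result always has the same length as the input and can be assigned straight back to a data frame column.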
Answer 1: (score: 0)
What about skipping only the first 9 lines and using the file's own header?
Something like this:
jnk <-
read.csv('~/Downloads/Qld 20-34 Age Groups.csv', skip=9, stringsAsFactors=FALSE)
You can take a look at it with
summary(jnk)
Your df %>% filter(X6 > 0) command, for example, would then look like this
head(jnk %>% filter(Total > 0))
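A small base-R sketch of this approach on made-up data. Only the Total column name comes from the command above; the other column names, the sample values, and the number of lines skipped are assumptions for illustration.

```r
# Made-up stand-in: 1 title line, then a header row, then data rows.
sample_lines <- c(
  "Some report title",
  "Suburb,Postcode,Total",
  "Abbotsbury,\"4032, QLD\",0",
  ",\"4034, QLD\",10",
  "Smith Town,\"4033, QLD\",220"
)

# Skip the title line and let the next line supply the column names.
jnk_demo <- read.csv(text = sample_lines, skip = 1, stringsAsFactors = FALSE,
                     na.strings = c("", "NA"))

# Base-R equivalent of jnk_demo %>% filter(Total > 0).
positive <- jnk_demo[jnk_demo$Total > 0, ]
positive
```

Keeping the file's header means the columns can be referred to by name instead of X1, X2 and so on.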
Or am I missing some key point of the question?