In R

Time: 2017-10-10 04:49:20

Tags: r

I cannot successfully load the data contained in the Qld+20-34+Age+Groups.zip file, which is located at...

https://github.com/SuperSi2217/datasample

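I opened the file in a text editor and removed the unneeded header and footer lines. I have tried various combinations of read_csv and read.csv to import it, but each one adds an extra column, filled with NAs, at the end of the dataset. I also tried converting it to a text file and using read_delim and read.table, but I still get the same problem.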

df <- read_delim("C:/Qld 20-34 Age Groups Clean.txt", col_names = FALSE, quote = "\"", na = c("", "NA"), delim = ",")
Parsed with column specification:
cols(
  X1 = col_character(),
  X2 = col_character(),
  X3 = col_integer(),
  X4 = col_integer(),
  X5 = col_integer(),
  X6 = col_integer(),
  X7 = col_character()
)
Warning: 1 parsing failure.
# A tibble: 1 x 5
      row   col  expected    actual
    <int> <chr>     <chr>     <chr>
1 1423530  <NA> 7 columns 6 columns
# ... with 1 more variables: file <chr>

df <- read_delim("C:/Qld 20-34 Age Groups Clean.txt", delim = ",", col_names = FALSE, quote = "\"", na = c("", "NA"))
Parsed with column specification:
cols(
  X1 = col_character(),
  X2 = col_character(),
  X3 = col_integer(),
  X4 = col_integer(),
  X5 = col_integer(),
  X6 = col_integer(),
  X7 = col_character()
)
|========================================================| 100%   29 MB

df <- read_csv("C:/qldtest.csv", col_names = TRUE)
Parsed with column specification:
cols(
  X1 = col_character(),
  X2 = col_character(),
  X6 = col_integer()
)

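The calls above import the data, but with an extra column. When I try to use the result, strange things happen (see below). To get the three columns I need, I use...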

df <- df %>% 
       select(X1, X2, X6)

Ultimately I need the data to look like...

X1    | X2 | X6
----------|----------------|------
Abbotsbury|4032,QLD        |0
na        |4033,QLD        |0
na        |4034,QLD        |10
na        |4035,QLD        |0
Smith Town|4032,QLD        |0
na        |4033,QLD        |220
na        |4034,QLD        |0
na        |4035,QLD        |0

Then I ran...

transform(df, X1 = na.locf(Suburb))

...to fill down the last known value in the first column, so that it becomes...

X1    | X2 | X6
----------|----------------|------
Abbotsbury|4032,QLD        |0
Abbotsbury|4033,QLD        |0
Abbotsbury|4034,QLD        |10
Abbotsbury|4035,QLD        |0
Smith Town|4032,QLD        |0
Smith Town|4033,QLD        |220
Smith Town|4034,QLD        |0
Smith Town|4035,QLD        |0

This works, but with the following warnings...

+ transform(df, X1 = na.locf(df))
Warning messages:
1: In is.na(object) :
  is.na() applied to non-(list or vector) of type 'NULL'
2: In is.na(object[1L]) :
  is.na() applied to non-(list or vector) of type 'NULL'

That said, the data looks correct.

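However, when I run the following to keep only the records where column X6 is > 0, R apparently adds another four columns, yet the variable count shown in the Global Environment still says 3??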

df1 <- df %>%
        filter(X6 > 0)

...and the result looks like

X1.X1.X1  |X1.X1.X2|X1.X1.X6|X1.X2   |X1.X6|X2      |X6
----------|--------|--------|--------|-----|--------|--
Abbotsbury|4613,QLD|3       |4613,QLD|3    |4613,QLD|3

What am I doing wrong? Any help is appreciated.

The first few lines of the file look like the attached images.

2 Answers:

Answer 0: (score: 1)

If you open the file in a text editor such as Sublime, you will see that every line ends with a trailing comma.

That is why there is an extra column: the parser reads the empty text after the final comma as a seventh field.
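
A tiny reproduction of the effect, using two made-up rows rather than the real file: when every line ends with a delimiter, the parser sees one more (empty) field per line. The sketch uses base R's read.csv on inline text, and the sample values are invented for illustration.

```r
# Two invented rows, each ending in a trailing comma (as in the real file).
txt <- c(
  'Abbeywood,"4000, QLD",0,0,0,0,',
  ',"4005, QLD",0,0,0,0,'
)

df <- read.csv(text = txt, header = FALSE, na.strings = c("", "NA"),
               stringsAsFactors = FALSE)

ncol(df)           # 7: six data fields plus one empty trailing field
all(is.na(df$V7))  # TRUE: the extra column holds only NA

# Dropping the last column leaves the six real fields.
df$V7 <- NULL
```

readr names the columns X1 to X7 instead of V1 to V7, but the cause of the extra column is the same.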

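I assume you do not need the information above the data, so I suggest reading the file with skip = 11. Since there is also a disclaimer below the data, you can exclude it with n_max, which limits the number of rows read.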

library(readr)
file <- "C:/Qld 20-34 Age Groups Clean.txt"
df <- read_delim(file, col_names = FALSE, quote = "\"", na = c("", "NA"), 
                 delim = ",", skip = 11, n_max = 1423540)
df$X7 <- NULL
head(df, n = 5)
# A tibble: 5 x 6
         X1        X2    X3    X4    X5    X6
      <chr>     <chr> <int> <int> <int> <int>
1 Abbeywood 4000, QLD     0     0     0     0
2      <NA> 4005, QLD     0     0     0     0
3      <NA> 4006, QLD     0     0     0     0
4      <NA> 4007, QLD     0     0     0     0
5      <NA> 4008, QLD     0     0     0     0
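
The same idea in a self-contained form, as a rough sketch on invented text rather than the real download: skip drops the preamble, and a row limit stops reading before the disclaimer. It uses base R's read.csv, whose nrows argument plays the role that n_max plays in readr. The preamble and footer lines here are placeholders.

```r
# Invented stand-in for the file: 2 preamble lines, 3 data rows, 1 footer line.
lines <- c(
  "Some report title",
  "Generated for illustration only",
  'Abbeywood,"4000, QLD",0,0,0,0,',
  ',"4005, QLD",0,0,0,0,',
  ',"4006, QLD",0,0,0,5,',
  "Disclaimer: placeholder footer text"
)

df <- read.csv(text = lines, header = FALSE, skip = 2, nrows = 3,
               na.strings = c("", "NA"), stringsAsFactors = FALSE)
df <- df[, -ncol(df)]   # drop the empty column created by the trailing commas
dim(df)                 # 3 rows, 6 columns
```

With the real file, the skip and row-limit values depend on how many preamble and data lines it actually contains.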

To replace each NA with the most recent non-NA value, you can use

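library(dplyr)
library(zoo)

# Pass the column itself (not the whole data frame) to na.locf();
# na.rm = FALSE keeps any leading NA so the result has as many rows as df.
df <- df %>% 
    mutate(X1 = na.locf(X1, na.rm = FALSE))

head(df, n = 5)
# A tibble: 5 x 6
         X1        X2    X3    X4    X5    X6
      <chr>     <chr> <int> <int> <int> <int>
1 Abbeywood 4000, QLD     0     0     0     0
2 Abbeywood 4005, QLD     0     0     0     0
3 Abbeywood 4006, QLD     0     0     0     0
4 Abbeywood 4007, QLD     0     0     0     0
5 Abbeywood 4008, QLD     0     0     0     0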
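
For readers without zoo, the "last observation carried forward" idea can be sketched in base R. fill_forward below is a hypothetical helper written for illustration, not part of zoo or dplyr, and the sample values are taken from the example table in the question.

```r
# Hypothetical helper: replace each NA with the most recent non-NA value.
fill_forward <- function(x) {
  seen <- cumsum(!is.na(x))        # number of non-NA values seen so far (0 = none yet)
  c(NA, x[!is.na(x)])[seen + 1]    # slot 1 is NA, so leading gaps stay NA
}

suburb <- c("Abbotsbury", NA, NA, NA, "Smith Town", NA, NA, NA)
filled <- fill_forward(suburb)
filled
```

Applied to a data frame, this would be used on a single column, for example df$X1 <- fill_forward(df$X1), in the same way that na.locf is applied to one column rather than to the whole data frame.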

Answer 1: (score: 0)

What about skipping only the first 9 lines and using the file's own header row?

Something like this:

jnk <- 
  read.csv('~/Downloads/Qld 20-34 Age Groups.csv', skip=9, stringsAsFactors=FALSE)

You can take a look at it with

summary(jnk)

Your df %>% filter(X6 > 0) command, for example, would then look like this

library(dplyr)  # provides %>% and filter()
head(jnk %>% filter(Total > 0))
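
A small self-contained sketch of this approach, on invented text: with the header row kept, columns can be referred to by name. Only Total comes from the answer above. The other column names and all values are assumptions made for illustration, and base R subsetting stands in for dplyr's filter.

```r
# Invented stand-in: 1 preamble line, then a header row and 3 data rows.
lines <- c(
  "Preamble line to skip",
  "Suburb,Postcode,Total",
  'Abbeywood,"4000, QLD",0',
  ',"4005, QLD",12',
  ',"4006, QLD",0'
)

jnk <- read.csv(text = lines, skip = 1, stringsAsFactors = FALSE)
positive <- jnk[jnk$Total > 0, ]   # same rows as jnk %>% filter(Total > 0)
positive
```

Named columns make the later steps easier to read than X1 to X6, at the cost of depending on the header row being present and well formed.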

Or am I missing some key point of the question?