I want to load the data served by a .php link into an R data frame, but I'm having trouble doing so.
What I've tried so far:
require(XML)
data <- xmlParse("http://www.mahdial-husseini.com/xmlthing.php ")
xml_data <- xmlToList(data)
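The error I get: Error: 1: failed to load HTTP resource
Also (more conceptually), I don't really understand the nature of the link. Is the XML data inside the PHP file, and if so, when I use R to collect the data, should I treat it as XML or as PHP? Thanks.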
Answer 0 (score: 3)
Alternatively, here is something that may be more readable:
library(xml2)
library(tidyverse)
This helper makes nicer column names:
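mcga <- function(tbl) {
  x <- colnames(tbl)
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)  # punctuation/whitespace runs -> "_"
  x <- gsub("_+", "_", x)                     # collapse repeated underscores
  x <- gsub("(^_|_$)", "", x)                 # trim leading/trailing underscores
  x <- make.unique(x, sep = "_")              # de-duplicate any clashing names
  colnames(tbl) <- x
  tbl
}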
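As a quick illustration of what mcga() does, here is a minimal sketch that applies it to a small data frame. The column names are made up for the example (the feed's original attribute names are not shown in this answer), and the helper is repeated so the snippet runs on its own.

```r
# Same helper as above, repeated so this snippet is self-contained
mcga <- function(tbl) {
  x <- tolower(colnames(tbl))
  x <- gsub("[[:punct:][:space:]]+", "_", x)
  x <- gsub("_+", "_", x)
  x <- gsub("(^_|_$)", "", x)
  colnames(tbl) <- make.unique(x, sep = "_")
  tbl
}

# Hypothetical messy column names, not taken from the real feed
messy <- data.frame(1, 2, 3, 4)
colnames(messy) <- c("Sample ID", "PM2.5 Hourly", "Temp (C)", "temp_c")

cleaned <- mcga(messy)
colnames(cleaned)
# "sample_id" "pm2_5_hourly" "temp_c" "temp_c_1"
```

The last two names collide after cleaning, which is why the helper finishes with make.unique().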
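The column types below were guessed automatically at first, but once you know them it is worth defining them explicitly, since that keeps the data consistent: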
cols(
  .default = col_integer(),
  site = col_character(),
  aod_47 = col_double(),
  omi_aot = col_double(),
  omi_no2 = col_double(),
  fit = col_double(),
  lng = col_double(),
  lat = col_double()
) -> xdf_cols
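To show what such a specification does, here is a small sketch that runs type_convert() on a toy all-character data frame. The data frame and the name toy_cols are assumptions for illustration only; the values are not read from the real feed.

```r
library(readr)

# Toy all-character data frame standing in for the parsed attributes
raw <- data.frame(
  sample = c("11111", "19"),
  site   = c("Atlanta", ""),
  aod_47 = c("0.379", "0.000"),
  stringsAsFactors = FALSE
)

# Hypothetical, shortened version of the column specification
toy_cols <- cols(
  .default = col_integer(),
  site     = col_character(),
  aod_47   = col_double()
)

typed <- type_convert(raw, col_types = toy_cols)
str(typed)
```

With an explicit specification, sample comes back as integer, aod_47 as double, and the empty site string becomes NA. If col_types is left out, type_convert() guesses the types instead, and depending on the readr version whole numbers may be guessed as doubles rather than integers.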
Now the actual work:
doc <- read_xml("http://www.mahdial-husseini.com/xmlthing.php")
# each PPM1_0 record keeps its values in attributes; turn each node into one row
xml_find_all(doc, ".//PPM1_0") %>%
  map_df(~{
    xml_attrs(.x) %>%
      as.list()
  }) %>%
  mcga() %>%
  type_convert(col_types = xdf_cols) -> xdf
type_convert() is not strictly necessary, but together with the column definitions it keeps the result consistent.
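For readers who cannot reach the URL, the attribute-to-row step can be tried offline. The sketch below parses a small inline XML string that only imitates the nested PPM1_0 layout; the attribute names and values are assumptions, not a copy of the real feed.

```r
library(xml2)
library(purrr)
library(dplyr)

# Hypothetical stand-in for the feed: nested <PPM1_0> elements with attributes
xml_text <- '<PPM1_0>
  <PPM1_0 Sample="11111" Site="Atlanta" AOD_47="0.379"/>
  <PPM1_0 Sample="19" Site="" AOD_47="0.000"/>
</PPM1_0>'

doc <- read_xml(xml_text)
records <- xml_find_all(doc, ".//PPM1_0")

# xml_attrs() returns a named character vector per node;
# each one becomes a one-row tibble, and bind_rows() stacks them
rows <- records %>%
  map(~ as_tibble(as.list(xml_attrs(.x)))) %>%
  bind_rows()

rows
```

Every column in rows is still character at this point, which is why a type conversion step follows in the pipeline above.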
Result:
xdf
## # A tibble: 8 x 21
## sample site month day year hour jd doy pm25_hourly aod_47 omi_aot omi_no2 fit res
## <int> <chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 0 duluth 0 0 0 0 0 0 0 0.000 0.000 0.000 0.00000 0
## 2 19 <NA> 12 0 2004 5 0 0 30 0.000 0.000 0.000 0.00000 0
## 3 4545 Sarasota 4 0 2017 0 0 0 0 0.000 0.000 0.000 0.00000 0
## 4 11111 Atlanta 10 1 2004 13 2453280 275 23 0.379 0.148 0.274 16.01850 NA
## 5 11112 Birmingham 10 2 2008 14 2453281 276 0 0.000 0.000 0.000 19.19440 0
## 6 11113 Savannah 10 3 2004 13 2453282 277 15 0.181 0.133 0.127 9.00433 NA
## 7 11114 Fort Knox 6 20 2017 21 0 301 18 0.000 0.000 0.000 0.00000 0
## 8 63738 Fort Rucker 1 0 2015 0 0 0 40 0.000 0.000 0.000 0.00000 0
## # ... with 7 more variables: lng <dbl>, lat <dbl>, rel_humid <int>, altitude <int>, pressure <int>,
## # signal_received <int>, temp_c <int>
Full structure:
glimpse(xdf)
## Observations: 8
## Variables: 21
## $ sample <int> 0, 19, 4545, 11111, 11112, 11113, 11114, 63738
## $ site <chr> "duluth", NA, "Sarasota", "Atlanta", "Birmingham", "Savan...
## $ month <int> 0, 12, 4, 10, 10, 10, 6, 1
## $ day <int> 0, 0, 0, 1, 2, 3, 20, 0
## $ year <int> 0, 2004, 2017, 2004, 2008, 2004, 2017, 2015
## $ hour <int> 0, 5, 0, 13, 14, 13, 21, 0
## $ jd <int> 0, 0, 0, 2453280, 2453281, 2453282, 0, 0
## $ doy <int> 0, 0, 0, 275, 276, 277, 301, 0
## $ pm25_hourly <int> 0, 30, 0, 23, 0, 15, 18, 40
## $ aod_47 <dbl> 0.000, 0.000, 0.000, 0.379, 0.000, 0.181, 0.000, 0.000
## $ omi_aot <dbl> 0.000, 0.000, 0.000, 0.148, 0.000, 0.133, 0.000, 0.000
## $ omi_no2 <dbl> 0.000, 0.000, 0.000, 0.274, 0.000, 0.127, 0.000, 0.000
## $ fit <dbl> 0.00000, 0.00000, 0.00000, 16.01850, 19.19440, 9.00433, 0...
## $ res <int> 0, 0, 0, NA, 0, NA, 0, 0
## $ lng <dbl> 84.1000, 63.6167, -82.5300, -84.7000, -86.8000, -81.1000,...
## $ lat <dbl> 34.0000, 38.4161, 27.3300, 33.7500, 33.5200, 32.0800, 37....
## $ rel_humid <int> 0, 0, 0, 0, 0, 0, 0, 0
## $ altitude <int> 0, 0, 0, 0, 0, 0, 0, 0
## $ pressure <int> 0, 0, 0, 0, 0, 0, 0, 0
## $ signal_received <int> 0, 0, 0, 0, 0, 0, 0, 0
## $ temp_c <int> 0, 0, 0, 0, 0, 0, 0, 0
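On the conceptual part of the question: a .php URL points at a script that runs on the server, and the client only receives the script's output. Since that output parses with read_xml() here, it can be treated as XML in R. The sketch below is one way to check this, assuming the httr package (not used elsewhere in this thread) is installed; fetch_as_xml is a hypothetical helper name.

```r
library(httr)
library(xml2)

# Hypothetical helper: fetch a URL and parse the response body as XML
fetch_as_xml <- function(url) {
  resp <- GET(trimws(url), timeout(10))  # trimws() drops stray spaces around the URL
  stop_for_status(resp)                  # turn HTTP errors into R errors
  message("Content-Type reported by server: ", http_type(resp))
  read_xml(content(resp, as = "raw"))    # parse the raw body, whatever the URL suffix
}

# The site may be unreachable, so failures are caught rather than raised
doc <- tryCatch(
  fetch_as_xml("http://www.mahdial-husseini.com/xmlthing.php"),
  error = function(e) NULL
)
```

If the request succeeds, doc is an xml_document that can go through the same xml_find_all() pipeline as above; if it fails, doc is NULL.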
Answer 1 (score: 2)
You can use the rvest package (and the handy data.table):
library(data.table)
library(rvest)
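a <- read_html("http://www.mahdial-husseini.com/xmlthing.php")
# html_attrs() is rvest's own accessor; xml_attrs() needs library(xml2) with rvest >= 1.0
dt <- rbindlist(lapply(a %>% html_nodes(css = "body > ppm1_0 > ppm1_0") %>%
                         html_attrs(),
                       function(x) as.data.table(t(x))))
# column 2 is site (character); convert every other column to numeric
dt <- cbind(dt[, 2, with = FALSE],
            as.data.table(lapply(dt[, -2, with = FALSE], as.numeric)))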
dt
site sample month day year hour jd doy pm25_hourly aod_47
1: duluth 0 0 0 0 0 0 0 0 0.000
2: 19 12 0 2004 5 0 0 30 0.000
3: Sarasota 4545 4 0 2017 0 0 0 0 0.000
4: Atlanta 11111 10 1 2004 13 2453280 275 23 0.379
5: Birmingham 11112 10 2 2008 14 2453281 276 0 0.000
6: Savannah 11113 10 3 2004 13 2453282 277 15 0.181
7: Fort Knox 11114 6 20 2017 21 0 301 18 0.000
8: Fort Rucker 63738 1 0 2015 0 0 0 40 0.000
omi_aot omi_no2 fit res lng lat rel_humid altitude pressure
1: 0.000 0.000 0.00000 0 84.1000 34.0000 0 0 0
2: 0.000 0.000 0.00000 0 63.6167 38.4161 0 0 0
3: 0.000 0.000 0.00000 0 -82.5300 27.3300 0 0 0
4: 0.148 0.274 16.01850 NA -84.7000 33.7500 0 0 0
5: 0.000 0.000 19.19440 0 -86.8000 33.5200 0 0 0
6: 0.133 0.127 9.00433 NA -81.1000 32.0800 0 0 0
7: 0.000 0.000 0.00000 0 -85.9500 37.9100 0 0 0
8: 0.000 0.000 0.00000 0 -85.7000 31.3400 0 0 0
signal_received temp_c
1: 0 0
2: 0 0
3: 0 0
4: 0 0
5: 0 0
6: 0 0
7: 0 0
8: 0 0
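The positional selection dt[, 2, with = FALSE] relies on site being the second attribute. As a hedged variation, the conversion can be done by column name instead. The attribute vectors below are made up to mimic the shape of the scraped attributes, so the snippet runs without network access.

```r
library(data.table)

# Hypothetical attribute vectors, shaped like one named character vector per node
attrs <- list(
  c(sample = "11111", site = "Atlanta", aod_47 = "0.379"),
  c(sample = "19",    site = "",        aod_47 = "0.000")
)

# One row per node; fill = TRUE tolerates nodes with missing attributes
dt <- rbindlist(lapply(attrs, as.list), fill = TRUE)

# Convert everything except site to numeric, selecting by name
num_cols <- setdiff(names(dt), "site")
dt[, (num_cols) := lapply(.SD, as.numeric), .SDcols = num_cols]

dt
```

Unlike the cbind() approach, this keeps the columns in their original order and updates dt in place.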