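I really need your help. I scraped some data from Wikipedia and ran into this symbol. At first I thought it was just an ordinary character, but apparently it is not.
Most of my cells look like this: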
table$Population
7004164110000000000¦16,411[7]
7007111260000000000¦11,126,000[13]
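I am trying to remove everything except 16,411, but first I need to understand how to convert this symbol into something else.
Any help is appreciated. I am going crazy, because gsub did not work when I tried it, and str_split_fixed did not work either...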
dput(tables$Population)
gives
c("7007301655000000000¦30,165,500[6]", "7007241833000000000¦24,183,300[8]", "7007217070000000000¦21,707,000[10]", "7007150292310000000¦15,029,231[11]")
Answer 0 (score: 3)
Here is another way to parse the table into a data frame:
library(rvest)
pg <- read_html("https://en.wikipedia.org/wiki/List_of_cities_proper_by_population")
html_node(pg, "table.wikitable") %>%
  html_table() %>%
  tibble::as_tibble() %>%
  janitor::clean_names() %>%
  # The line below does what you originally asked for, but in a different way:
  # it splits on the run of non-ASCII characters between the sort key and the number.
  tidyr::separate(population, c("sortkey", "population"), sep="[^[:ascii:]]+") %>%
  dplyr::mutate(
    population = gsub("\\[.*$", "", population)  # drop the trailing footnote marker
  ) %>%
  readr::type_convert()
## # A tibble: 87 x 9
##     rank city      image sortkey population definition                totalarea_km  populationdensi… country
##    <int> <chr>     <lgl>   <dbl>      <dbl> <chr>                     <chr>                    <dbl> <chr>
##  1     1 Chongqing NA    7.01e18  30165500. Municipality              700482403000…             366. China
##  2     2 Shanghai  NA    7.01e18  24183300. Municipality              700363405000…            3814. China
##  3     3 Beijing   NA    7.01e18  21707000. Municipality              700416411000…            1267. China
##  4     4 Istanbul  NA    7.01e18  15029231. Metropolitan municipality 700262029000…           24231. Turkey
##  5     5 Karachi   NA    7.01e18  14910352. City[14]                  700337800000…            3944. Pakist…
##  6     6 Dhaka     NA    7.01e18  14399000. City                      700233754000…           42659. Bangla…
##  7     7 Guangzhou NA    7.01e18  13081000. City (sub-provincial)     700374340000…            1760. China
##  8     8 Shenzhen  NA    7.01e18  12528300. City (sub-provincial)     700319920000…            6889. China
##  9     9 Mumbai    NA    7.01e18  12442373. City[21]                  700243771000…           28426. India
## 10    10 Moscow    NA    7.01e18  13200000. Federal city[24][25]      2 511[26]                5256. Russia
## # ... with 77 more rows
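For anyone who wants to try the splitting step without fetching the page, here is a minimal base R sketch that applies the same idea to the four values from the question's dput output. It assumes the separator is the single non-ASCII character shown in the question (written here as the escape \u00a6 so the snippet does not depend on file encoding); the names x, parts and df are only illustrative.

```r
# Sample values from the question; the separator is written as a Unicode
# escape (assumed to be U+00A6, the character as displayed in the question).
x <- c("7007301655000000000\u00a630,165,500[6]",
       "7007241833000000000\u00a624,183,300[8]",
       "7007217070000000000\u00a621,707,000[10]",
       "7007150292310000000\u00a615,029,231[11]")

# Split each value on the run of non-ASCII characters.
# [:ascii:] is a PCRE class, hence perl = TRUE.
parts <- strsplit(x, "[^[:ascii:]]+", perl = TRUE)

df <- data.frame(
  sortkey    = vapply(parts, `[`, character(1), 1),
  population = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)

# Drop the trailing footnote marker, then the thousands separators, and convert.
df$population <- gsub("\\[.*$", "", df$population)
df$population <- as.numeric(gsub(",", "", df$population))
print(df)
```

Because the pattern matches any non-ASCII run, the same sketch should behave the same way if the separator turns out to be a different non-ASCII character.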
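The table's underlying row markup places a sort key in front of the visible number.
The "Population" cells end up looking like this as an R raw vector (this is the first one; 30 is the digit 0, and the non-ASCII bytes e2 99 a0 show where the separator sits):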
## [1] 37 30 30 37 33 30 31 36 35 35 30 30 30 30 30 30 30 30 30 e2 99 a0 33 30 2c 31 36 35 2c 35 30 30 5b 36 5d
That looks like an embedded Unicode character. Since it is non-ASCII, we can take advantage of that to tidy the data.
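If you want to check this yourself, the following sketch shows one way to inspect a cell and locate the non-ASCII character. It assumes the separator is U+2660, which is what the bytes e2 99 a0 decode to in UTF-8; the variable names are only illustrative.

```r
# First "Population" cell, with the separator written as a Unicode escape
# (assumed U+2660, matching the e2 99 a0 bytes in the raw dump above).
cell <- "7007301655000000000\u266030,165,500[6]"

# Raw UTF-8 bytes of the string; the separator appears as e2 99 a0.
bytes <- charToRaw(enc2utf8(cell))
print(bytes)

# Code points above 127 are non-ASCII; find where they occur.
cp <- utf8ToInt(cell)
non_ascii <- which(cp > 127)
print(non_ascii)                          # character position of the separator
print(sprintf("U+%04X", cp[non_ascii]))   # its code point
```

Looking at code points rather than the printed string avoids being misled by however the console happens to render the character.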
Answer 1 (score: 2)
You need to use \\ to escape the character:
test <- "7004164110000000000¦16,411"
gsub("\\¦", "", test)
[1] "700416411000000000016,411"
Edit: yes, it also works on the whole column:
> gsub("\\¦","",c("7007301655000000000¦30,165,500[6]", "7007241833000000000¦24,183,300[8]", "7007217070000000000¦21,707,000[10]", "7007150292310000000¦15,029,231[11]"))
[1] "700730165500000000030,165,500[6]" "700724183300000000024,183,300[8]"
[3] "700721707000000000021,707,000[10]" "700715029231000000015,029,231[11]"
EDIT2: Following @hrbrmstr's suggestion to match the non-ASCII character, the following will also work for you:
stringr::str_replace(c("7007301655000000000¦30,165,500[6]", "7007241833000000000¦24,183,300[8]", "7007217070000000000¦21,707,000[10]", "7007150292310000000¦15,029,231[11]"),
+ "[^[:ascii:]]+","")
[1] "700730165500000000030,165,500[6]" "700724183300000000024,183,300[8]"
[3] "700721707000000000021,707,000[10]" "700715029231000000015,029,231[11]"
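The calls above remove the separator but leave the sort key and the footnote in place. Since the question asks to keep only the number (for example 16,411), here is a minimal base R sketch that strips both. It assumes every value has the form sort key, one non-ASCII separator, number, bracketed footnote; the separator is written as the escape \u00a6 and the names x and pop are only illustrative.

```r
# The two example cells from the question (separator assumed to be U+00A6).
x <- c("7004164110000000000\u00a616,411[7]",
       "7007111260000000000\u00a611,126,000[13]")

# Remove everything up to and including the last non-ASCII character.
# [:ascii:] is a PCRE class, hence perl = TRUE.
pop <- sub("^.*[^[:ascii:]]", "", x, perl = TRUE)

# Remove the trailing footnote marker such as [7] or [13].
pop <- sub("\\[[0-9]+\\]$", "", pop)
print(pop)
```

If a numeric column is needed, the commas can be removed with gsub(",", "", pop) before calling as.numeric().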