Convert a factor to character while keeping the displayed information?

Asked: 2019-10-08 09:11:07

Tags: r web-scraping

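I'm trying to scrape Swedish municipal election data and need to count the number of parties in each municipal council. Because local parties exist, each major party has its information in its own cell, while the smaller local parties are all shown together in a separate column.

When I scrape the table and keep only the information I need, the variable is a factor. I've run into this before and usually just convert it to character.

However, when I do that here, it destroys the information I want to keep.

Instead of showing "VÄG= 3" for Borås kommun, it shows "c(ÖVR= 9)" and drops the information I need, and the observation I want as NA becomes "c(ÖVR= 1)".

I also tried sub() to replace the blank observations with NA before converting to character, but then everything became NA.

A minimal reproducible example with simulated data would be best, but I can't think of a way to reproduce this without including the source. If anyone knows one, please let me know for future questions!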

library(rvest) #For Web scraping
library(tidyverse) #For mainly pipes and filter function

#Official Swedish Election data
url <-"https://data.val.se/val/val2006/slutlig_ovrigt/statistik/kommun/mandat_kommun_parti.html" 
elections <- read_html(url) %>%
     html_table(header = TRUE, fill = TRUE)
elections <- elections[[1]]

# This is three different municipalities, one with one local party,
# one with no local party, and one with two local parties        
elections <- elections %>% filter(Kommun %in% c("Borås", "Eskilstuna", "Huddinge")) 

elections <- t(elections) #transpose so each municipality is a variable, and the parties are observations

elections <- elections[-nrow(elections),] #delete the total number of seats
elections <- elections[-1,] #Remove the municipalities  names
elections <- data.frame(elections) #convert into a data frame
row.names(elections) <- c() #remove the row names

others <- elections[nrow(elections),] #take the other parties
others <- as.character(others) #here everything goes wrong

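The result I expect is the displayed information converted to character rather than to factor levels, with the empty observations becoming NA or something I can convert to NA. Instead, everything ends up in the "c(ÖVR= X)" format.

Any help or pointers on where to find information about a fix would be much appreciated! The same goes for any criticism of how I could improve my question.

Thanks.

1 answer:

Answer 0: (score: 0)

Your current approach includes a few steps that are not recommended. Consider the following alternative, which keeps the data tidy: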

library(tidyr)
library(dplyr)
library(rvest)

url <-"https://data.val.se/val/val2006/slutlig_ovrigt/statistik/kommun/mandat_kommun_parti.html" #Official Swedish Election data

page <- read_html(url) 

page %>%
  html_table(header = TRUE, fill = TRUE) %>%
  first()  %>% 
  filter(Kommun %in% c("Borås", "Eskilstuna", "Huddinge")) %>%
  select(Kommun, x = ÖVR) %>%  # renamed ÖVR as encoding was producing weird results with separate_rows() 
  separate_rows(x, sep = ", ") %>%
  mutate(x = na_if(x, "")) %>%  # na_if() on a whole data frame errors in dplyr >= 1.1.0, so apply it to the column
  group_by(Kommun) %>%
  summarise(Count = sum(!is.na(x)))

  # A tibble: 3 x 2
  Kommun     Count
  <chr>      <int>
1 Borås          1
2 Eskilstuna     0
3 Huddinge       2
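
If you want to check the counting logic without hitting the website, here is a minimal base-R sketch on simulated data. The data frame `sim` and the party labels in it (other than "VÄG= 3", which comes from the question) are made up for illustration; only the shape of the ÖVR column (comma-separated "PARTY= seats" entries, blank when there is no local party) is taken from the scraped table.

```r
# Simulated stand-in for the filtered table (values are hypothetical)
sim <- data.frame(
  Kommun = c("Borås", "Eskilstuna", "Huddinge"),
  x      = c("VÄG= 3", "", "DP= 4, HP= 2"),
  stringsAsFactors = FALSE
)

# Split each cell on ", " so every local party becomes one element
parts <- strsplit(sim$x, ", ", fixed = TRUE)

# Count the non-empty, non-missing entries per municipality
sim$Count <- vapply(
  parts,
  function(p) sum(!is.na(p) & nzchar(p)),
  integer(1)
)

print(sim[, c("Kommun", "Count")])
```

This mirrors what separate_rows() followed by summarise() does in the pipeline above: one entry per local party, and blank cells contribute zero.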

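The main problem with the original approach is that t() turns the data into a character matrix, data.frame() then converts the strings to factors by default (this was the default before R 4.0.0), and you then call as.character() on the data frame object rather than on each variable. So you could do this instead: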

elections <- t(elections) #transpose so each municipality is a variable, and the parties are observations
elections <- elections[-nrow(elections),] #delete the total number of seats
elections <- elections[-1,] #Remove the municipalities  names
elections <- data.frame(elections, stringsAsFactors = FALSE, row.names = NULL) #convert into a data frame
others <- elections[nrow(elections),] #take the other parties
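
As for reproducing this without the source: a rough sketch with simulated data is below. The data frame, the column name `OVR` (an ASCII stand-in for ÖVR) and the seat values are assumptions made up for illustration. It forces `stringsAsFactors = TRUE` to mimic the old default, then shows two ways to get the labels back: converting each column separately, or never creating factors in the first place.

```r
# Simulated stand-in for the scraped table (values are hypothetical)
elections <- data.frame(
  Kommun = c("Borås", "Eskilstuna", "Huddinge"),
  M      = c(15, 12, 20),
  OVR    = c("VÄG= 3", "", "DP= 4, HP= 2"),
  stringsAsFactors = FALSE
)

# t() on a data frame with mixed types yields a character matrix
m <- t(elections[, -1])

# Mimic the pre-4.0 default: every column becomes a factor
df_factor <- data.frame(m, stringsAsFactors = TRUE)
last_row  <- df_factor[nrow(df_factor), ]   # still a one-row data frame

# as.character() on the whole data frame does not return the labels
bad <- as.character(last_row)
print(bad)

# Option 1: convert each factor column on its own
fixed <- vapply(last_row, as.character, character(1), USE.NAMES = FALSE)

# Option 2: avoid factors entirely, then flatten the row to a vector
df_chr <- data.frame(m, stringsAsFactors = FALSE, row.names = NULL)
others <- unlist(df_chr[nrow(df_chr), ], use.names = FALSE)
others[others == ""] <- NA   # blank cells become NA

print(fixed)
print(others)
```

What `bad` prints depends on the R version, but it will not be the labels you see when printing the data frame, which is the behaviour described in the question.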