I am trying to analyze a large dataset of Airbnb listings. The amenities column lists the amenities that each listing includes.
For example,
{"Wireless Internet","Air conditioning",Kitchen,Heating,"Fire extinguisher",Essentials,Shampoo,Hangers}
and
{TV,"Wireless Internet","Air conditioning",Kitchen,"Elevator in building",Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}
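I have two problems to solve:
1. I want to split the string into separate columns, so that there is, for example, a column titled TV. If the string contains TV, the entry in the corresponding cell should be 1, otherwise 0. How can I do this?
2. How do I remove the entries containing translation missing:.....?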
Answer 0: (score: 0)
I believe this would be a quick solution to the problem:
library(data.table)
setDT(df)
dcast(df, listing_id~amenities)
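Answer 1: (score: 0)
Is this the Boston Airbnb Open Data from Kaggle? Here is one way. It is not exactly pretty, but it seems to work.
The idea is to remove the { and }, then use read_csv() to parse the string.
Then, list the unique amenities and create one column for each: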
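library(dplyr)
library(readr)

listings <- read_csv(file = "../data/boston-airbnb-open-data/listings.csv")

# strip the surrounding braces and parse each string as a one-line CSV header;
# the appended "\n" makes read_csv() treat the string as literal data, not a path
parsed_amenities <-
  listings %>%
  .$amenities %>%
  sub("^\\{(.*)\\}$", "\\1\n", x = .) %>%
  lapply(function(x) names(read_csv(x)))

# one logical column per unique amenity, skipping "translation missing" entries
df <-
  unique(unlist(parsed_amenities)) %>%
  .[!grepl("translation missing", .)] %>%
  setNames(., .) %>%
  lapply(function(x) vapply(parsed_amenities, "%in%", logical(1), x = x)) %>%
  as_tibble()  # as_data_frame() is deprecated in current tibble/dplyr

df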
# # A tibble: 3,585 × 43
# TV `Wireless Internet` Kitchen `Free Parking on Premises` `Pets live on this property` `Dog(s)` Heating
# <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
# 1 TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# 2 TRUE TRUE TRUE FALSE TRUE TRUE TRUE
# 3 TRUE TRUE TRUE TRUE FALSE FALSE TRUE
# 4 TRUE TRUE TRUE TRUE FALSE FALSE TRUE
# 5 FALSE TRUE TRUE FALSE FALSE FALSE TRUE
# 6 FALSE TRUE TRUE TRUE TRUE FALSE TRUE
# 7 TRUE TRUE TRUE TRUE FALSE FALSE TRUE
# 8 TRUE TRUE FALSE TRUE TRUE TRUE TRUE
# 9 FALSE TRUE FALSE FALSE TRUE FALSE TRUE
# 10 TRUE TRUE TRUE TRUE FALSE FALSE TRUE
# # ... with 3,575 more rows, and 36 more variables: `Family/Kid Friendly` <lgl>, Washer <lgl>, Dryer <lgl>, `Smoke
# # Detector` <lgl>, `Fire Extinguisher` <lgl>, Essentials <lgl>, Shampoo <lgl>, `Laptop Friendly Workspace` <lgl>,
# # Internet <lgl>, `Air Conditioning` <lgl>, `Pets Allowed` <lgl>, `Carbon Monoxide Detector` <lgl>, `Lock on Bedroom
# # Door` <lgl>, Hangers <lgl>, `Hair Dryer` <lgl>, Iron <lgl>, `Cable TV` <lgl>, `First Aid Kit` <lgl>, `Safety
# # Card` <lgl>, Gym <lgl>, Breakfast <lgl>, `Indoor Fireplace` <lgl>, `Cat(s)` <lgl>, `24-Hour Check-in` <lgl>, `Hot
# # Tub` <lgl>, `Buzzer/Wireless Intercom` <lgl>, `Other pet(s)` <lgl>, `Washer / Dryer` <lgl>, `Smoking
# # Allowed` <lgl>, `Suitable for Events` <lgl>, `Wheelchair Accessible` <lgl>, `Elevator in Building` <lgl>,
# # Pool <lgl>, Doorman <lgl>, `Paid Parking Off Premises` <lgl>, `Free Parking on Street` <lgl>
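The same idea (strip the braces, parse the quoted fields, then test membership for each unique amenity) can be sketched in base R on the two sample strings from the question. This is only an illustration under stated assumptions: it swaps read_csv() for base scan(), and the names parse_amenities, all_amenities and indicators are made up for the example. It stores 1/0 integers, as the question asks, rather than TRUE/FALSE.

```r
# The two sample strings from the question
amenities <- c(
  '{"Wireless Internet","Air conditioning",Kitchen,Heating,"Fire extinguisher",Essentials,Shampoo,Hangers}',
  '{TV,"Wireless Internet","Air conditioning",Kitchen,"Elevator in building",Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}'
)

# Hypothetical helper: strip the outer braces, then parse the rest as one
# comma-separated record; scan() honours the double quotes around
# multi-word amenities
parse_amenities <- function(x) {
  inner <- sub("^\\{(.*)\\}$", "\\1", x)
  scan(text = inner, what = "", sep = ",", quote = "\"", quiet = TRUE)
}
parsed <- lapply(amenities, parse_amenities)

# Unique amenities, without the "translation missing: ..." placeholders
all_amenities <- unique(unlist(parsed))
all_amenities <- all_amenities[!grepl("^translation missing", all_amenities)]

# One row per listing, one column per amenity: 1 if present, 0 otherwise
indicators <- t(vapply(
  parsed,
  function(p) as.integer(all_amenities %in% p),
  integer(length(all_amenities))
))
colnames(indicators) <- all_amenities

indicators[, c("TV", "Kitchen", "Fire extinguisher")]
```

The result is an integer matrix; as.data.frame(indicators) turns it into a data frame if the columns need to be joined back to the listings.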
Answer 2: (score: 0)
Here is an approach that also uses dcast() from the data.table package:
library(data.table)
# read data file, returning one column
raw <- fread("AirBnB.csv", header = FALSE, sep = "\n", col.names = "amenities")
# add column with row numbers
raw[, rn := seq_len(.N)]
# remove opening and closing curly braces
raw[, amenities := stringr::str_replace_all(amenities, "^\\{|\\}$", "")]
# split amenities, thereby reshaping from wide to long format
long <- raw[, strsplit(amenities, ",", fixed = TRUE), by = rn]
# remove double quotes and leading and trailing whitespace
long[, V1 := stringr::str_trim(stringr::str_replace_all(V1, '["]', ""))]
# reshape from long to wide format, omitting rows which contain "translation missing..."
dcast(long[!V1 %like% "^translation missing"], rn ~ V1, length, value.var = "rn", fill = 0)
# rn Air conditioning Carbon monoxide detector Elevator in building Essentials
#1: 1 1 0 0 1
#2: 2 1 1 1 1
# Fire extinguisher First aid kit Hair dryer Hangers Heating Iron Kitchen
#1: 1 0 0 1 1 0 1
#2: 0 1 1 1 1 1 1
# Laptop friendly workspace Lock on bedroom door Shampoo Smoke detector
#1: 0 0 1 0
#2: 1 1 1 1
# Suitable for events TV Wireless Internet
#1: 0 0 1
#2: 1 1 1
Like the other data.table answer, this relies on dcast(), but it also addresses the tedious but important details of data cleaning.
The OP provided only two data samples, which were copied to a data file named "AirBnB.csv":
{"Wireless Internet","Air conditioning",Kitchen,Heating,"Fire extinguisher",Essentials,Shampoo,Hangers}
{TV,"Wireless Internet","Air conditioning",Kitchen,"Elevator in building",Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}
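To rerun the data.table code in this answer, the two-line data file can be recreated from the samples. A minimal sketch, assuming the file should sit in the current working directory (the variable name samples is made up for the example):

```r
# The two sample rows from the question, one listing per line
samples <- c(
  '{"Wireless Internet","Air conditioning",Kitchen,Heating,"Fire extinguisher",Essentials,Shampoo,Hangers}',
  '{TV,"Wireless Internet","Air conditioning",Kitchen,"Elevator in building",Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}'
)

# Write them to "AirBnB.csv" in the current working directory
writeLines(samples, "AirBnB.csv")

# Sanity check: the file should round-trip to the same two lines
identical(readLines("AirBnB.csv"), samples)
```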