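I want to isolate the usernames of Instagram users, and I have been using the @ symbol to identify users in the page source.
My problem is that when a username contains a period (.), my code removes everything after it, which gives me an incomplete username.
What I need is the complete username, that is, every character after the @ up to the next space.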
library(data.table)  # provides %like%
library(stringr)     # provides str_extract()
web_page_read <- read.csv('https://www.instagram.com/p/BY0i2O0FxHl/')
colnames(web_page_read) <- "id"
web_page_collect <- web_page_read[web_page_read$id %like% '@',]
web_page_collect <- as.data.frame(web_page_collect)
colnames(web_page_collect) <- "id"
web_page_collect$id <- str_extract(web_page_collect$id, "(?<=@)\\w+")
web_page_collect$id <- sub("^[^@]*@","",web_page_collect$id)
web_page_collect$id <- gsub(").*","",web_page_collect$id)
web_page_collect$id <- gsub(" .*","",web_page_collect$id)
web_page_collect$id <- gsub('[â]', '', web_page_collect$id)
web_page_collect$id <- gsub('[???]', '', web_page_collect$id)
web_page_collect <- head(web_page_collect,-(nrow(web_page_collect)-1))
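The truncation can be reproduced in isolation. The sketch below assumes a made-up input string (it is not taken from the real page source) and uses base R only; it suggests that the `\\w+` in the `str_extract` pattern is what cuts the name short.

```r
# Hypothetical input; the real page source is much longer.
x <- "Photo by Jim (@jim.n.tonic) on Instagram"

# \\w matches letters, digits and "_" only, so the match ends at the first "."
truncated <- regmatches(x, regexpr("(?<=@)\\w+", x, perl = TRUE))
print(truncated)
# [1] "jim"
```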
Answer 0: (score: 0)
If all you need is the username between the parentheses:
library(data.table)
web_page_read <- read.csv('https://www.instagram.com/p/BY0i2O0FxHl/')
colnames(web_page_read) <- "id"
web_page_collect <- web_page_read[web_page_read$id %like% '@',]
web_page_collect <- as.data.frame(web_page_collect)
colnames(web_page_collect) <- "id"
web_page_collect$id <- gsub("(^.+\\(@)(.+)(\\).+$)","\\2",web_page_collect$id)
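This uses a regular expression with capture groups in the gsub function. Loosely translated, you define the groups as follows. Group 1: from the start of the string up to "(@". Group 2: all the characters that follow (the username). Group 3: from ")" through to the end of the string. The replacement "\\2" then keeps only the second group.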
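As a minimal sketch of the capture-group idea, the snippet below applies the same pattern to a single made-up string. The sample text is an assumption for illustration, not real page source.

```r
# Hypothetical sample line; the real page source will look different.
x <- "Jim Tonic (@jim.n.tonic) posted a photo"

# Group 1: start of the string up to "(@".
# Group 2: the username.
# Group 3: ")" through the end of the string.
# The replacement "\\2" keeps only group 2.
username <- gsub("(^.+\\(@)(.+)(\\).+$)", "\\2", x)
print(username)
# [1] "jim.n.tonic"
```

Note that gsub returns a string unchanged when the pattern does not match, so it helps to filter for lines containing "(@" first, as the code above does with %like%.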
Answer 1: (score: 0)
If I am not mistaken, you can also search for the text between "(@" and ")":
insta <- read.csv('https://www.instagram.com/p/BY0i2O0FxHl/')
colnames(insta) <- "id"
collect <- insta[grep("@", insta$id), ]
reg_loc <- gregexpr("(?<=\\(@)(.*)(?=\\))", collect, perl = TRUE)
unlist(regmatches(x = collect, reg_loc))
# [1] "jim.n.tonic" "jim.n.tonic"
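If the handle is not always wrapped in parentheses, a variation on the same lookbehind idea may be closer to the "everything up to the next space" requirement. This is only a sketch under assumptions: the sample strings are invented, and the character class `[^\\s)]` is a choice made here, not something taken from the answers above.

```r
# Hypothetical sample lines; not taken from the real page.
x <- c("Photo by Jim (@jim.n.tonic) on Instagram",
       "thanks @some.user for the tip")

# After "@", take every character that is neither whitespace nor ")".
# Unlike \\w+, this class does not stop at ".".
loc <- gregexpr("(?<=@)[^\\s)]+", x, perl = TRUE)
usernames <- unlist(regmatches(x, loc))
print(usernames)
# [1] "jim.n.tonic" "some.user"
```

The `perl = TRUE` argument is needed because lookbehind assertions are not supported by R's default regex engine.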