How can I keep usernames that contain "." when using stringr functions?

Time: 2017-09-13 16:19:51

Tags: r string web-scraping extract stringr

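I want to isolate the usernames of Instagram users, and I have been using the @ symbol to identify users in the page source.

My problem is that when a username contains a ".", my code drops everything after it, which leaves me with an incomplete username.

What I need is the complete username, i.e. everything up to the next space.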

    library(stringr)     # str_extract
    library(data.table)  # %like%

    web_page_read <- read.csv('https://www.instagram.com/p/BY0i2O0FxHl/')
    colnames(web_page_read) <- "id"
    # Keep only the lines that contain "@"
    web_page_collect <- web_page_read[web_page_read$id  %like% '@',]
    web_page_collect <- as.data.frame(web_page_collect)
    colnames(web_page_collect) <- "id"
    # \\w+ does not match ".", so the username is cut off at the first dot
    web_page_collect$id <- str_extract(web_page_collect$id, "(?<=@)\\w+")
    # Clean-up attempts: drop text before "@", after ")" and after a space
    web_page_collect$id  <- sub("^[^@]*@","",web_page_collect$id)
    web_page_collect$id  <- gsub("\\).*","",web_page_collect$id)
    web_page_collect$id  <- gsub(" .*","",web_page_collect$id)
    # Strip stray encoding characters
    web_page_collect$id <- gsub('[â]', '', web_page_collect$id)
    web_page_collect$id <- gsub('[???]', '', web_page_collect$id)
    # Keep only the first row
    web_page_collect <- head(web_page_collect,-(nrow(web_page_collect)-1))
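
The truncation can be reproduced on a single string without fetching the page. This is a minimal sketch in base R: the sample line is a made-up stand-in for the page source, and `regmatches`/`regexpr` with `perl = TRUE` are used here only so the snippet needs no packages. The same pattern is the one passed to `str_extract` above.

```r
# Hypothetical sample line; the real page source may look different
x <- "A post shared by Jim (@jim.n.tonic) on Sep 8, 2017"

# \w matches letters, digits and "_" only, so the match ends at the first "."
short <- regmatches(x, regexpr("(?<=@)\\w+", x, perl = TRUE))

print(short)  # "jim"
```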

2 Answers:

Answer 0: (score: 0)

If all you need is the username between the parentheses:

    library(data.table)
    web_page_read <- read.csv('https://www.instagram.com/p/BY0i2O0FxHl/')
    colnames(web_page_read) <- "id"
    web_page_collect <- web_page_read[web_page_read$id  %like% '@',]
    web_page_collect <- as.data.frame(web_page_collect)
    colnames(web_page_collect) <- "id"
    web_page_collect$id <- gsub("(^.+\\(@)(.+)(\\).+$)","\\2",web_page_collect$id)

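This uses capture groups in the regular expression passed to gsub. Loosely translated, the groups are: Group 1, from the start of the string up to "(@"; Group 2, all the characters of the username; Group 3, from ")" to the end of the string. The replacement "\\2" then keeps only the second group.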
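
To see the three groups at work without fetching the page, here is a small sketch on made-up sample lines. The strings and the names `x`, `pattern` and `usernames` are assumptions for illustration, not taken from the real page source.

```r
# Hypothetical sample lines standing in for the scraped page source
x <- c("A post shared by Jim (@jim.n.tonic) on Sep 8, 2017",
       "Photo by Ann (@ann_b.99) at the beach")

# Group 1: start of string up to "(@"
# Group 2: the username
# Group 3: ")" through to the end of the string
pattern <- "(^.+\\(@)(.+)(\\).+$)"

# Replace each whole line with the contents of group 2
usernames <- gsub(pattern, "\\2", x)
print(usernames)
```

Note that group 3 ends in `.+$`, so at least one character has to follow the closing parenthesis. A line that ends right at ")" would not match and gsub would return it unchanged.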

Answer 1: (score: 0)

If I have not misunderstood, you can also search for the text between "(@" and ")":

    insta <- read.csv('https://www.instagram.com/p/BY0i2O0FxHl/')
    colnames(insta) <- "id"
    collect <- insta[grep("@", insta$id), ]

    reg_loc <- gregexpr("(?<=\\(@)(.*)(?=\\))", collect, perl = TRUE)
    unlist(regmatches(x = collect, reg_loc))
    # [1] "jim.n.tonic" "jim.n.tonic"
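
The same lookaround idea can be tried on made-up strings, which also shows two possible variations. This is only a sketch: the sample lines are hypothetical, the lazy `.*?` is swapped in for the greedy `.*` so that several mentions on one line stay separate, and the character class `[\\w.]` assumes that usernames consist of letters, digits, underscores and periods.

```r
# Hypothetical sample lines; the real page source may look different
x <- c("A post shared by Jim (@jim.n.tonic) on Sep 8, 2017",
       "no mention here",
       "Tagged (@ann_b.99) and (@bob.smith) today")

# Lazy .*? stops at the first ")" after each "(@"
loc <- gregexpr("(?<=\\(@).*?(?=\\))", x, perl = TRUE)
usernames <- unlist(regmatches(x, loc))

# Alternative without relying on parentheses: allow "." inside the name
loc_class <- gregexpr("(?<=@)[\\w.]+", x, perl = TRUE)
by_class <- unlist(regmatches(x, loc_class))

print(usernames)
print(by_class)
```

On these sample lines both patterns return the same three names. With the greedy `.*`, the third line would instead yield a single match running from the first "(@" to the last ")".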