Question

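I want to isolate the usernames of Instagram users, and I have been using the @ symbol to identify users in the page source.

My problem is that when a username contains a ".", my code removes everything after it, which leaves me with an incomplete username.

What I need is the full username, that is, everything after the @ up to the first space.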

    library(data.table)  # provides %like%
    library(stringr)     # provides str_extract
    web_page_read <- read.csv('https://www.instagram.com/p/BY0i2O0FxHl/')
    colnames(web_page_read) <- "id"
    web_page_collect <- web_page_read[web_page_read$id  %like% '@',]
    web_page_collect <- as.data.frame(web_page_collect)
    colnames(web_page_collect) <- "id"
    web_page_collect$id <- str_extract(web_page_collect$id, "(?<=@)\\w+")
    web_page_collect$id  <- sub("^[^@]*@","",web_page_collect$id)
    web_page_collect$id  <- gsub(").*","",web_page_collect$id)
    web_page_collect$id  <- gsub(" .*","",web_page_collect$id)
    web_page_collect$id <- gsub('[â]', '', web_page_collect$id)
    web_page_collect$id <- gsub('[???]', '', web_page_collect$id)
    web_page_collect <- head(web_page_collect,-(nrow(web_page_collect)-1))

Answer 1

If all you need is the username between the parentheses:

    library(data.table)
    web_page_read <- read.csv('https://www.instagram.com/p/BY0i2O0FxHl/')
    colnames(web_page_read) <- "id"
    web_page_collect <- web_page_read[web_page_read$id  %like% '@',]
    web_page_collect <- as.data.frame(web_page_collect)
    colnames(web_page_collect) <- "id"
    web_page_collect$id <- gsub("(^.+\\(@)(.+)(\\).+$)","\\2",web_page_collect$id)

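This uses regular-expression capture groups with gsub. Loosely translated, you define the groups as follows. Group 1: from the start of the string through "(@". Group 2: all the characters in between. Group 3: from ")" through the end of the string. The replacement "\\2" then keeps only the second group.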
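
To make the three groups concrete, here is a minimal sketch that runs the same pattern on two made-up strings instead of the live page. The sample lines are assumptions for illustration only; the real page source will look different.

```r
# Hypothetical sample lines (not taken from the actual page source)
lines <- c(
  "Photo by Jim (@jim.n.tonic) on Instagram",
  "Shared by Ann (@ann_b.99) yesterday"
)

# Group 1: start of the string through "(@"
# Group 2: the characters in between, i.e. the username
# Group 3: ")" through the end of the string
pattern <- "(^.+\\(@)(.+)(\\).+$)"

# Replace the whole match with the contents of group 2
usernames <- gsub(pattern, "\\2", lines)
print(usernames)
```

Note that group 3 requires at least one character after the ")". A line that ends directly with ")" does not match, and gsub returns it unchanged.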

Answer 2

If I have not misunderstood, you can also search for the text between "(@" and ")":

    insta <- read.csv('https://www.instagram.com/p/BY0i2O0FxHl/')
    colnames(insta) <- "id"
    collect <- insta[grep("@", insta$id), ]

    reg_loc <- gregexpr("(?<=\\(@)(.*)(?=\\))", collect, perl = TRUE)
    unlist(regmatches(x = collect, reg_loc))
    # [1] "jim.n.tonic" "jim.n.tonic"
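
The lookaround approach can be tried on a single made-up string, which also shows why the "\\w+" pattern from the question cuts the name short. This is a sketch in base R; the sample string is an assumption, not real page content.

```r
# Hypothetical sample line (not taken from the actual page source)
x <- "Photo by Jim (@jim.n.tonic) on Instagram"

# "\\w+" stops at the first "." because "." is not a word character
short <- regmatches(x, regexpr("(?<=@)\\w+", x, perl = TRUE))

# Adding "." to the character class keeps the dotted name intact
full <- regmatches(x, regexpr("(?<=@)[\\w.]+", x, perl = TRUE))

# Lookbehind for "(@" and lookahead for ")", non-greedy in between
between <- regmatches(x, regexpr("(?<=\\(@).*?(?=\\))", x, perl = TRUE))

print(c(short = short, full = full, between = between))
```

The "[\\w.]+" variant does not depend on the parentheses, so it may be the smaller change to the question's code. The same pattern should also work as the second argument of stringr's str_extract, though that is not tested here.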

How can I keep usernames that contain a "." when using stringr functions?
