Question

I am cleaning tweets for a text mining project in R. Some of the tweets contain URL links, like this one:

> inspect(data_corpus[1])

[1] Twirling around in the prettiest winery garden. @Newton Vineyard https://www.instagram.com/p/BVLZnSGDRG1/

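Question 1:

I want to remove the entire link, starting from https (https://www.instagram.com/p/BVLZnSGDRG1/).

I tried the code below, but it removes the link only up to .com, not the complete link: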

toSpace = content_transformer( function(x, pattern) gsub(pattern," ",x) )
data_clean = tm_map(data_corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")


> inspect(data_clean[1])

[1] Twirling around in the prettiest winery garden.  @Newton Vineyard  /p/BVLZnSGDRG1/

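I want to remove the remaining "/p/BVLZnSGDRG1/" as well. How can I do that?

Question 2:

I want to remove screen names that start with @. In the tweet above, however, the screen name is "@Newton Vineyard". When I apply the code below, it removes only @Newton and leaves Vineyard.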

data_clean = tm_map(data_clean, toSpace, "@[a-z,A-Z]*")

Result:

> inspect(data_clean[1])

[1] Twirling around in the prettiest winery garden.     Vineyard  /p/BVLZnSGDRG1/

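Is it possible to remove "Vineyard" as well?

My concern is: what if "Vineyard" is not part of the screen name but is actually part of the tweet text? Is it possible to check whether "Vineyard" belongs to the screen name and remove it only in that case?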

Answer 1

If you want to remove both the URL and the username, you can use

@(?!.*@).*$

This removes the last @ symbol and everything after it. Explanation:

@       # literal character '@'
(?!     # negative lookahead (aka 'not followed by')
  .*    # any number of any character (except newlines)
  @     # literal character '@'
)       # end negative lookahead
.*      # any number of any character (except newlines)
$       # end of line

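This regular expression matches an @ that is not followed by another @ (that is, the last @ on the line), together with everything from it to the end of the line.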
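
As a minimal sketch, the pattern above could be applied in base R roughly as follows. The helper name strip_from_last_at is hypothetical, and the tweet is assumed to be a plain character vector rather than a tm corpus. Because the pattern uses a lookahead, gsub needs perl = TRUE; R's default regex engine does not support that syntax.

```r
# Sample tweet from the question, held as a plain character vector
# (an assumption for this sketch; the question uses a tm corpus).
tweet <- "Twirling around in the prettiest winery garden. @Newton Vineyard https://www.instagram.com/p/BVLZnSGDRG1/"

# Hypothetical helper: drop the last '@' and everything after it,
# then trim the whitespace left at the ends.
# perl = TRUE is required because the pattern contains a lookahead.
strip_from_last_at <- function(x) {
  trimws(gsub("@(?!.*@).*$", "", x, perl = TRUE))
}

cleaned <- strip_from_last_at(tweet)
print(cleaned)
# [1] "Twirling around in the prettiest winery garden."

# Caveat: any tweet text that follows the last mention is removed too.
print(strip_from_last_at("Thanks @Newton for the tour"))
# [1] "Thanks"
```

Inside tm, the same gsub call with perl = TRUE could presumably be wrapped in content_transformer in the same way as the toSpace function from the question. Note that this approach only suits tweets where the mention and link come at the end of the text.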
