Regular expression for web scraping

Date: 2019-07-17 09:36:42

Tags: regex ruby web-scraping

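I am scraping a website and need a regular expression that matches the data below. From this example, I need to capture:

1) "Antoinette Denis", bearing in mind that some names are a single word with no last name.
2) "2019-07-16"
3) The review text, in this case the last paragraph, "I have tried..."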

{\"socialShareUrl\":\"https://au.trustpilot.com/reviews/5d2e47aeccd70b084c6255e8\",\"businessUnitId\":\"5bdc1f534c2c1b0001dc2b39\",\"businessUnitDisplayName\":\"Shapermint\",\"consumerId\":\"5d2e47ad9192678da1522016\",\"consumerName\":\"Antoinette Denis\",\"reviewId\":\"5d2e47aeccd70b084c6255e8\",\"stars\":5}\n\n\n\n\n \n \n\n\n \n \n \n \n Antoinette Davis\n \n \n \n \n 1 review\n \n \n \n\n\n \n\n \n\n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n \n \n\n{\"publishedDate\":\"2019-07-16T21:54:54Z\",\"updatedDate\":null,\"reportedDate\":null}\n\n\n\n \n \n \n\n{\"businessUnitDisplayName\":\"Shapermint\",\"consumerName\":\"Antoinette Denis\",\"informationRequestStatus\":\"none\",\"isVerified\":true,\"verificationSource\":\"invitation\"}\n\n \n\n \n\n \n \n \n Excellent product\n \n \n I have tried spanks and just not comfortable in them but this really works and is very comfortable it was a very pleasant surprise\n \n \n\n \n\n \n \n\n\n \n \n\n \n \n \n \n Useful\n \n \n \n\n \n\n\n \n \n \n \n Share\n \n \n \n \n\n \n \n \n \n Reply

I have these expressions, but I don't know how to make them work together:

pattern_for_name = /"consumerName\\":\\"(?<name>\w* \w*)/
pattern_for_date = /"publishedDate\\":\\"(?<date>\d*-\d*-\d*)/

1 answer:

Answer 0 (score: 1)

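Do not use regular expressions to parse HTML. Here, most of what you are interested in sits inside JSON objects, so use those.

Assuming the whole string is assigned to data, do the following: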

require 'json'
jsons = data.scan(/{.*?}/).map(&JSON.method(:parse))

Now just pull out your data:

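# In the sample above the last JSON object has no "publishedDate" key, so look the object up by key.
[jsons.first["consumerName"], jsons.find { |j| j.key?("publishedDate") }["publishedDate"]]
#⇒ ["Antoinette Denis", "2019-07-16T21:54:54Z"]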
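The question also asks for the bare date and the review text, which the JSON objects do not provide directly. The following is a minimal sketch of one way to get all three values, not a definitive solution. It makes several assumptions: `data` here is a shortened stand-in for the scraped string and contains real double quotes (if the backslashes shown in the question are literally present, they would have to be removed first); the JSON objects are flat, with no nested braces; and the plain text after the last JSON object starts with the review title followed by the review body. The variable names are illustrative only.

```ruby
require "json"
require "date"

# Shortened stand-in for the scraped string (assumed to hold real quotes).
data = <<~TEXT
  {"consumerName":"Antoinette Denis","reviewId":"5d2e47aeccd70b084c6255e8","stars":5}

  Antoinette Davis
  1 review

  {"publishedDate":"2019-07-16T21:54:54Z","updatedDate":null,"reportedDate":null}

  {"businessUnitDisplayName":"Shapermint","consumerName":"Antoinette Denis","isVerified":true}

  Excellent product

  I have tried spanks and just not comfortable in them but this really works and is very comfortable it was a very pleasant surprise

  Useful
  Share
  Reply
TEXT

# Matches one flat JSON object (assumption: no nested braces).
json_pattern = /\{.*?\}/
jsons = data.scan(json_pattern).map { |s| JSON.parse(s) }

# Look objects up by key instead of by position in the page.
name      = jsons.find { |j| j.key?("consumerName") }["consumerName"]
published = jsons.find { |j| j.key?("publishedDate") }["publishedDate"]

# Keep only the calendar date of the ISO 8601 timestamp.
date = Date.parse(published).to_s # "2019-07-16"

# Assumption: after the last JSON object come the title, the review body,
# and then button labels such as "Useful", "Share", "Reply".
tail = data.split(json_pattern).last
_title, review = tail.lines.map(&:strip).reject(&:empty?).first(2)

p [name, date, review]
```

Because the name comes from the JSON value rather than from a `\w* \w*` pattern, single-word names need no special handling. The review-text step is the fragile part, since it depends on page layout; an HTML parser applied to the original markup would be more robust if that markup is available.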