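I am trying to use a regular expression to extract a specific part of a page, but it does not work.
This is the part I want to extract from the page: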
{"clickTrackingParams":"CPcBEJhNIhMIwrDVo4qw3gIVTBnVCh28iAtzKPgd","commandMetadata":{"webCommandMetadata":{"url":"/service_ajax","sendPost":true}},"performCommentActionEndpoint":{"action":"CAUQAhoaVWd4MEdWUGNadTdvclcwT09WdDRBYUFCQWcqC1pNZlAzaERwdjlBMAA4AEoVMTA1MTc3MTgyMDc5MDg5MzQ1ODM4UACKAVQSC1pNZlAzaERwdjlBMixlaHBWWjNnd1IxWlFZMXAxTjI5eVZ6QlBUMVowTkVGaFFVSkJadyUzRCUzRMABAMgBAOABAaICDSj___________8BQAA%3D","clientActions":[{"updateCommentVoteAction":{"voteCount":{"accessibility":{"accessibilityData":{"label":"80 likes"}},"simpleText":"80"},"voteStatus":"LIKE"}}]}}
So far, I have tried:
import requests
import re
r = requests.get('http://rophoto.es/ash.txt')
html_source = r.text
mystrx = re.search(r'^{"clickTrackingParams".*"voteStatus":"LIKE"}}]}}', html_source)
But it does not work for me.
Answer 0 (score: 1)
Try this:
import requests
import re
r = requests.get('http://rophoto.es/ash.txt')
html_source = r.text
fst, snd = '{"clickTrackingParams":', '"voteStatus":"LIKE"}}]}}'
# Find the first occurrence of the end marker
end = html_source.find(snd)
# Get the closest start marker before it (re.escape keeps the marker literal)
start = max(idx.start() for idx in re.finditer(re.escape(fst), html_source) if idx.start() < end)
print(html_source[start:end+len(snd)])
This outputs:
{"clickTrackingParams":"CPcBEJhNIhMIwrDVo4qw3gIVTBnVCh28iAtzKPgd","commandMetadata":{"webCommandMetadata":{"url":"/service_ajax","sendPost":true}},"performCommentActionEndpoint":{"action":"CAUQAhoaVWd4MEdWUGNadTdvclcwT09WdDRBYUFCQWcqC1pNZlAzaERwdjlBMAA4AEoVMTA1MTc3MTgyMDc5MDg5MzQ1ODM4UACKAVQSC1pNZlAzaERwdjlBMixlaHBWWjNnd1IxWlFZMXAxTjI5eVZ6QlBUMVowTkVGaFFVSkJadyUzRCUzRMABAMgBAOABAaICDSj___________8BQAA%3D","clientActions":[{"updateCommentVoteAction":{"voteCount":{"accessibility":{"accessibilityData":{"label":"80 likes"}},"simpleText":"80"},"voteStatus":"LIKE"}}]}}
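The idea above is to locate the first end marker and then take the nearest start marker in front of it. Here is a minimal, network-free sketch of that idea. The function name `extract_block` and the sample string are made up for illustration, and `str.rfind` stands in for the `max(...)` over `re.finditer` used above (it behaves the same for non-overlapping markers). One thing worth knowing: in the snippet above, if the end marker is missing, `find` returns -1, the generator is empty, and `max` raises a `ValueError`. The sketch returns `None` instead.

```python
def extract_block(text, start_marker, end_marker):
    """Return the text from the nearest start_marker before the first
    end_marker up to and including that end_marker, or None if not found.

    Hypothetical helper for illustration; not part of the original answer.
    """
    # Position of the first end marker
    end = text.find(end_marker)
    if end == -1:
        return None
    # Last start marker that lies entirely before the end marker
    start = text.rfind(start_marker, 0, end)
    if start == -1:
        return None
    return text[start:end + len(end_marker)]


# Made-up sample data: two candidate start markers, one end marker
sample = 'x{"a":1}y{"a":2,"voteStatus":"LIKE"}z'
result = extract_block(sample, '{"a":', '"voteStatus":"LIKE"}')
print(result)
```

With the real page you would pass `html_source`, `fst` and `snd` instead of the sample values.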
If you want the second occurrence, you can try the following:
import requests
import re
r = requests.get('http://rophoto.es/ash.txt')
html_source = r.text
fst, snd = '{"clickTrackingParams":', '"voteStatus":"LIKE"}}]}}'
def find_nth(string, to_find, n):
    """
    Finds the nth match (0-based) of to_find in string
    """
    # find all occurrences (re.escape treats to_find as literal text)
    matches = [idx.start() for idx in re.finditer(re.escape(to_find), string)]
    # return nth match
    return matches[n]
# finds second match
end = find_nth(html_source, snd, 1)
# Gets closest index to end
start = max(idx.start() for idx in re.finditer(re.escape(fst), html_source) if idx.start() < end)
print(html_source[start:end+len(snd)])
Note: in the second example, if you request a match beyond the number of matches found, you will get an IndexError. You need to handle that case yourself.
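One possible way to handle that case is sketched below. `find_nth_safe` is a hypothetical variant, not something from the original answer: it returns `None` when `n` is out of range instead of raising, and it also rejects negative `n` (a plain `matches[n]` would silently count from the end). Using `re.escape` is my own choice so the search text is matched literally.

```python
import re


def find_nth_safe(string, to_find, n):
    """Return the start index of the nth (0-based) literal match of
    to_find in string, or None if there are not that many matches.

    Hypothetical helper for illustration.
    """
    # Collect the start index of every literal occurrence
    matches = [m.start() for m in re.finditer(re.escape(to_find), string)]
    # Only accept indices that actually exist
    if 0 <= n < len(matches):
        return matches[n]
    return None


second = find_nth_safe("abcabc", "b", 1)
missing = find_nth_safe("abcabc", "b", 2)
print(second, missing)
```

The caller can then check for `None` before slicing, rather than wrapping the lookup in `try`/`except IndexError`. Either approach should work; which one reads better depends on the surrounding code.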