Extracting a specific part of a page's source with Python

Time: 2018-10-31 11:09:12

Tags: python json regex

I am trying to extract a specific part of a page using a regular expression, but I cannot get it to work.

This is the part I want to extract from the page:

{"clickTrackingParams":"CPcBEJhNIhMIwrDVo4qw3gIVTBnVCh28iAtzKPgd","commandMetadata":{"webCommandMetadata":{"url":"/service_ajax","sendPost":true}},"performCommentActionEndpoint":{"action":"CAUQAhoaVWd4MEdWUGNadTdvclcwT09WdDRBYUFCQWcqC1pNZlAzaERwdjlBMAA4AEoVMTA1MTc3MTgyMDc5MDg5MzQ1ODM4UACKAVQSC1pNZlAzaERwdjlBMixlaHBWWjNnd1IxWlFZMXAxTjI5eVZ6QlBUMVowTkVGaFFVSkJadyUzRCUzRMABAMgBAOABAaICDSj___________8BQAA%3D","clientActions":[{"updateCommentVoteAction":{"voteCount":{"accessibility":{"accessibilityData":{"label":"80 likes"}},"simpleText":"80"},"voteStatus":"LIKE"}}]}}

This is what I have tried so far:

import requests
import re


r = requests.get('http://rophoto.es/ash.txt')
html_source = r.text

mystrx = re.search(r'^{"clickTrackingParams".*"voteStatus":"LIKE"}}]}}', html_source)

But it does not work for me.

1 answer:

Answer 0 (score: 1)

Try this:

import requests
import re

r = requests.get('http://rophoto.es/ash.txt')
html_source = r.text

fst, snd = '{"clickTrackingParams":', '"voteStatus":"LIKE"}}]}}'

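# Find the first occurrence of the closing marker
end = html_source.find(snd)

# Find the closest opening marker before it
# (re.escape treats the marker as literal text, not as a regex)
start = max(idx.start() for idx in re.finditer(re.escape(fst), html_source) if idx.start() < end)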

print(html_source[start:end+len(snd)])

Which outputs:

{"clickTrackingParams":"CPcBEJhNIhMIwrDVo4qw3gIVTBnVCh28iAtzKPgd","commandMetadata":{"webCommandMetadata":{"url":"/service_ajax","sendPost":true}},"performCommentActionEndpoint":{"action":"CAUQAhoaVWd4MEdWUGNadTdvclcwT09WdDRBYUFCQWcqC1pNZlAzaERwdjlBMAA4AEoVMTA1MTc3MTgyMDc5MDg5MzQ1ODM4UACKAVQSC1pNZlAzaERwdjlBMixlaHBWWjNnd1IxWlFZMXAxTjI5eVZ6QlBUMVowTkVGaFFVSkJadyUzRCUzRMABAMgBAOABAaICDSj___________8BQAA%3D","clientActions":[{"updateCommentVoteAction":{"voteCount":{"accessibility":{"accessibilityData":{"label":"80 likes"}},"simpleText":"80"},"voteStatus":"LIKE"}}]}}
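Since the extracted text is a JSON object, it can be handed to `json.loads` once it has been sliced out. The following is only a sketch: `extracted` is a hypothetical, shortened stand-in for `html_source[start:end+len(snd)]`, keeping the same key layout as the output above.

```python
import json

# Hypothetical shortened sample with the same nesting as the extracted text;
# in the real script this would be html_source[start:end+len(snd)].
extracted = (
    '{"clickTrackingParams":"abc",'
    '"performCommentActionEndpoint":{"action":"xyz","clientActions":'
    '[{"updateCommentVoteAction":{"voteCount":{"simpleText":"80"},'
    '"voteStatus":"LIKE"}}]}}'
)

# Parse the JSON text into nested dicts and lists
data = json.loads(extracted)

# Walk down to the vote information
action = data["performCommentActionEndpoint"]["clientActions"][0]["updateCommentVoteAction"]
vote_count = action["voteCount"]["simpleText"]
vote_status = action["voteStatus"]

print(vote_count, vote_status)
```

If the slice is not valid JSON (for example because the markers matched in the wrong place), `json.loads` raises `json.JSONDecodeError`, which is a useful check that the extraction worked.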

If you want the second occurrence, you can try the following:

import requests
import re

r = requests.get('http://rophoto.es/ash.txt')
html_source = r.text

fst, snd = '{"clickTrackingParams":', '"voteStatus":"LIKE"}}]}}'

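def find_nth(string, to_find, n):
    """
    Return the start index of the nth (0-based) match of to_find in string
    """

    # Find all occurrences (re.escape treats to_find as literal text)
    matches = [idx.start() for idx in re.finditer(re.escape(to_find), string)]

    # Return the nth match; raises IndexError if there are fewer matches
    return matches[n]

# Find the second match of the closing marker (index 1)
end = find_nth(html_source, snd, 1)

# Find the closest opening marker before end
start = max(idx.start() for idx in re.finditer(re.escape(fst), html_source) if idx.start() < end)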

print(html_source[start:end+len(snd)])

Note: in the second example, asking for a match beyond the ones that were found raises an IndexError. You need to handle that case yourself.
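One possible way to handle that case is to return a sentinel instead of raising. This is a sketch only: `find_nth_or_default` and its `default` parameter are hypothetical names, not part of the answer above, and the sample string is made up for illustration.

```python
import re

def find_nth_or_default(string, to_find, n, default=-1):
    """
    Return the start index of the nth (0-based) literal match of to_find,
    or default when there are fewer than n + 1 matches.
    """
    matches = [m.start() for m in re.finditer(re.escape(to_find), string)]
    return matches[n] if 0 <= n < len(matches) else default

# Made-up sample: "ab" starts at indices 0, 4 and 8
sample = "ab--ab--ab"
second = find_nth_or_default(sample, "ab", 1)
missing = find_nth_or_default(sample, "ab", 5)
print(second, missing)
```

Returning `-1` mirrors what `str.find` does when nothing is found, so the caller can test `end == -1` before slicing.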