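I am trying to extract every string that starts with the pattern '{"comments_disabled":' and ends with '}},', and append whatever lies between those two patterns to a list. (There may be more than 100 such matches.)
The problem is that the code below keeps extracting only the first occurrence. How can I make it skip what has already been appended to the userpost list and move on to the next match?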
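from urllib.request import urlopen  # import was missing; assumes Python 3
from bs4 import BeautifulSoup

page = urlopen("https://www.instagram.com/explore/tags/fun/")
soup = BeautifulSoup(page, "html.parser")
title = soup.title
# Flatten every <script type="text/javascript"> tag into one string
script = str(soup.findAll('script', type="text/javascript"))
userpost = list()
# This appends the same first occurrence on every iteration
for text in script:
    userpost.append(script[script.find('{"comments_disabled":')
                           :script.find('}},') + 2])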
Answer 0 (score: 1)
Try re.findall():
userpost = re.findall(r'{"comments_disabled":(.*?)}},', script)
Tested script:
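import re

# Sample text with three blocks; the last one spans several lines
script = '''
{"comments_disabled": one two }},
alpha beta
{"comments_disabled": three four }},
{"comments_disabled":
five six
}},
'''
# Non-greedy (.*?) stops at the nearest '}},'; re.DOTALL lets '.' match newlines
userpost = re.findall(r'{"comments_disabled":(.*?)}},', script, re.DOTALL)
print(userpost)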
Output:
[' one two ', ' three four ', '\nfive six\n']
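As for why the loop in the question keeps returning the first match: str.find() searches from index 0 unless it is given a start position, and `for text in script` iterates over the characters of the string, so the same slice is appended once per character. If you would rather stay with find() than use a regex, a minimal sketch is to pass the end of the previous match as the start of the next search. The helper name `extract_between` and the `sample` string are made up for illustration, and unlike the regex above this version keeps both markers in each result:

```python
def extract_between(text, start_marker, end_marker):
    """Return every substring running from start_marker through end_marker."""
    matches = []
    pos = 0  # where the next search begins
    while True:
        start = text.find(start_marker, pos)
        if start == -1:
            break  # no further start marker
        end = text.find(end_marker, start + len(start_marker))
        if end == -1:
            break  # start marker without a closing marker
        stop = end + len(end_marker)
        matches.append(text[start:stop])
        pos = stop  # resume after this match instead of at index 0
    return matches


# Hypothetical sample data, not real Instagram output
sample = '{"comments_disabled": 1 }}, x {"comments_disabled": 2 }},'
userpost = extract_between(sample, '{"comments_disabled":', '}},')
print(userpost)
```

For the actual page you would call it on the `script` string from the question. The regex is shorter, but this form makes the "move on to the next occurrence" step explicit.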