Is there any way other than Selenium to get the "a href" values from "https://www.instagram.com/explore/tags/SOMEHASHTAGHERE/"? With the API I can only get image links of this kind: https://instagram.fhel6-1.fna.fbcdn.net/vp/b6c669ed3b5be0dc9c183412d738acac/5CEC3935/t51.2885-15/e35/c119.0.842.842/s240x240/49787501_1587577534678419_6308372780046107029_n.jpg?_nc_ht=instagram.fhel6-1.fna.fbcdn.net I don't need those; I want links like "https://www.instagram.com/p/BuGpLWsFioq/". I am trying to do this with bs4 and the 'lxml' parser, but the HTML I get back contains no "a href" elements. Can this information be scraped at all? Evidently JavaScript generates the extra content. So is there a way to scrape this data without the Selenium WebDriver?
Answer 0 (score: 1)
All the information you are looking for is inside a <script type="text/javascript"> tag.
You can extract it with the following regular expression:
    from bs4 import BeautifulSoup as soup
    import requests
    import json
    import re

    def _get_json_footer(html):
        # The page data is embedded as JSON between "entry_data": and ,"gatekeepers"
        s = str(html)
        r = re.compile('"entry_data":(.*?),"gatekeepers"')
        m = r.search(s)
        if m:
            return json.loads(m.group(1))
        return {}  # pattern not found, e.g. the page layout changed

    url = 'https://www.instagram.com/explore/tags/SOMEHASHTAGHERE/'
    page = requests.get(url)
    html = soup(page.text, 'html.parser')
    json_footer = _get_json_footer(html)
    tagpage = json_footer.get('TagPage')
You can then navigate the tagpage dictionary to get the data.
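Before navigating the dictionary it can help to see which keys it actually contains. The following is only a rough sketch: describe is a hypothetical helper (not part of any library), and sample is a made-up stand-in for the parsed page data, assumed to have the same nesting as the real payload.

```python
def describe(obj, depth=2, indent=0):
    """Return lines summarising the keys of a nested dict/list down to `depth` levels."""
    lines = []
    pad = "  " * indent
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{pad}{key}: {type(value).__name__}")
            if depth > 1:
                lines.extend(describe(value, depth - 1, indent + 1))
    elif isinstance(obj, list) and obj:
        # Only the first element is inspected; the rest are assumed to share its shape.
        lines.append(f"{pad}[0]: {type(obj[0]).__name__} (of {len(obj)})")
        if depth > 1:
            lines.extend(describe(obj[0], depth - 1, indent + 1))
    return lines

# Hypothetical stand-in for the parsed page data; the real payload is much larger.
sample = {"TagPage": [{"graphql": {"hashtag": {"name": "example"}}}]}
print("\n".join(describe(sample, depth=4)))
```

Calling describe(json_footer, depth=4) on the real data should list the keys needed to reach the posts, assuming the page still embeds its data in this form.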
Edit:
To get the post links, you just need to navigate the tagpage dictionary:
    from bs4 import BeautifulSoup as soup
    import requests
    import json
    import re

    def _get_json_footer(html):
        # The page data is embedded as JSON between "entry_data": and ,"gatekeepers"
        s = str(html)
        r = re.compile('"entry_data":(.*?),"gatekeepers"')
        m = r.search(s)
        if m:
            return json.loads(m.group(1))
        return {}  # pattern not found, e.g. the page layout changed

    url = 'https://www.instagram.com/explore/tags/SOMEHASHTAGHERE/'
    page = requests.get(url)
    html = soup(page.text, 'html.parser')
    json_footer = _get_json_footer(html)
    tagpage = json_footer.get('TagPage')

    links = []
    # Fall back to an empty dict so a missing 'TagPage' yields no links instead of an error
    first_page = tagpage[0] if tagpage else {}
    edges = first_page.get('graphql', {}).get('hashtag', {}).get('edge_hashtag_to_media', {}).get('edges', [])
    for e in edges:
        # Each post URL is built from the node's shortcode
        links.append("https://www.instagram.com/p/" + e.get('node', {}).get('shortcode', '') + '/')
    print(links)
Output:
['https://www.instagram.com/p/Bsh4UcdBRvY/', 'https://www.instagram.com/p/Bq8vAMRHtGB/', 'https://www.instagram.com/p/Bn_vfeWhcYL/', 'https://www.instagram.com/p/Bm1QRb2ntWL/', 'https://www.instagram.com/p/Bj5pLHAnVuY/', 'https://www.instagram.com/p/Bfn2QWiHKK5/', 'https://www.instagram.com/p/BfC4ZnTntq0/', 'https://www.instagram.com/p/BeomaB6Hb8-/', 'https://www.instagram.com/p/vYszwjyLdB/', 'https://www.instagram.com/p/sQI6Jfpi3f/', 'https://www.instagram.com/p/sO9oXPMr6K/', 'https://www.instagram.com/p/qzvHuCHUgH/', 'https://www.instagram.com/p/WdlKcCBW3w/']
You can replace the key edge_hashtag_to_media with edge_hashtag_to_top_posts to get other values.
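One way to make that key swap explicit is to pass the key as a parameter. This is a minimal sketch, not the answerer's code: post_links is a hypothetical helper, and sample_tagpage with its shortcodes "AAA" and "BBB" is made-up data that only mimics the nesting used above.

```python
def post_links(tagpage, edge_key="edge_hashtag_to_media"):
    """Build post URLs from the shortcodes found under `edge_key`."""
    if not tagpage:
        return []
    hashtag = tagpage[0].get("graphql", {}).get("hashtag", {})
    edges = hashtag.get(edge_key, {}).get("edges", [])
    links = []
    for edge in edges:
        shortcode = edge.get("node", {}).get("shortcode")
        if shortcode:  # skip entries without a shortcode instead of emitting a broken URL
            links.append(f"https://www.instagram.com/p/{shortcode}/")
    return links

# Made-up data with the same nesting as the tagpage list used above.
sample_tagpage = [{"graphql": {"hashtag": {
    "edge_hashtag_to_media": {"edges": [{"node": {"shortcode": "AAA"}}, {"node": {}}]},
    "edge_hashtag_to_top_posts": {"edges": [{"node": {"shortcode": "BBB"}}]},
}}}]

print(post_links(sample_tagpage))
print(post_links(sample_tagpage, "edge_hashtag_to_top_posts"))
```

With the real data you would call post_links(tagpage) or post_links(tagpage, "edge_hashtag_to_top_posts"), assuming the page still exposes both keys.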
Answer 1 (score: 0)
Let me know if this is what you are after.
    from bs4 import BeautifulSoup
    import requests

    resp = requests.get("https://www.instagram.com/explore/tags/SOMEHASHTAGHERE/")
    html = resp.content
    soup = BeautifulSoup(html, 'html.parser')
    # Collect every <link rel="alternate"> element that carries an href attribute
    for a in soup.find_all('link', rel='alternate', href=True):
        print("Found the URL:", a['href'])