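I've written a script in Python to extract the username, followers, and posts of some Instagram accounts. When I run the script, it behaves strangely. To be clearer, I tried it with three accounts, and this is the result I got: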
('backstreetboys', '2.2m Followers', '151 Posts')
('akon', '', '')
('louisnpearls', '', '080 posts')
What I expected to get:
('backstreetboys', '2.2m Followers', '151 Posts')
('akon', '6.4m followers', '1,700 posts')
('louisnpearls', '55.5k followers', '080 posts')
The script I tried:
import re
import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.instagram.com/backstreetboys/',
    'https://www.instagram.com/akon/',
    'https://www.instagram.com/louisnpearls/'
]

def get_instagram_info(url):
    res = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})
    soup = BeautifulSoup(res.text,"lxml")
    username = soup.select_one("meta[property='al:ios:url']").get("content").split("=")[-1]
    try:
        desc = soup.select_one("meta[property='og:description']").get("content")
    except Exception:
        desc = ""
    try:
        followers = re.findall(r".*(?<=Followers)",desc,re.I)[0]
    except Exception:
        followers = ""
    try:
        posts = re.findall(r"[^,]+(?<=Posts)",desc,re.I)[0]
    except Exception:
        posts = ""
    return username,followers,posts

if __name__ == '__main__':
    for url in urls:
        print(get_instagram_info(url))
What changes should I make so that the script fetches the fields above correctly using requests?
Answer 0 (score: 1)
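If you look at the meta description your script extracts, the numbers you are after are not always in it. Your approach may work for some accounts but not for others. My approach uses the JSON data stored in the page source instead. Also, I believe you could use the Instagram API if you want to look into that.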
import json
import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.instagram.com/backstreetboys/',
    'https://www.instagram.com/akon/',
    'https://www.instagram.com/louisnpearls/'
]

def get_instagram_info(url):
    res = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, "lxml")
    # Find the <script> whose text begins with "window._sharedData"
    script_data = [script.text for script in soup.find_all('script') if script.text[:18] == 'window._sharedData'][0]
    # Drop the leading "window._sharedData = " (21 characters) and the trailing ";"
    script_json = json.loads(script_data[21:-1])
    user = script_json['entry_data']['ProfilePage'][0]['graphql']['user']
    username = user['username']
    followers = user['edge_followed_by']['count']
    posts = user['edge_owner_to_timeline_media']['count']
    return username, followers, posts

if __name__ == '__main__':
    for url in urls:
        print(get_instagram_info(url))

Output:
('backstreetboys', 2279332, 2152)
('akon', 6476386, 1700)
('louisnpearls', 55513, 1080)
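
The `script_data[21:-1]` slice in the answer depends on the exact length of the `window._sharedData = ` prefix and on the script ending in a single `;`. As a rough sketch only, the same JSON blob could be pulled out with a regular expression instead, returning `None` when the blob or the profile keys are missing (for example, if the request is served a login page). `SAMPLE_HTML`, `SHARED_DATA_RE`, and `parse_profile` are hypothetical names made up for this illustration, and the sample markup is a trimmed stand-in that assumes the page still embeds the structure used in the answer.

```python
import json
import re

# Hypothetical, heavily trimmed stand-in for a profile page's HTML.
SAMPLE_HTML = """
<html><body>
<script type="text/javascript">window._sharedData = {"entry_data": {"ProfilePage": [{"graphql": {"user": {"username": "example_user", "edge_followed_by": {"count": 55513}, "edge_owner_to_timeline_media": {"count": 1080}}}}]}};</script>
</body></html>
"""

# Capture the object between "window._sharedData =" and the ";</script>" that closes it.
SHARED_DATA_RE = re.compile(
    r"window\._sharedData\s*=\s*(\{.*?\})\s*;\s*</script>", re.DOTALL
)


def parse_profile(html):
    """Return (username, followers, posts), or None if the data is not present."""
    match = SHARED_DATA_RE.search(html)
    if match is None:
        return None
    data = json.loads(match.group(1))
    try:
        user = data["entry_data"]["ProfilePage"][0]["graphql"]["user"]
    except (KeyError, IndexError):
        # The blob exists but does not describe a profile page.
        return None
    return (
        user["username"],
        user["edge_followed_by"]["count"],
        user["edge_owner_to_timeline_media"]["count"],
    )


print(parse_profile(SAMPLE_HTML))
print(parse_profile("<html><body>Login</body></html>"))
```

In `get_instagram_info`, `parse_profile(res.text)` could stand in for the BeautifulSoup lookup and the slicing, with the caller checking for `None` before unpacking.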