As a preamble: I want to build a Twitter scraper that can notify me of new tweets faster than TweetDeck and faster than the streaming API can deliver them. The problem is that when I request the page of the account I want to monitor, the program's output does not change when that account posts a new tweet. Currently my code makes multiple asynchronous requests to https://twitter.com/username and returns the top two tweets (including the pinned tweet, if there is one). How can I adjust the requests so that the output picks up new tweets while the program is running?
I'm still trying to get my head around the aiohttp library, so I haven't been able to experiment much, and the changes I have tried haven't worked.
import requests
import re
import time
import aiohttp
import asyncio
from bs4 import BeautifulSoup as bs


async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()


# takes a username and the number of tweets to print
async def get_recent(username, n):
    base_link = 'https://twitter.com'
    url = base_link + '/' + username
    async with aiohttp.ClientSession() as session:
        data_text = await fetch(session, url)
        # data = requests.get(url)
        recent_tweets = []
        html = bs(data_text, 'html.parser')
        # get timeline
        timeline = html.select('#timeline li.stream-item')
        # DEBUG: writes the exact HTML we're working with to a file, nicely
        # formatted. Uncomment the next two lines to do so.
        # with open('html.html', 'w', encoding='utf-8') as f_out:
        #     f_out.write(html.prettify())
        for tweet in timeline[:n]:
            # PARSE STUFF [deleted for clarity]
            # output to a list of dictionaries
            recent_tweets.append({"id": tweet_id, "text": tweet_text,
                                  "link_to_tweet": tweet_link,
                                  "links": in_tweet_links,
                                  "link_to_pic": pic_link})
        print(recent_tweets)
Then, in my main function:
loop = asyncio.get_event_loop()
all_groups = asyncio.gather(*[get_recent('username', 2) for _ in range(20)])
results = loop.run_until_complete(all_groups)
As I understand it, this should make 20 requests and give me the top 2 tweets from the corresponding timeline each time. But if I post a tweet while the program is running, the output does not reflect the new tweet until I stop the program and run it again.
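For reference, this is the direction I've been trying to go, though I'm not sure it's the right way to use asyncio for this: a loop that re-requests the timeline every few seconds and only prints when the newest tweet changes. The poll_user wrapper, the 5-second interval, and having get_recent return the list instead of printing it are all my own guesses, not anything taken from the aiohttp documentation.

async def poll_user(username, n, interval=5):
    # Hypothetical sketch: assumes get_recent is changed to return
    # recent_tweets instead of printing it.
    last_seen_id = None
    while True:
        tweets = await get_recent(username, n)
        if tweets and tweets[0]["id"] != last_seen_id:
            last_seen_id = tweets[0]["id"]
            print(tweets)  # only print when the newest tweet has changed
        await asyncio.sleep(interval)  # wait before polling again

loop = asyncio.get_event_loop()
loop.run_until_complete(poll_user('username', 2))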