Background:
I want to monitor 100 URLs (take a snapshot, and store the content if it differs from the previous snapshot). My plan is to scan each of them with urllib.request every x minutes, e.g. x = 5, nonstop.
So I can't use a single for loop plus sleep, because I want to start detecting URL1 and then start URL2 almost at the same time:
while True:
    for url in urlList:
        do_detection()
    time.sleep(sleepLength)
So should I use a pool? But I should limit the number of threads to what the CPU can handle (I can't set it to 100 threads just because I have 100 URLs).
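Capping the worker count while still feeding in all 100 URLs can be sketched like this (a minimal illustration only: `probe` is a stand-in for a real `urllib.request` fetch, and the URLs are placeholders):

```python
import threading
from multiprocessing.pool import ThreadPool

def probe(url):
    # Stand-in for a real urllib.request.urlopen(url).read() call;
    # here it just reports which worker thread handled the URL.
    return (url, threading.current_thread().name)

urls = ["http://example.com/page%d" % i for i in range(100)]  # placeholder URLs

with ThreadPool(4) as pool:  # at most 4 worker threads, regardless of list size
    results = pool.map(probe, urls)  # results come back in input order

worker_names = {name for _, name in results}
print(len(results), len(worker_names))  # 100 results, handled by at most 4 threads
```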
My question:
Even if I can send the 100 URLs in the list to a ThreadPool(4) with four threads, how do I design it so that each thread handles 100/4 = 25 URLs? That is, a thread probes URL1, sleeps (300 s) before probing URL1 again, and in the meantime works through URL2 ... URL25 and comes back to URL1...? I don't want each URL to wait a full cycle of 5 minutes * 25.
Pseudocode or an example would help a lot! I can't find or think of a way to make looper() and detector() work as needed.
(I think How to scrap multiple html page in parallel with beautifulsoup in python? is close, but not an exact answer.)
Maybe each thread could work like this? I'm still trying to figure out how to split the 100 items across the threads. pool.map(func, iterable[, chunksize]) takes a list, and I could set chunksize to 25.
def one_thread(urls):  # urls: this thread's chunk of 25 URLs
    for url in urls[0:25]:
        CurrentDetect(url)
    if 300 - time_elapsed > 0:
        remain_sleeping = 300 - time_elapsed
    else:
        remain_sleeping = 0
    sleep(remain_sleeping)
    for url in urls[0:25]:
        NextDetect(url)
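A runnable version of that per-thread loop might look like the following sketch. It assumes `detect` is the real snapshot step (passed in as a callable); `one_pass` probes the chunk once and returns how long is left of the 300-second period, so the thread can sleep only the remainder:

```python
import time

def one_pass(url_chunk, detect, period=300):
    """Probe every URL in this thread's chunk once, then report how much
    of the period is left to sleep before the next pass."""
    start = time.time()
    for url in url_chunk:
        detect(url)
    elapsed = time.time() - start
    return max(period - elapsed, 0)  # never sleep a negative amount

def one_thread(url_chunk, detect, period=300):
    # Each worker loops forever over only its own chunk (e.g. 25 of the 100 URLs).
    while True:
        time.sleep(one_pass(url_chunk, detect, period))
```

This way each URL is re-probed roughly every `period` seconds, not every `period * len(chunk)` seconds.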
The non-working code I'm trying to write:
import time
import urllib.request as req
from multiprocessing.pool import ThreadPool

def url_reader(url="http://stackoverflow.com"):
    try:
        f = req.urlopen(url)
        return f.read()
    except Exception as err:
        print(err)

def save_state(content):
    pass  # persist the snapshot somewhere
    return []

def looper(urlList, sleepLength=720):
    latest_saved = []
    for url in urlList:  # initial save
        latest_saved.append(save_state(url_reader(url)))  # returns a list
    while True:
        pool = ThreadPool(4)
        results = pool.map(url_reader, urlList)
        time.sleep(sleepLength)  # how to parallelize this? with 100 urls, does one loop take 100*20 min?
        detector(urlList)  # use the last saved state to compare?

def detector(urlList):
    for url in urlList:
        contentFirst = url_reader(url)
        contentNext = url_reader(url)
        if contentFirst != contentNext:
            save_state(contentFirst)
            save_state(contentNext)
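Putting the pieces together, one possible working sketch is below. It is an illustration, not the definitive design: `fetch` is a stub standing in for the real `urllib.request` call, the URLs are placeholders, and the change check compares SHA-256 hashes of the page bodies (my own choice, not something from the question):

```python
import hashlib
import threading
import time

PERIOD = 300        # re-probe each URL every 5 minutes
NUM_THREADS = 4
snapshots = {}      # url -> hash of the last content seen
lock = threading.Lock()

def fetch(url):
    # Stub: replace with urllib.request.urlopen(url).read() in real use.
    return b"page body for " + url.encode()

def detect(url):
    # Hash the current content and record it only if it changed.
    digest = hashlib.sha256(fetch(url)).hexdigest()
    with lock:
        if snapshots.get(url) != digest:
            snapshots[url] = digest  # a real version would persist the body too

def worker(chunk, passes=None):
    # passes=None loops forever; a finite number is handy for testing.
    done = 0
    while True:
        start = time.time()
        for url in chunk:
            detect(url)
        done += 1
        if passes is not None and done >= passes:
            break
        time.sleep(max(PERIOD - (time.time() - start), 0))

if __name__ == "__main__":
    urls = ["http://example.com/page%d" % i for i in range(100)]  # placeholders
    chunks = [urls[i::NUM_THREADS] for i in range(NUM_THREADS)]   # 4 chunks of 25
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # runs until interrupted
```

Each thread owns a fixed chunk of 25 URLs and sleeps only the remainder of its 300-second period, so no URL waits the full 5 min * 25 cycle.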
Answer 0 (score: 0):
You need to install requests.
To install it, run:
pip install requests