How to control threads when continuously probing multiple URLs with a small number of threads

Time: 2018-08-23 02:33:20

Tags: python multithreading

Background

I want to monitor 100 URLs (take a snapshot, and store the content if it differs from the previous one). My plan is to scan each of them with urllib.request every x minutes, e.g. x = 5, nonstop.

So I can't use a single for loop with a sleep, because I want to start detecting URL1 and then start URL2 almost simultaneously:

while True:
  for url in urlList:
    do_detection()
    time.sleep(sleepLength)

So should I use a pool? But I should limit the threads to a small number that the CPU can handle (I can't set it to 100 threads just because I have 100 URLs).

My question:

Even if I can send the 100 URLs in the list to ThreadPool(4) with four threads, how do I design it so that each thread handles 100/4 = 25 URLs: a thread probes URL1, then works through URL2 ... URL25, and only comes back to URL1 once sleep(300) has elapsed since its last probe of URL1? I don't want each URL to wait out an entire 5 min * 25 cycle.

Pseudocode or an example would help a lot! I can't find or figure out a way to make looper() and detector() behave as needed.

(I think How to scrap multiple html page in parallel with beautifulsoup in python? comes close, but it isn't the exact answer.)

Maybe something like this per thread? I'll try to work out how to split the 100 items across the threads now. pool.map(func, iterable[, chunksize]) takes a list, and I could set chunksize to 25.

def one_thread(urls):
    # urls is this thread's share, e.g. 25 of the 100 URLs
    start = time.time()
    for url in urls[0:25]:
        CurrentDetect(url)

    # sleep only for whatever is left of the 300 s cycle
    time_elapsed = time.time() - start
    if 300 - time_elapsed > 0:
        remain_sleeping = 300 - time_elapsed
    else:
        remain_sleeping = 0

    time.sleep(remain_sleeping)

    for url in urls[0:25]:
        NextDetect(url)
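
A fuller sketch of this per-thread idea, for illustration only: split the 100 URLs into four chunks of 25 and give each of four worker threads its own chunk to loop over on a 300-second cadence. The worker() and do_detection() helpers below are my assumed placeholders, not code from the question:

import time
from multiprocessing.pool import ThreadPool

def do_detection(url):
    # Hypothetical placeholder: fetch the URL and compare it
    # with the stored snapshot.
    pass

def worker(chunk):
    # Each worker owns one chunk of URLs and re-probes it every 300 seconds.
    while True:
        start = time.time()
        for url in chunk:
            do_detection(url)
        # Sleep only for whatever remains of the 300 s cycle.
        time.sleep(max(0, 300 - (time.time() - start)))

urlList = ["http://example.com/%d" % i for i in range(100)]  # stand-in URLs
chunks = [urlList[i:i + 25] for i in range(0, len(urlList), 25)]  # 4 chunks of 25

pool = ThreadPool(4)
pool.map(worker, chunks)  # runs forever: each thread cycles over its own 25 URLs

This way no URL waits on another chunk's sleep; each URL is re-probed roughly every 300 seconds by its own thread.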

The non-working code I'm trying to write:

from multiprocessing.pool import ThreadPool
import urllib.request as req
import time

def url_reader(url="http://stackoverflow.com"):
    # Fetch the page and return its content; report errors instead of raising.
    try:
        f = req.urlopen(url)
        return f.read()
    except Exception as err:
        print(err)

def save_state(content):
    # Placeholder: persist a snapshot; returns an empty list for now.
    return []

def looper(urlList, sleepLength=720):
    Latest_saved = []
    for url in urlList:  # initial save
        Latest_saved.append(save_state(url_reader(url)))  # returns a list
    while True:
        pool = ThreadPool(4)
        results = pool.map(url_reader, urlList)
        time.sleep(sleepLength)  # how to parallelize this? if we have 100 urls, does one loop take 100*20 min?
        detector(urlList)  # ? compare against the last saved states?

def detector(urlList):
    for url in urlList:
        contentFirst = url_reader(url)
        contentNext = url_reader(url)
        if contentFirst != contentNext:
            save_state(contentFirst)
            save_state(contentNext)
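
Once looper() works, the intended call would be something like this (the URL values here are just placeholders):

urls = ["http://stackoverflow.com", "http://example.com"]
looper(urls, sleepLength=300)  # re-probe every 5 minutes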

1 answer:

Answer 0 (score: 0)

You need to install requests if you want to use the code below:


pip install requests
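
For illustration only, a minimal requests-based probe-and-compare could look like the sketch below; fetch() is a hypothetical helper, and none of this code comes from the original answer:

import time
import requests

def fetch(url):
    # Hypothetical helper: return the page body, or None on error.
    try:
        return requests.get(url, timeout=10).text
    except requests.RequestException as err:
        print(err)
        return None

url = "http://stackoverflow.com"
first = fetch(url)
time.sleep(300)      # wait one 5-minute cycle
second = fetch(url)
if first != second:  # content changed since the last snapshot
    print("content changed, saving new snapshot")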