Python Beautiful Soup

Time: 2017-04-18 21:53:35

Tags: python beautifulsoup

I am learning Beautiful Soup in Python and trying to scrape the website "https://www.twitteraudit.com/". When I enter a Twitter ID in the search bar, some IDs return a result within a fraction of a second, but others take about a minute to process. In that case, how can I parse the HTML only after the results have finished loading? I tried looping over it, but that does not work. What I have noticed is that if I open the link in a browser and let it finish loading, the result is cached, and the next time I run the script with the same ID it works perfectly.

Can anyone help me with this problem? I would appreciate the help. My code is attached below:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import re

def HTML(myURL):
    # Fetch the page and return it as a parsed BeautifulSoup object.
    uClient = uReq(myURL)
    pageHTML = uClient.read()
    uClient.close()

    pageSoup = soup(pageHTML, "html.parser")
    return pageSoup
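Since some audits take a long time to come back, a variant of the fetch helper above that sets a timeout may be useful; a slow response then raises an error instead of hanging indefinitely. This is only a sketch using the standard library's `urlopen` (the `fetch_html` name and the 10-second default are my own choices, not part of the original code):

```python
from urllib.request import urlopen
from urllib.error import URLError

def fetch_html(url, timeout=10):
    # Return the raw page bytes, or None if the request fails or times out.
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except URLError:
        return None
```

The caller can then distinguish "page not available yet" (`None`) from a successful fetch and decide whether to retry.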

def fakecheck(usr):
    myURLfc = "https://www.twitteraudit.com/" + usr
    pgSoup = HTML(myURLfc)

    # The audit results live in a <div class="audit"> block.
    foll = pgSoup.find_all("div", {"class": "audit"})

    link = foll[0].div.a["href"]
    real = foll[0].find_all("span", {"class": "real number"})[0]["data-value"]
    fake = foll[0].find_all("span", {"class": "fake number"})[0]["data-value"]
    scr = foll[0].find_all("div", {"class": "score"})[0].div
    scoresent = scr["class"][1]
    # The score is the first 1- to 3-digit number inside the score div.
    score = re.findall(r'\d{1,3}', str(scr))[0]
    return [link, real, fake, scoresent, score]


lis = ["BarackObama","POTUS44","ObamaWhiteHouse","MichelleObama","ObamaFoundation","NSC44","ObamaNews","WhiteHouseCEQ44","IsThatBarrak","obama_barrak","theprezident","barrakubama","BarrakObama","banackkobama","YusssufferObama","barrakisdabomb_","BarrakObmma","fuzzyjellymasta","BarrakObama6","bannalover101","therealbarrak","ObamaBarrak666","barrak_obama"]

for u in lis:
    link, real, fake, scoresent, score = fakecheck(u)

    print ("link : " + link)
    print ("Real : " + real)
    print ("Fake : " + fake)
    print ("Result : " + scoresent)
    print ("Score : " + score)
    print ("=================")

1 Answer:

Answer 0: (score: 0)

I think the problem is that some Twitter IDs have not yet been audited, which is why I was getting an IndexError. Placing the call to fakecheck(u) inside a while True: loop that catches that error keeps checking the site until the audit for that ID has been performed.

I put this code after the definition of lis:

def get_fake_check(n):
    return fakecheck(n)

for u in lis:
    while True:
        try:
            link, real, fake, scoresent, score = get_fake_check(u)
            break
        except IndexError:
            # Audit results are not available yet; keep retrying.
            pass

I am not sure whether there is a way to automate the audit request on the website, but while the query was waiting I manually clicked the "Audit" button for that ID on the site. Once the audit completed, the script continued as usual until all the IDs in the list had been processed.
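One caveat with a bare `while True` retry loop is that it hammers the site as fast as it can and never gives up. A gentler pattern is a bounded retry with a pause between attempts, catching only the IndexError that signals "audit not ready". This is a sketch, not part of the original answer; the `retry` helper and the `flaky` stand-in function below are hypothetical names used for illustration:

```python
import time

def retry(func, attempts=5, delay=0.01):
    # Call func() up to `attempts` times, pausing `delay` seconds between
    # tries; re-raise the IndexError if every attempt fails.
    for i in range(attempts):
        try:
            return func()
        except IndexError:
            if i == attempts - 1:
                raise
            time.sleep(delay)

# Illustration with a stand-in that fails twice before succeeding,
# mimicking an audit that is not ready on the first checks:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IndexError("audit not ready yet")
    return "done"

print(retry(flaky))  # prints "done" after two failed attempts
```

In the script above one would call `retry(lambda: get_fake_check(u))` instead of the open-ended loop, so a permanently broken ID eventually raises instead of stalling the whole run.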