Can't execute my script the right way using threads

Date: 2018-09-05 18:08:47

Tags: python python-3.x web-scraping lxml python-multithreading

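I've tried to create a scraper using Python in combination with Thread to cut the execution time. The scraper is supposed to parse all the shop names along with their phone numbers while traversing multiple pages.

The script runs without any issues. As I'm very new to working with Thread, I can hardly tell whether I'm using it the right way.

This is what I've tried so far: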

import requests 
from lxml import html
import threading
from urllib.parse import urljoin 

link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"

def get_information(url):
    for pagelink in [url.format(page) for page in range(20)]:
        response = requests.get(pagelink).text
        tree = html.fromstring(response)
        for title in tree.cssselect("div.info"):
            name = title.cssselect("a.business-name span[itemprop=name]")[0].text
            try:
                phone = title.cssselect("div[itemprop=telephone]")[0].text
            except Exception: phone = ""
            print(f'{name} {phone}')

thread = threading.Thread(target=get_information, args=(link,))

thread.start()
thread.join()
  

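The problem is that I can't find any difference in time or performance whether I run the script with Thread or without it. If I'm going about it the wrong way, how can I execute the above script using Thread?

EDIT: I've tried to change the logic to use multiple links. Is it possible now? Thanks in advance.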

1 answer:

Answer 0 (score: 2)

You can use threading to scrape several pages in parallel. A single thread that loops over every page still downloads them one after another, so it runs no faster than calling the function directly. Start one thread per page instead:

import requests
from lxml import html
import threading

link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"

def get_information(url):
    # Fetch and parse a single results page.
    response = requests.get(url).text
    tree = html.fromstring(response)
    for title in tree.cssselect("div.info"):
        name = title.cssselect("a.business-name span[itemprop=name]")[0].text
        try:
            phone = title.cssselect("div[itemprop=telephone]")[0].text
        except Exception: phone = ""
        print(f'{name} {phone}')

# Start one thread per page so that the requests overlap.
threads = []
for url in [link.format(page) for page in range(20)]:
    thread = threading.Thread(target=get_information, args=(url,))
    threads.append(thread)
    thread.start()
# Wait for every page to finish before exiting.
for thread in threads:
    thread.join()

Note that the order of the data is not preserved: with threads, lines from different pages are printed in whatever order the responses arrive. Scraping the pages one by one, by contrast, would print the data in page order:

page_1_name_1
page_1_name_2
page_1_name_3
page_2_name_1
page_2_name_2
page_2_name_3
page_3_name_1
page_3_name_2
page_3_name_3
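
If page order matters, one option is to have each thread return its rows instead of printing them, and to collect the results in input order. The following is only a minimal sketch under stated assumptions: `fetch_page` and `scrape_in_order` are hypothetical names, and `fetch_page` replaces the real `requests`/`lxml` step with a short random sleep so the example runs without network access. It relies on `ThreadPoolExecutor.map` from the standard library, which yields results in the order of its inputs regardless of which worker finishes first.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_page(page):
    """Hypothetical stand-in for the real request-and-parse step."""
    # Sleep for a short random time so that pages finish out of order,
    # much as real HTTP responses would.
    time.sleep(random.uniform(0.01, 0.05))
    return [f"page_{page}_name_{n}" for n in range(1, 4)]


def scrape_in_order(pages, max_workers=5):
    """Fetch pages concurrently, but return their rows in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # not in the order in which the worker threads complete.
        results = pool.map(fetch_page, pages)
        return [row for rows in results for row in rows]


if __name__ == "__main__":
    for row in scrape_in_order(range(1, 4)):
        print(row)
```

In a real scraper, the body of `fetch_page` would be the `get_information` logic with the `print` call replaced by appending to a list that is returned. The `max_workers` cap also keeps the number of simultaneous requests to the site bounded.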