Question

I use the following code to scrape ad properties with BS4:

import re
import bs4
import requests

# get_ad_page_urls collects all ad URLs displayed on a page
def get_ad_page_urls(link):
    container = BS4_main(link)  # BS4_main parses the link and returns the "container" object
    return [a.get("href") for a in container.find_all("a", href=re.compile(r"^(/inmueble/)((?!:).)*$"))]

# get_ad_data obtains data from each ad
def get_ad_data(ad_page_url):
    ad_data={}
    response=requests.get(root_url+ad_page_url)
    soup = bs4.BeautifulSoup(response.content, 'lxml')

    <collecting data code here>

    return ad_data

This works fine. With the following multiprocessing code, I scrape all the ads:

def show_ad_data(options):
    pool=Pool(options)
    for link in page_link_list:
        ad_page_urls = get_ad_page_urls(link)
        results=pool.map(get_ad_data, ad_page_urls)

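Now the problem:

Certain ads should be skipped. These ads display a specific text by which they can be identified. I am new to using def functions, and I don't know how to tell the code to skip to the next ad_page_url.

I think the "skipping" code should go between soup = bs4.BeautifulSoup(response.content, 'lxml') and <collecting data code here>. Something like: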

# "skipping" semi-code
for text in soup:
    if 'specific text' in text:
        continue

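But I am not sure whether continue can be applied to the iteration from inside a def function.

How should I modify the code so that an ad is skipped when the specific text appears on its page?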

Answer 1

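Yes. If the skip condition in the if statement is met, continue moves on to the next iteration; pass has the same effect here because the collecting code sits in the else branch: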

def get_ad_data(ad_page_url):
    ad_data={}
    response=requests.get(root_url+ad_page_url)
    soup = bs4.BeautifulSoup(response.content, 'lxml')

    for text in soup:
        if 'specific text' in text:
            continue  # or pass
        else:
            <collecting data code here>

    return ad_data
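
One caveat: continue only affects a loop written inside the same function, and Python raises a SyntaxError if it is used outside a loop. Since pool.map calls get_ad_data once per URL, another way to "skip" an ad is to return early from the function (for example with None) and filter those entries out of the list that pool.map returns. Below is a minimal sketch of that idea, not a definitive implementation. The names SKIP_MARKER, extract_ad_data and keep_collected are hypothetical, the "length" field is only a stand-in for the real collecting code, and page_text is assumed to be what soup.get_text() would give you inside get_ad_data.

```python
SKIP_MARKER = "specific text"  # placeholder for the text that identifies ads to skip


def extract_ad_data(page_text):
    """Return a dict for a normal ad, or None when the ad should be skipped.

    page_text stands in for soup.get_text() in get_ad_data (assumption).
    """
    if SKIP_MARKER in page_text:
        return None  # leaving the function early is the "skip"
    ad_data = {}
    # <collecting data code here>
    ad_data["length"] = len(page_text)  # hypothetical stand-in field
    return ad_data


def keep_collected(results):
    """Drop the None entries that skipped ads leave behind."""
    return [ad for ad in results if ad is not None]


# Stand-in page texts; in the real code these come from the downloaded ad pages.
pages = ["nice flat, 3 rooms", "this ad has specific text in it", "house"]
results = [extract_ad_data(p) for p in pages]  # pool.map(...) returns a list like this
kept = keep_collected(results)
```

In show_ad_data, the same filter could be applied to the list returned by pool.map(get_ad_data, ad_page_urls), so skipped ads never reach the rest of the program.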

Python: conditionally skipping URLs during scraping
