How can I extract URLs by their title with Beautiful Soup?

Asked: 2017-04-13 17:53:06

Tags: python python-3.x pandas beautifulsoup lxml

I have a list of links I am interested in:

lis = ['https://example1.com', 'https://example2.com', ..., 'https://exampleN.com']

Each of those pages contains several URLs, and I want to extract some specific internal ones. The interesting URLs have this form:

<a href="https://interesting-linkN.com" target="_blank" title="Url to news"> News JPG </a>

How can I visit every element of lis and return, in a pandas DataFrame, the visited link together with only the URLs whose title is Url to news, like this (**):

visited_link, extracted_link
https://www.example1.com, NaN
https://www.example2.com, NaN
https://www.example3.com, https://interesting-linkN.com

Note that for any element of lis that contains no <a href="https://interesting-linkN.com" target="_blank" title="Url to news"> News JPG </a>, I want to return NaN.

I tried this and:

import requests
import numpy as np
from lxml import html

def extract_jpg_url(a_link):
    page = requests.get(a_link)
    tree = html.fromstring(page.content)
    # here is the problem... not all interesting links have this xpath; how can I select by title?
    # (apparently all the jpg urls have this form: title="Url to news")
    interesting_link = tree.xpath(".//*[@id='object']//tbody//tr//td//span//a/@href")
    if len(interesting_link) == 0:
        return np.nan
    else:
        return interesting_link
then:

    df['news_link'] = df['urls_from_lis'].apply(extract_jpg_url)

However, the latter approach takes too long, and not all the elements of lis match the given xpath (see the comment in the code). How can I get (**)?
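Since all the interesting anchors apparently share title="Url to news", the XPath can filter on that attribute directly instead of relying on a fixed element path. A minimal sketch against an inline snippet (the URL is the placeholder from the question):

```python
from lxml import html

# A fragment containing one matching anchor (placeholder URL from the question)
snippet = (
    '<div><p>other content</p>'
    '<a href="https://interesting-linkN.com" target="_blank" '
    'title="Url to news"> News JPG </a></div>'
)
tree = html.fromstring(snippet)

# Select href attributes of <a> elements by their title attribute
links = tree.xpath('//a[@title="Url to news"]/@href')
print(links)  # ['https://interesting-linkN.com']
```

The same expression works on `tree = html.fromstring(page.content)` for a fetched page, returning an empty list when no anchor matches.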

1 answer:

Answer 0 (score: 1)

This won't return exactly what you want (NaN), but it should give you a general idea of how to do this simply and efficiently.

from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool
import requests

def extract_urls(link):
    r = requests.get(link)
    soup = BeautifulSoup(r.text, "html.parser")
    # Match <a> tags by their title attribute
    results = soup.find_all('a', {'title': 'Url to news'})
    results = [x['href'] for x in results]
    return (link, results)

links = [
    "https://example1.com",
    "https://example2.com",
    "https://exampleN.com",
]

# Fetch pages concurrently with a pool of 10 threads
p = ThreadPool(10)
r = p.map(extract_urls, links)

for url, results in r:
    print(url, results)
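To get the DataFrame with NaN that the question asks for, the (link, results) pairs can be fed into pandas, replacing empty result lists with NaN. A minimal sketch, assuming a result list in the same shape as r above (the URLs are the question's placeholders, and only the first match per page is kept):

```python
import numpy as np
import pandas as pd

# Example output in the same shape as r above (placeholder URLs)
r = [
    ("https://www.example1.com", []),
    ("https://www.example2.com", []),
    ("https://www.example3.com", ["https://interesting-linkN.com"]),
]

# One row per visited link; an empty result list becomes NaN,
# otherwise the first matching URL is kept
df = pd.DataFrame(
    [(url, results[0] if results else np.nan) for url, results in r],
    columns=["visited_link", "extracted_link"],
)
print(df)
```

If a page can contain several matching anchors and all of them matter, keep the whole list in the column instead of `results[0]`.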