I have a list of links I am interested in:
lis = ['https://example1.com', 'https://example2.com', ..., 'https://exampleN.com']
Inside the pages behind these links there are certain internal URLs I want to extract. Such URLs look like this:
<a href="https://interesting-linkN.com" target="_blank" title="Url to news"> News JPG </a>
How can I visit every element of lis and return, in a pandas DataFrame, the visited link along with only the URLs whose title is Url to news? Something like this (**):
visited_link, extracted_link
https://www.example1.com, NaN
https://www.example2.com, NaN
https://www.example3.com, https://interesting-linkN.com
Note that for the elements of lis that do not contain any <a href="https://interesting-linkN.com" target="_blank" title="Url to news"> News JPG </a>, I want to return NaN.
I tried this and:
import requests
from lxml import html

def extract_jpg_url(a_link):
    page = requests.get(a_link)
    tree = html.fromstring(page.content)
    # here is the problem... not all interesting links have this xpath, how can I select by title?
    # (apparently all the jpg urls have this form: title="Url to news")
    interesting_link = tree.xpath(".//*[@id='object']//tbody//tr//td//span//a/@href")
    if len(interesting_link) == 0:
        return 'NaN'
    else:
        return 'image link ', interesting_link
then:
df['news_link'] = df['urls_from_lis'].apply(extract_jpg_url)
However, this approach takes too long, and not all elements of lis match the given XPath (see the comment in the code). How can I get (**)?
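To address the comment in the code above: instead of a brittle positional XPath, you can match `<a>` tags directly on their title attribute. A minimal sketch, using a made-up HTML snippet in place of a real fetched page:

```python
from lxml import html

# Hypothetical sample page content; in practice this would be page.content
sample = """
<div id="object">
  <a href="https://interesting-linkN.com" target="_blank" title="Url to news"> News JPG </a>
  <a href="https://other.com" title="Something else">Other</a>
</div>
"""

tree = html.fromstring(sample)
# Select the href of any <a> whose title attribute is exactly "Url to news"
links = tree.xpath("//a[@title='Url to news']/@href")
print(links)
```

This predicate works regardless of where the tag sits in the page, so pages with a different layout still match.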
Answer 0 (score: 1)
This will not return exactly what you want (the NaN part), but it should give you a general idea of how to do this simply and efficiently.
from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool
import requests

def extract_urls(link):
    r = requests.get(link)
    html = r.text
    soup = BeautifulSoup(html, "html.parser")
    # find every <a> tag whose title attribute is "Url to news"
    results = soup.findAll('a', {'title': 'Url to news'})
    results = [x['href'] for x in results]
    return (link, results)

links = [
    "https://example1.com",
    "https://example2.com",
    "https://exampleN.com",
]

# fetch up to 10 pages concurrently
p = ThreadPool(10)
r = p.map(extract_urls, links)

for url, results in r:
    print(url, results)
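To get the NaN part from the question as well, the `(link, results)` pairs can be turned into the requested DataFrame. A sketch, assuming the same output shape as `p.map` above and hard-coded sample results in place of real fetches (it keeps only the first matching URL per page, since the desired output has one extracted link per row):

```python
import pandas as pd

# Hypothetical sample of what p.map(extract_urls, links) might return
r = [
    ("https://example1.com", []),
    ("https://example3.com", ["https://interesting-linkN.com"]),
]

df = pd.DataFrame(
    [(url, results[0] if results else float("nan")) for url, results in r],
    columns=["visited_link", "extracted_link"],
)
print(df)
```

Pages whose result list is empty get `NaN` in the `extracted_link` column, matching the (**) table in the question.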