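My crawler extracts links from a page only when the link text contains a given string, and I write the output to an HTML file. It works, but I would like to add the full link text next to each link, like this: "Junior Java Developer - https://www.jobs.cz/junior-developer/". How can I do that?
Thanks.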
import requests
from bs4 import BeautifulSoup
import re

def jobs_crawler(max_pages):
    page = 1
    file_name = 'links.html'
    while page < max_pages:
        url = 'https://www.jobs.cz/prace/praha/?field%5B%5D=200900011&field%5B%5D=200900012&field%5B%5D=200900013&page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        page += 1
        file = open(file_name,'w')
        for link in soup.find_all('a', {'class': 'search-list__main-info__title__link'}, text=re.compile('IT', re.IGNORECASE)):
            href = link.get('href') + '\n'
            file.write('<a href="' + href + '">'+ 'LINK TEXT HERE' + '</a>' + '<br />')
            print(href)
        file.close()
        print('Saved to %s' % file_name)

jobs_crawler(5)
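Answer 0 (score: 1)
This should help. `link.text` holds the visible text of the matched `<a>` tag, so it can be used in place of the 'LINK TEXT HERE' placeholder: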
file.write('''<a href="{0}">{1}</a><br />'''.format(link.get('href'), link.text ))
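If you want the exact "text - URL" layout from the question, a small helper could build each line before it is written. This is only a sketch: `format_link` is a hypothetical name, not part of BeautifulSoup, and it assumes you pass it the values you already have (`link.text` and `link.get('href')`). It also escapes the values with the standard-library `html.escape`, since link text can contain characters such as `&` or `<`.

```python
from html import escape


def format_link(text, href):
    """Hypothetical helper: build one HTML line with the link text next to its URL."""
    href = href.strip()            # drop stray newlines or spaces around the URL
    text = " ".join(text.split())  # collapse runs of whitespace in the link text
    return '<a href="{0}">{1} - {2}</a><br />\n'.format(
        escape(href, quote=True),  # safe inside the href attribute
        escape(text),              # safe as element content
        escape(href),
    )


# Sample values standing in for link.text and link.get('href')
line = format_link(" Junior Java Developer\n", "https://www.jobs.cz/junior-developer/\n")
print(line)
```

Inside the loop this would be used as `file.write(format_link(link.text, link.get('href')))`.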
Answer 1 (score: 0)
Try this:
href = link.get('href') + '\n'
txt = link.get_text()  # returns the link's visible text; no argument is needed
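One thing to watch with `get_text()`: its first positional argument is a separator that is placed between the text fragments, not an attribute name. The sketch below runs against a made-up HTML snippet (the markup and the `job` class are assumptions for illustration, not taken from jobs.cz) to show the difference.

```python
from bs4 import BeautifulSoup

# Made-up markup: an anchor whose text is split across a nested tag
html = ('<a class="job" href="https://www.jobs.cz/junior-developer/">'
        ' Junior <b>Java</b> Developer </a>')

soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a')

href = link.get('href')
# Join the text fragments with a space and strip whitespace from each one
text = link.get_text(' ', strip=True)
# Passing 'href' here uses it as the separator between fragments
odd = link.get_text('href')

print('{0} - {1}'.format(text, href))
print(repr(odd))
```

For a plain anchor with no nested tags, `link.get_text()` and `link.text` give the same string.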