This is my first attempt at using programming for something useful, so please bear with me. Constructive feedback is much appreciated :)
I am building a database of all press releases from the European Parliament. So far I have built a scraper that retrieves the data I want from one specific URL. However, after reading and watching several tutorials, I still cannot figure out how to create a list of the URLs of all press releases on this particular site.
Maybe it has to do with how the website is built, or I am (probably) just missing something obvious that an experienced programmer would spot right away, but I really don't know how to proceed from here.
This is the starting URL: http://www.europarl.europa.eu/news/en/press-room
This is my code:
import re
import time
from random import randint
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

links = []  # Until now I have just manually pasted a few links
# into this list, but I need it to contain all the URLs to scrape

# Function for removing html tags from text
TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

# Regex to match dates with pattern DD-MM-YYYY
date_match = re.compile(r'\d\d-\d\d-\d\d\d\d')

# Output file the loop writes to (the original snippet did not show where f was opened)
f = open("press_releases.csv", "w", encoding="utf-8")

# For-loop to scrape variables from site
for link in links:
    # Opening up connection and grabbing page
    uClient = uReq(link)
    # Saves content of page in new variable (still in HTML!!)
    page_html = uClient.read()
    # Close connection
    uClient.close()
    # Parsing page with soup
    page_soup = soup(page_html, "html.parser")
    # Grabs page
    pr_container = page_soup.findAll("div", {"id": "website"})
    # Scrape date
    date_container = pr_container[0].time
    date = date_container.text
    date = date_match.search(date)
    date = date.group()
    # Scrape title
    title = page_soup.h1.text
    title_clean = title.replace("\n", " ")
    title_clean = title_clean.replace("\xa0", "")
    title_clean = ' '.join(title_clean.split())
    title = title_clean
    # Scrape institutions involved
    type_of_question_container = pr_container[0].findAll("div", {"class": "ep_subtitle"})
    text = type_of_question_container[0].text
    question_clean = text.replace("\n", " ")
    question_clean = question_clean.replace("\xa0", " ")
    question_clean = re.sub(r"\d+", "", question_clean)  # Redundant?
    question_clean = question_clean.replace("-", "")
    question_clean = question_clean.replace(":", "")
    question_clean = question_clean.replace("Press Releases", " ")
    question_clean = ' '.join(question_clean.split())
    institutions_mentioned = question_clean
    # Scrape text
    text_container = pr_container[0].findAll("div", {"class": "ep-a_text"})
    text_with_tags = str(text_container)
    text_clean = remove_tags(text_with_tags)
    text_clean = text_clean.replace("\n", " ")
    text_clean = text_clean.replace(",", " ")  # Removing commas to avoid trouble with .csv-format later on
    text_clean = text_clean.replace("\xa0", " ")
    text_clean = ' '.join(text_clean.split())
    # Calculate word count
    word_count = len(text_clean.split())
    word_count = str(word_count)
    print("Finished scraping: " + link)
    time.sleep(randint(1, 5))
    f.write(date + "," + title + "," + institutions_mentioned + "," + word_count + "," + text_clean + "\n")

f.close()
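(Aside: rather than stripping commas out of the text to keep the CSV valid, the standard csv module can quote fields automatically; a minimal sketch, with an illustrative filename and example values:)

import csv

# Minimal sketch: csv.writer quotes fields containing commas, so the scraped text
# can keep them. "press_releases.csv" is just an illustrative filename.
with open("press_releases.csv", "w", newline="", encoding="utf-8") as out_file:
    writer = csv.writer(out_file)
    writer.writerow(["date", "title", "institutions", "word_count", "text"])
    # Example row; inside the scraping loop this call would replace the manual f.write(...)
    writer.writerow(["01-01-2017", "Example title", "Plenary session", "42",
                     "Body text, commas included, no escaping needed."])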
Answer 0 (score: 1)
Here is a simple way to get the required list of links with python-requests and lxml:
from lxml import html
import requests

url = "http://www.europarl.europa.eu/news/en/press-room/page/"
list_of_links = []

for page in range(10):
    r = requests.get(url + str(page))
    source = r.content
    page_source = html.fromstring(source)
    list_of_links.extend(page_source.xpath('//a[@title="Read more"]/@href'))

print(list_of_links)
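A possible follow-up (not part of the original answer) is to de-duplicate the collected links and save them so the scraping loop from the question can read them back later; this continues from the snippet above:

# De-duplicate and persist the collected links (filename is illustrative)
unique_links = sorted(set(list_of_links))
with open("press_release_links.txt", "w", encoding="utf-8") as link_file:
    link_file.write("\n".join(unique_links))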
Answer 1 (score: 1)
You can get the links in just six lines of code using requests and BeautifulSoup. Although this script does roughly the same as Sir Andersson's, the library and usage applied here are slightly different.
import requests ; from bs4 import BeautifulSoup

base_url = "http://www.europarl.europa.eu/news/en/press-room/page/{}"
for url in [base_url.format(page) for page in range(10)]:
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    for link in soup.select('[title="Read more"]'):
        print(link['href'])
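If you want the links in a list (as in the question) rather than printed, and want to be safe in case the hrefs are relative, here is a small variation under the same assumptions about the page markup:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = "http://www.europarl.europa.eu/news/en/press-room/page/{}"
collected_links = []
for url in [base_url.format(page) for page in range(10)]:
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    for link in soup.select('[title="Read more"]'):
        # urljoin leaves absolute hrefs untouched and resolves relative ones
        collected_links.append(urljoin(url, link['href']))
print(len(collected_links), "links collected")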
Answer 2 (score: 0)
EDIT: The first 15 URLs can be obtained without using the selenium module.
You can't get the press release URLs with urllib.request (which I assume is what you are using), because this site's content is loaded dynamically.
You can try the selenium module instead.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('http://www.europarl.europa.eu/news/en/press-room')

# Click "Load More", repeat these as you like
WebDriverWait(driver, 50).until(EC.visibility_of_element_located((By.ID, "continuesLoading_button")))
driver.find_element_by_id("continuesLoading_button").click()

# Get urls
soup = BeautifulSoup(driver.page_source, "html.parser")
urls = [a["href"] for a in soup.select(".ep_gridrow-content .ep_title a")]
Answer 3 (score: 0)
You can read the official BeautifulSoup documentation to get better at scraping. You should also have a look at Scrapy.
Here is a simple snippet to scrape the required links from that page. I use the Requests library in the example below. Let me know if you have any other questions.
Note that this script does not click "Load More" and therefore does not load the additional releases.
I'll leave that part to you ;) (hint: use Selenium or Scrapy; see the Scrapy sketch after the snippet below)
import requests
from bs4 import BeautifulSoup

def scrape_press(url):
    page = requests.get(url)
    if page.status_code == 200:
        urls = list()
        soup = BeautifulSoup(page.content, "html.parser")
        body = soup.find_all("h3", {"class": ["ep-a_heading", "ep-layout_level2"]})
        for b in body:
            links = b.find_all("a", {"title": "Read more"})
            if len(links) == 1:
                link = links[0]["href"]
                urls.append(link)
        # Printing the scraped links
        for _ in urls:
            print(_)
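Since Scrapy is only hinted at above and not shown, here is a minimal spider sketch; the pagination URL and the "Read more" selector are assumptions carried over from the other answers, not something verified against the live site:

import scrapy

class PressReleaseLinkSpider(scrapy.Spider):
    name = "europarl_press_links"
    # Assumption: the same paginated listing used in the other answers
    start_urls = [
        "http://www.europarl.europa.eu/news/en/press-room/page/{}".format(page)
        for page in range(10)
    ]

    def parse(self, response):
        # Assumption: listing pages expose the same 'Read more' anchors
        for href in response.css('a[title="Read more"]::attr(href)').extract():
            yield {"url": response.urljoin(href)}

You can run a single-file spider like this with scrapy runspider spider_file.py -o links.json to dump the collected links as JSON.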
Note: You should always read a website's terms and conditions before scraping any data.