Desired result:
After running docker-compose up, I want the server to start (docker-compose brings up the database and the web service), and then, once the server is up and running, a Python script containing web-scraping code should run exactly once and scrape content into the database. The script uses Django models to store the data in the DB.
The problem:
After running docker-compose up, the database and the server work fine: I can reach the site on localhost and get API responses, but the scraping script never seems to run. I did find a way to run the code via Django's built-in ready() hook (a rough sketch of that attempt is below), but it does not work as expected: it blocks the server from starting until the entire scrape has finished, whereas I would prefer the scraping to run in the background. When I tried running the script as a separate service instead, it did not work either, even with depends_on; the service starts as soon as the web container is running and then fails with "file not found" (a sketch of that service follows the compose file below). What is the correct way to do this?
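For reference, this is roughly what the ready() attempt looked like (a minimal sketch, not my exact code; the app label "scraper" and the import path are assumptions):

# teonite_webscraper/scraper/apps.py (sketch of the ready() attempt)
from django.apps import AppConfig


class ScraperConfig(AppConfig):
    name = 'scraper'

    def ready(self):
        # The import happens here because the models are only usable once the
        # app registry is loaded. scrape.py does its work at module level, so
        # importing it runs the whole scrape synchronously, which is exactly
        # why runserver blocks until scraping is finished.
        from . import scrape  # noqa: F401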
The code:
docker-compose.yml:
version: '3.6'
services:
  db:
    image: postgres:10.1-alpine
    volumes:
      - postgres_data:/var/lib/postgresql/data/
  web:
    build: .
    image: teonite_scraper
    command: bash -c "python /teonite_webscraper/teonite_webscraper/manage.py migrate && python /teonite_webscraper/teonite_webscraper/manage.py runserver 0.0.0.0:8080 && python /teonite_webscraper/teonite_webscraper/scraper/scrape.py"
    volumes:
      - .:/teonite_webscraper
    ports:
      - 8080:8080
    environment:
      - SECRET_KEY=changemeinprod
    depends_on:
      - db
volumes:
  postgres_data:
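The separate-service attempt looked roughly like this (a sketch of what I tried, not the exact file; the service name and command are approximations):

  scraper:
    image: teonite_scraper
    command: python /teonite_webscraper/teonite_webscraper/scraper/scrape.py
    volumes:
      - .:/teonite_webscraper
    environment:
      - SECRET_KEY=changemeinprod
    depends_on:
      - web

As far as I understand, depends_on only waits for the dependent container to be started, not for the application inside it to be ready, which would explain why the script fires before the web and database services are actually usable.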
scrape.py (the scraping script):
import requests
from bs4 import BeautifulSoup
from teonite_webscraper.scraper.helpers import get_links
from teonite_webscraper.scraper.models import Article, Author
import json
import re

print('Starting...')

# For implementation check helpers.py, grabs all the article links from the blog
links = get_links('https://teonite.com/blog/')

# List of objects to batch insert into the DB to save I/Os
objects_to_inject = []

links_in_db = list(Article.objects.all().values_list('article_link', flat=True))
authors_in_db = list(Author.objects.all().values_list('author_stub', flat=True))

for link in links:
    if link not in links_in_db:
        # Grab the article page
        blog_post = requests.get(link)
        # Prepare soup
        soup = BeautifulSoup(blog_post.content, 'lxml')
        # Gets the json with author data from the page meta
        json_element = json.loads(soup.find_all('script')[1].get_text())

        # All of the below could be passed to Article() as parameters, but for clarity
        # I prefer separate lines, and the DB models cannot be accessed outside
        # ready() at this stage anyway, so refactoring into a separate function wouldn't be possible
        post_data = Article()
        post_data.article_link = link
        post_data.article_content = soup.find('section', class_='post-content').get_text()

        # Regex only grabs the last part of the author's URL that contains the "nickname"
        author_stub = re.search(r'\/(\w+\-?_?\.?\w+)\/$', json_element['author']['url']).group(1)

        # Check if the author is already in the DB; if so, assign the key.
        if author_stub in authors_in_db:
            post_data.article_author = Author.objects.get(author_stub=author_stub)
        else:
            # If not, create a new Author row and then assign it.
            new_author = Author(author_fullname=json_element['author']['name'],
                                author_stub=author_stub)
            new_author.save()
            # Unlike links, which are unique, an author might appear many times and we only
            # grab them from the DB once at the beginning, so add it to the checklist here
            # to avoid trying to add the same author multiple times
            authors_in_db.append(author_stub)
            post_data.article_author = new_author

        post_data.article_title = json_element['headline']
        # Append the object to the list and continue
        objects_to_inject.append(post_data)

Article.objects.bulk_create(objects_to_inject)
print('Done collecting data!')