Where to put web scraper logic in my application, and how to trigger it when a model object is created - Django

Time: 2019-02-19 18:20:41

Tags: python django web-scraping logic

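I am practicing Django by building a web application: I email it a word, it translates the word (English to Spanish to start with, and vice versa), and then it emails me a few words to learn each day.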

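My problem: I don't know where to put the web scraper code that translates the search term, or how to trigger it when a search term is received so that the result is added to the Result model.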

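Models

I currently have two models. The first holds my search terms and the second holds the translation results. Both inherit from an abstract model with common fields: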

from django.db import models
from django.conf import settings

class CommonInfo(models.Model):
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)

    class Meta:
        abstract = True

class Search(CommonInfo):
    search_term = models.CharField(max_length=100)
    user = models.ForeignKey(
        settings.AUTH_USER_MODEL,
        on_delete=models.SET_NULL,
        null=True
    )

    def __str__(self):
        return self.search_term


class Result(CommonInfo):
    search = models.ForeignKey(
        Search,
        on_delete=models.SET_NULL,
        null=True
    )
    translation = models.CharField(max_length=100)
    example = models.TextField()
    is_english = models.BooleanField(default=True)

    def __str__(self):
        return self.translation

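My view

My view has a single entry point, which receives an HTTP POST request containing the parsed email from the SendGrid parser. It extracts the word to translate from the subject line, adds it to the Search model, and links it to the relevant user: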

from vocab.models import Search

from django.views import View
from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt
from django.utils.decorators import method_decorator
import re
from users.models import CustomUser

@method_decorator(csrf_exempt, name='dispatch')
class Parser(View):

    def post(self, request, *args, **kwargs):
        # pull out the "from" field, e.g. 'Jane Doe <jane@example.com>'
        sender = request.POST.get('from', '')
        # extract the address between the angle brackets
        match = re.search(r"(?<=<).*?(?=>)", sender)
        if match is None:
            return HttpResponse("Could not read the sender address", status=400)
        result_email = match.group(0)
        # look the user up once; reject senders without an account
        user = CustomUser.objects.filter(email=result_email).first()
        if user is None:
            return HttpResponse("You do not have an account, please sign up first", status=401)

        # the subject line holds the word to translate
        subject = request.POST.get('subject', '').strip()
        Search.objects.create(search_term=subject, user=user)
        return HttpResponse("OK")
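
The regex in the view only matches a sender written as `Name <address>`. A minimal sketch of a more forgiving extraction, using `email.utils.parseaddr` from the standard library, is shown below. The helper name `extract_sender_email` is hypothetical, and the sketch assumes the `from` field is an ordinary email header value:

```python
from email.utils import parseaddr


def extract_sender_email(from_header):
    """Return the address part of a 'from' header, or None if there is none."""
    if not from_header:
        return None
    # parseaddr returns a (display_name, address) tuple
    _name, address = parseaddr(from_header)
    # treat anything without an "@" as "no usable address"
    return address if "@" in address else None
```

In the view, a `None` result could be answered with a 400 response before any database lookup is attempted.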

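Web scraper

I have written an outline of a web scraper that should take the searched word, build a URL from it (pointing to the SpanishDict website), and then use BeautifulSoup to extract the translation and an example sentence: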

from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

#creates a url from the word
def url_creator(word):
    return 'https://www.spanishdict.com/translate/' + str(word).lower()

# get request using the url
def simple_get(url):

    try: 
        # a timeout keeps a slow or unresponsive site from hanging the caller
        with closing(get(url, stream=True, timeout=10)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as error:
        log_error('Error during request to %s : %s ' % (url, error))
        return None

# checks the get request response is HTML
def is_good_response(resp):
    # use .get() so a missing Content-Type header does not raise KeyError
    content_type = resp.headers.get('Content-Type', '').lower()
    return resp.status_code == 200 and 'html' in content_type

# logs an error if there are any issues
def log_error(error):
    print(error)

# creates a beautiful soup object from the raw html
def bs_html_maker(raw_html):
    return BeautifulSoup(raw_html, 'html.parser')

# finds the translation and example for the word being searched
def first_definition_finder(bs_html):
    return bs_html.find(class_="dictionary-neodict-indent-1")

# works out the language being searched (inferring it from the results of the get request)
def language_finder(bs_html):
    if bs_html.find(id="headword-and-quickdefs-es"):
        return False
    elif bs_html.find(id="headword-and-quickdefs-en"):
        return True
    else:
        raise Exception("The word you searched didn't return anything, check your spelling")

# returns  the translation, the example sentences and what language the search was in in a dictionary
def result_outputter(bs_html):
    translation_dictionary = {}
    is_english = language_finder(bs_html)
    definition_block = first_definition_finder(bs_html)
    definition = definition_block.find(class_="dictionary-neodict-translation-translation").string
    examples = definition_block.find(class_="dictionary-neodict-example").strings
    example_string = "%s - %s" % (next(examples), next(examples))
    translation_dictionary["definition"] = definition
    translation_dictionary["example"] = example_string
    translation_dictionary["is_english"] = is_english
    return translation_dictionary

# pulls it all together in one method which will ideally be called whenever a search is saved to the database and the results can then be used to add the translation to the database
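def vocab_translator(word):
    url = url_creator(word)
    raw_html = simple_get(url)
    if raw_html is None:
        # simple_get returns None on a failed request or a non-HTML response
        raise ValueError("No HTML returned for %s" % url)
    bs_html = bs_html_maker(raw_html)
    return result_outputter(bs_html)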
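
Since the word comes from an email subject line, it may contain surrounding whitespace, spaces, accented letters, or characters such as `?` and `#` that have a special meaning in a URL. A small sketch of a URL builder that percent-encodes the word with `urllib.parse.quote` follows. The name `build_translate_url` is hypothetical, and whether SpanishDict resolves every encoded path the same way is an assumption:

```python
from urllib.parse import quote

BASE_URL = "https://www.spanishdict.com/translate/"


def build_translate_url(word):
    """Build the lookup URL for a word, percent-encoding unsafe characters."""
    # trim whitespace left over from the subject line, then lowercase
    cleaned = str(word).strip().lower()
    # quote() encodes spaces, accents, "?" and "#" as UTF-8 percent escapes
    return BASE_URL + quote(cleaned)
```

For example, `build_translate_url("buenos días")` encodes the space and the accented letter instead of passing them through raw.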

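My problem: I don't know where to put the web scraper code that translates the search term, or how to trigger it when a search term is received so that the result is added to the Result model.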
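
One possible arrangement, sketched under assumptions rather than tested in this project: keep the scraper in its own module (for example a hypothetical `vocab/scraper.py`) and call it from a `post_save` signal receiver for `Search`. The runnable part below is only the helper that maps the scraper's dictionary onto the `Result` fields; `result_fields` is a hypothetical name, and the Django wiring is shown as comments because it only runs inside a configured project:

```python
def result_fields(translation_dictionary):
    """Map the scraper's output to keyword arguments for the Result model."""
    return {
        # the scraper uses the key "definition"; the model field is "translation"
        # and is limited to 100 characters (max_length=100)
        "translation": translation_dictionary["definition"][:100],
        "example": translation_dictionary["example"],
        "is_english": translation_dictionary["is_english"],
    }


# Hypothetical wiring in vocab/signals.py (needs a configured Django project):
#
# from django.db.models.signals import post_save
# from django.dispatch import receiver
# from vocab.models import Search, Result
# from vocab.scraper import vocab_translator  # assumed module location
#
# @receiver(post_save, sender=Search)
# def translate_search(sender, instance, created, **kwargs):
#     if created:  # only on the first save, not on later updates
#         fields = result_fields(vocab_translator(instance.search_term))
#         Result.objects.create(search=instance, **fields)
```

The module holding the receiver has to be imported at startup, typically from the app config's `ready()` method, or the signal never connects. Because the scrape would run inside the request that saves the `Search`, a slow response from SpanishDict would delay the reply to SendGrid; handing the work to a background task queue is a common alternative.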

Any help would be much appreciated. I am currently learning Django and welcome any feedback, so comments on the code would also be very useful.

0 answers:

No answers