Scraped data does not go into the Postgres database

Time: 2019-07-04 14:16:50

Tags: python django web-scraping

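The scraper should scrape every blog post on every page. The data from the scraper should go into a PostgreSQL database, where the following statistics will be computed: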

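  1. The 10 most common words and their counts, available under the address /stats
  2. The 10 most common words for each author, available under the address /stats/ /
  3. The authors of the posts, as named in the address /stats/ /, available under the address /authors/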

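So far I have focused on the first and second tasks, but I have two problems (each one follows from the other): I don't know how to get the data into the database, and therefore I don't know how to build the "counter" either.

Here is my scraper: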

import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from collections import Counter
import psycopg2
# from sqlalchemy.dialects.postgresql import psycopg2


url = 'https://teonite.com/blog/page/{}/index.html'
all_links = []


headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0'
}
with requests.Session() as s:
    r = s.get('https://teonite.com/blog/')
    soup = bs(r.content, 'lxml')
    article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
    all_links.append(article_links)
    num_pages = int(soup.select_one('.page-number').text.split('/')[1])


    for page in range(2, num_pages + 1):
        r = s.get(url.format(page))
        soup = bs(r.content, 'lxml')
        article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
        all_links.append(article_links)



    all_links = [item for i in all_links for item in i]

    d = webdriver.Chrome(ChromeDriverManager().install())

    contents = []
    authors = []

    for article in all_links:
        d.get(article)
        soup = bs(d.page_source, 'lxml')
        [t.extract() for t in soup(['style', 'script', '[document]', 'head', 'title'])]
        visible_text = soup.getText()
        content = soup.find('section', attrs={'class': 'post-content'})
        contents.append(content)
        author = soup.find('span', attrs={'class': 'author-content'})
        authors.append(author)
        unique_authors = list(set(authors))
        unique_contents = list(set(contents))


        try:
            print(soup.select_one('.post-title').text)
        except:
            print(article)
            print(soup.select_one('h1').text)
            break  # for debugging
    d.quit()

    # POSTGRESQL CONNECTION
    # 1. Connect to local database using psycopg2

    hostname = 'balarama.db.elephantsql.com'
    username = 'user'
    password = 'password'
    database = 'db'

    conn = psycopg2.connect(host='domain.com', user='user',
                            password='password', dbname='db')
    conn.close()

# split() returns list of all the words in the string
# split_it = contents.split()
#
# # Pass the split_it list to instance of Counter class.
# Counter = Counter(split_it)
#
# # most_common() produces k frequently encountered
# # input values and their respective counts.
# most_occur = Counter.most_common(10)
#
# print(most_occur)


Models:

from django.db import models

class author(models.Model):
    author_id = models.CharField(primary_key=True, max_length=50, editable=False)
    author_name = models.CharField(max_length=50)

    class Meta:
        ordering = ['-author_id']
        db_table = 'author'


class stats(models.Model):
    content = models.CharField(max_length=50)
    stats = models.IntegerField()

    class Meta:
        ordering = ['-stats']
        db_table = 'stats'



class authorStats(models.Model):
    author_id = models.CharField(max_length=100)
    content = models.CharField(max_length=100)
    stats = models.IntegerField()

    class Meta:
        ordering = ['stats']
        db_table = 'author_stats'

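2 Answers:

Answer 0: (score: 1)

I think you will find Part 2 of the Django tutorial very handy. That chapter deals with database connections in a Django application, and even uses Postgres as an example.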
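For reference, in a Django project the connection is configured in settings.py rather than opened by hand. A minimal sketch, assuming the placeholder credentials from the question (the host, user, password and database name below are stand-ins, not real values) and the default PostgreSQL port:

```python
# settings.py (fragment): hypothetical values, replace them with your own.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",  # Django's built-in PostgreSQL backend
        "NAME": "db",          # placeholder database name from the question
        "USER": "user",        # placeholder user
        "PASSWORD": "password",  # placeholder password
        "HOST": "balarama.db.elephantsql.com",
        "PORT": "5432",        # assumed: the default PostgreSQL port
    }
}
```

With this in place, `python manage.py makemigrations` followed by `python manage.py migrate` creates the tables for the models. A PostgreSQL driver such as psycopg2 still has to be installed, because Django uses it under the hood.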

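Answer 1: (score: 1)

First, I doubt that you need to use the psycopg2 package directly to write to the database if you are using Django. You probably want to use Django models instead. So this piece of code is redundant: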

# POSTGRESQL CONNECTION
# 1. Connect to local database using psycopg2

hostname = 'balarama.db.elephantsql.com'
username = 'user'
password = 'password'
database = 'db'

conn = psycopg2.connect(host='domain.com', user='user',
                        password='password', dbname='db')
conn.close() 

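So if you use Django models in your application, you can use them to store data in Postgres. Django has comprehensive documentation that is worth a look. For your specific example of saving authors, it might look like this: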

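# author_id is a non-automatic primary key on the model, so it must be supplied explicitly
scraped_author = author(author_id='author-id', author_name='author name')
scraped_author.save()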
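For the "counter" part of the question, the word statistics can be computed in plain Python before anything is saved. A minimal sketch, assuming the scraper yields (author name, post text) pairs, with the text taken from something like `content.get_text()`. The function names `tokenize`, `top_words` and `top_words_per_author`, the letters-only tokenizer and the sample data are illustrative choices, not part of the original code:

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its words (runs of letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(texts, n=10):
    """Return the n most common words across all texts as (word, count) pairs."""
    counter = Counter()
    for text in texts:
        counter.update(tokenize(text))
    return counter.most_common(n)


def top_words_per_author(posts, n=10):
    """Map each author to their n most common words.

    posts is an iterable of (author_name, text) pairs.
    """
    counters = defaultdict(Counter)
    for author_name, text in posts:
        counters[author_name].update(tokenize(text))
    return {name: c.most_common(n) for name, c in counters.items()}


# Hypothetical sample data standing in for the scraped posts.
posts = [
    ("Alice", "Django models make database work simple"),
    ("Bob", "Scraping with requests and Django"),
    ("Alice", "Django Django everywhere"),
]
print(top_words((text for _, text in posts), n=3))
print(top_words_per_author(posts, n=1))
```

Each resulting pair could then be stored with the models from the question, for example `stats(content=word, stats=count).save()`, once the scraper runs inside the Django project.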