How do I scrape and save data in descending order, e.g. -created

Date: 2016-06-03 19:30:56

Tags: python django database web-scraping beautifulsoup

I am now able to scrape data and save it to my database. The problem is that when I display it, the oldest entries come first. I tried this:

def scrape_and_store_world():
    url = 'http://www.worldstarhiphop.com'
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'html5lib')
    titles = soup.find_all('section', 'box')[:9]
    entries = [
        {'href': url + box.a.get('href'),
         'src': box.img.get('src'),
         'text': box.strong.a.text} 
        for box in titles
    ]

    for entry in entries.reverse():
        post = Post()
        post.title = entry['text']
        post.image_url = entry['src']
        post.status = 'published'
        post.save()
    return entries

But I got this error:

'NoneType' object is not iterable

I also tried this:

def scrape_and_store_world():
    url = 'http://www.worldstarhiphop.com'
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'html5lib')
    titles = soup.find_all('section', 'box')[:9]
    entries = [
        {'href': url + box.a.get('href'),
         'src': box.img.get('src'),
         'text': box.strong.a.text} 
        for box in titles
    ]

    for entry in entries:
        post = Post()
        post.title = entry['text']
        post.image_url = entry['src']
        post.status = 'published'
        ordering = ('-publish',)
        post.save()
    return entries

Nothing changed. I also tried this:

def scrape_and_store_world():
    url = 'http://www.worldstarhiphop.com'
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'html5lib')
    titles = soup.find_all('section', 'box')[:9]
    entries = [
        {'href': url + box.a.get('href'),
         'src': box.img.get('src'),
         'text': box.strong.a.text} 
        for box in titles
    ]

    for entry in entries.reverse():
        post = Post()
        post.title = entry['text']
        post.image_url = entry['src']
        post.status = 'published'
        post.ordering = ['id']
        post.save()
    return entries

None of these approaches worked. I think it has something to do with this part:

for entry in entries:
    post = Post()
    post.title = entry['text']
    post.image_url = entry['src']
    post.status = 'published'
    post.save()

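I say that because I have an exact duplicate of this function, minus the "for entry in entries" loop above, and its ordering is fine. It looks like this: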

def scrape_world():
    url = 'http://www.worldstarhiphop.com'
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'html5lib')
    titles = soup.find_all('section', 'box')[:9]
    entries = [
        {'href': url + box.a.get('href'),
         'src': box.img.get('src'),
         'text': box.strong.a.text} 
        for box in titles
    ]

    return entries

How can I adjust the code so that the posts are stored and displayed in the right order?

Edit: my models.py

class Post(models.Model):

  STATUS_CHOICES = (
     ('draft', 'Draft'),
     ('published', 'Published'),
  )
  title = models.CharField(max_length=250, unique=True)
  slug = models.SlugField(max_length=250,
                          unique_for_date='publish')
  image = models.ImageField(upload_to=upload_location,
                            null=True,
                            blank=True,
                            height_field='height_field',
                            width_field='width_field')
  image_url = models.CharField(max_length=500,
                               null=True,
                               blank=True,
                               )
  height_field = models.IntegerField(default=0,
                                     null=True,
                                     blank=True,
                                     )
  width_field = models.IntegerField(default=0,
                                    null=True,
                                    blank=True,
                                    )
  author = models.ForeignKey(User,
                             related_name='blog_posts',
                             null=True,
                             blank=True,)
  body = models.TextField(null=True, blank=True,)
  publish = models.DateTimeField(default=timezone.now)
  created = models.DateTimeField(auto_now_add=True)
  updated = models.DateTimeField(auto_now=True)
  status = models.CharField(max_length=10,
                            choices=STATUS_CHOICES,
                            default='draft')
  video = models.BooleanField(default=False)
  video_path = models.CharField(max_length=320,
                                null=True,
                                blank=True,)

  class Meta:
      ordering = ('-publish',)

  def __str__(self):
      return self.title

  def get_absolute_url(self):
      return reverse('blog:post_detail', kwargs={"slug": self.slug})

  objects = models.Manager() # The default manager.
  published = PublishedManager() # Our custom manager.
  tags = TaggableManager(blank=True)

2 answers:

Answer 0: (score: 0)

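Going by the question's tags, you are using Django, and I am also assuming that Post is a Django model. If that is correct and you want a default ordering whenever you query Post instances, you have to define it in the model class with the Meta option ordering. Something like: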

class Post(models.Model):
    class Meta:
        ordering = ['-id']

    # Field definitions...

You could also consider adding a DateTimeField with the auto_now_add option enabled and ordering by that field.

Answer 1: (score: 0)

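Calling __reversed__() (which is what the built-in reversed() uses) worked.

list.reverse() reverses the list in place and returns None, so there is nothing to iterate over.

So my code went from this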

def scrape_and_store_world():
    url = 'http://www.worldstarhiphop.com'
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'html5lib')
    titles = soup.find_all('section', 'box')[:9]
    entries = [{'href': url + box.a.get('href'),
                'src': box.img.get('src'),
                'text': box.strong.a.text,
                } for box in titles]

    for entry in entries.reverse():
        post = Post()
        post.title = entry['text']
        post.image_url = entry['src']
        post.status = 'published'
        post.save()
    return entries

to this

def scrape_and_store_world():
    url = 'http://www.worldstarhiphop.com'
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'html5lib')
    titles = soup.find_all('section', 'box')[:9]
    entries = [{'href': url + box.a.get('href'),
                'src': box.img.get('src'),
                'text': box.strong.a.text,
                } for box in titles]

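    # list.reverse() returns None, so iterate over reversed(entries) instead.
    # reversed() calls entries.__reversed__() and leaves entries unchanged.
    for entry in reversed(entries):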
        post = Post()
        post.title = entry['text']
        post.image_url = entry['src']
        post.status = 'published'
        post.save()
    return entries
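For anyone wondering why the first version fails, here is a small plain-Python sketch of the difference between the two calls. The sample entries are invented stand-ins for the scraped dictionaries.

```python
# Made-up stand-ins for the scraped entries, in page order.
entries = [{'text': 'top story'}, {'text': 'middle story'}, {'text': 'bottom story'}]

# list.reverse() reverses the list in place and returns None,
# so "for entry in some_list.reverse()" tries to loop over None.
scratch = list(entries)          # copy, so entries itself stays untouched
result = scratch.reverse()       # result is None; scratch is now reversed

# reversed() returns an iterator and leaves the original list as it was.
save_order = [entry['text'] for entry in reversed(entries)]

# Slicing with [::-1] is another option; it builds a new reversed list.
sliced_order = [entry['text'] for entry in entries[::-1]]
```

Saving in this reversed order means the entry at the top of the scraped page is saved last, so it gets the latest timestamp and the highest id.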

I hope this helps someone.