Simple web scraper formatting: how do I fix this?

Asked: 2014-10-31 15:29:21

Tags: python beautifulsoup urllib

I have this code:

import requests
from bs4 import BeautifulSoup



def posts_spider():
    url = 'http://www.reddit.com/r/nosleep/new/'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for link in soup.findAll('a', {'class': 'title'}):
        href = "http://www.reddit.com" + link.get('href')
        title = link.string
        print(title)
        print(href)
        print("\n")

def get_single_item_data():
    item_url = 'http://www.reddit.com/r/nosleep/new/'
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for rating in soup.findAll('div', {'class': 'score unvoted'}):
        print(rating.string)

posts_spider()
get_single_item_data()

The output is:

My light.. I'm seeing and feeling things.. what's happening?
http://www.reddit.com/r/nosleep/comments/2kw0nu/my_light_im_seeing_and_feeling_things_whats/


Why being the first to move in a new Subdivision is not the most brilliant idea...
http://www.reddit.com/r/nosleep/comments/2kw010/why_being_the_first_to_move_in_a_new_subdivision/


I Am Falling.
http://www.reddit.com/r/nosleep/comments/2kvxvt/i_am_falling/


Heidi
http://www.reddit.com/r/nosleep/comments/2kvrnf/heidi/


I remember everything
http://www.reddit.com/r/nosleep/comments/2kvrjs/i_remember_everything/


To Lieutenant Griffin Stone
http://www.reddit.com/r/nosleep/comments/2kvm9p/to_lieutenant_griffin_stone/


The woman in my room
http://www.reddit.com/r/nosleep/comments/2kvir0/the_woman_in_my_room/


Dr. Margin's Guide to New Monsters: The Guest, or, An Update
http://www.reddit.com/r/nosleep/comments/2kvhe5/dr_margins_guide_to_new_monsters_the_guest_or_an/


The Evil Woman (part 5)
http://www.reddit.com/r/nosleep/comments/2kva73/the_evil_woman_part_5/


Blood for the blood god, The first of many.
http://www.reddit.com/r/nosleep/comments/2kv9gx/blood_for_the_blood_god_the_first_of_many/


An introduction to the beginning of my journey
http://www.reddit.com/r/nosleep/comments/2kv8s0/an_introduction_to_the_beginning_of_my_journey/


A hunter..of sorts.
http://www.reddit.com/r/nosleep/comments/2kv8oz/a_hunterof_sorts/


Void Trigger
http://www.reddit.com/r/nosleep/comments/2kv84s/void_trigger/


What really happened to Amelia Earhart
http://www.reddit.com/r/nosleep/comments/2kv80r/what_really_happened_to_amelia_earhart/


I Used To Be Fine Being Alone
http://www.reddit.com/r/nosleep/comments/2kv2ks/i_used_to_be_fine_being_alone/


The Green One
http://www.reddit.com/r/nosleep/comments/2kuzre/the_green_one/


Elevator
http://www.reddit.com/r/nosleep/comments/2kuwxu/elevator/


Scary story told by my 4 year old niece- The Guy With Really Big Scary Claws
http://www.reddit.com/r/nosleep/comments/2kuwjz/scary_story_told_by_my_4_year_old_niece_the_guy/


Cranial Nerve Zero
http://www.reddit.com/r/nosleep/comments/2kuw7c/cranial_nerve_zero/


Mom's Story About a Ghost Uncle
http://www.reddit.com/r/nosleep/comments/2kuvhs/moms_story_about_a_ghost_uncle/


It snowed.
http://www.reddit.com/r/nosleep/comments/2kutp6/it_snowed/


The pocket watch I found at a store
http://www.reddit.com/r/nosleep/comments/2kusru/the_pocket_watch_i_found_at_a_store/


You’re Going To Die When You Are 23
http://www.reddit.com/r/nosleep/comments/2kur3m/youre_going_to_die_when_you_are_23/


The Customer: Part Two
http://www.reddit.com/r/nosleep/comments/2kumac/the_customer_part_two/


Dimenhydrinate
http://www.reddit.com/r/nosleep/comments/2kul8e/dimenhydrinate/


•
•
•
•
•
12
12
76
4
2
4
6
4
18
2
6
13
5
16
2
2
14
48
1
13

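What I want is to print each post's rating next to that post, so I can tell at a glance how many points it has, instead of printing the titles and links in one "block" and the rating numbers in another. Thanks in advance for your help!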

1 answer:

Answer 0 (score: 1)

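You can do it in a single pass by iterating over the div elements with class="thing" (think of it as iterating over the posts). For each div, get the link and the rating: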

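try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

from bs4 import BeautifulSoup
import requests

def posts_spider():
    url = 'http://www.reddit.com/r/nosleep/new/'
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    # Each div.thing wraps one post, so its title link and score stay paired
    for thing in soup.select('div.thing'):
        link = thing.find('a', {'class': 'title'})
        rating = thing.find('div', {'class': 'score'})
        href = urljoin("http://www.reddit.com", link.get('href'))

        print(link.string, href, rating.string)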

posts_spider()
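If you want to try the pairing logic without hitting the network, a rough sketch like the one below may help. It assumes Python 3. `SAMPLE_HTML` and `extract_posts` are hypothetical names made up for illustration, and the markup is a simplified stand-in for the listing page, not Reddit's actual HTML. It also skips entries without a title link and reports `None` when a post has no score element, two cases that would raise an `AttributeError` in the loop above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for a listing page (assumption).
SAMPLE_HTML = """
<div class="thing link">
  <div class="midcol unvoted"><div class="score unvoted">12</div></div>
  <a class="title" href="/r/nosleep/comments/2kvrnf/heidi/">Heidi</a>
</div>
<div class="thing link">
  <a class="title" href="/r/nosleep/comments/2kuwxu/elevator/">Elevator</a>
</div>
"""


def extract_posts(html, base_url="http://www.reddit.com"):
    """Return a list of (title, absolute_url, score) tuples, one per post."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for thing in soup.select("div.thing"):
        link = thing.find("a", {"class": "title"})
        if link is None:
            continue  # no title link, so this is not a post entry
        rating = thing.find("div", {"class": "score"})
        # A post may have no score element; report None instead of crashing
        score = rating.get_text(strip=True) if rating is not None else None
        title = link.get_text(strip=True)
        posts.append((title, urljoin(base_url, link.get("href")), score))
    return posts


for title, href, score in extract_posts(SAMPLE_HTML):
    print("{} | {} | {}".format(score, title, href))
```

Keeping the parsing in a function that takes an HTML string means you can pass it `requests.get(url).content` for the live page or a saved file while testing.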

FYI, div.thing is a CSS selector that matches every div element with class="thing".
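To see what the selector does and does not match, here is a small sketch on made-up markup (the HTML below is purely illustrative). It suggests that `select('div.thing')` picks the same elements as `find_all('div', class_='thing')`: a div qualifies when "thing" is one of its classes, while a different tag, or a class that merely contains the text "thing", does not.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only
html = """
<div class="thing link">post A</div>
<div class="thing">post B</div>
<div class="something">not a post</div>
<span class="thing">not a div</span>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector form: tag name "div" with the class "thing"
by_selector = soup.select("div.thing")

# Equivalent find_all form
by_find_all = soup.find_all("div", class_="thing")

texts = [div.get_text() for div in by_selector]
print(texts)
```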