Scraping thread hrefs

Time: 2017-12-11 12:28:23

Tags: python screen-scraping

import bs4
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import requests
import re
from pyquery import PyQuery as pq
from requests.exceptions import RequestException

my_url = 'http://club.baby.sina.com.cn/forum-112-1.html'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")
page_soup.find_all('span')

The code above returns:

[<span><a href="http://bbs.sina.com.cn/" target="_top"><img alt="新浪网" src="http://i1.sinaimg.cn/dy/images/header/2009/standardl2nav_sina_new.gif"/></a><a href="http://bbs.sina.com.cn/"><img alt="新浪论坛" src="http://i1.sinaimg.cn/book/main/bbssitemap/xlltbbs_logo.gif"/></a></span>,
 <span class="frameswitch">
 <a href="http://club.baby.sina.com.cn/index.php">亲子论坛</a> »            

            </span>,
 <span class="" id="username_tip_text"></span>,
 <span class="postbtn" id="newspecial" onmouseover="$$('newspecial').id = 'newspecialtmp';this.id = 'newspecial';showMenu(this.id)"><a href="http://club.baby.sina.com.cn/post.php?action=newthread&amp;fid=112&amp;extra=page%3D1" title="发新话题"><img alt="发新话题" src="http://www.sinaimg.cn/IT/deco/dzbbs/images/baby_blue/newtopic.gif"/></a></span>,
 <span class="postbtn" id="newspecial" style="margin-right:10px;"></span>,
 <span class="ad_mid" id="ad_mid" style="padding-right:10px;"> </span>,
 <span id="thread_3298101"><a href="thread-3298101-1-1.html" style="color: red" target="_blank">**关于育儿专家答疑论坛的一些问题**</a></span>,
 <span id="thread_2962220"><a href="thread-2962220-1-1.html" style="color: green" target="_blank">亲子专家张思莱免费讲座通知</a></span>,
 <span id="thread_2574079"><a href="thread-2574079-1-1.html" style="font-weight: bold;color: blue" target="_blank">轮滑是幼儿时期的一项最好的运动</a></span>,
 <span id="thread_925446"><a href="thread-925446-1-1.html" style="color: orange" target="_blank">关于婴儿身上的各种“记”(转贴)</a></span>,
 <span id="thread_992843"><a href="thread-992843-1-1.html" target="_blank">关于推荐婴幼儿钙和维生素D的参考摄入量的问题</a></span>,
 <span id="thread_992833"><a href="thread-992833-1-1.html" target="_blank">婴幼儿奶粉怎么选择?</a></span>,
 <span id="thread_992832"><a href="thread-992832-1-1.html" target="_blank">关于辅食添加的问题</a></span>,
 <span id="thread_989947"><a href="thread-989947-1-1.html" target="_blank">再一次希望妈妈好好看看我的文章:如何向医生介绍孩子病情</a></span>,
 <span id="thread_989930"><a href="thread-989930-1-1.html" target="_blank">婴幼儿便秘产生的原因及对策</a></span>,
 <span id="thread_11656752"><a href="thread-11656752-1-1.html" target="_blank">工晶体是不是越贵越好</a></span>,

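I am trying to parse the thread links out of this result so that I can visit each thread and scrape its first post. Can anyone help me extract the href from each of these a tags?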

1 answer:

Answer 0 (score: 0)

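First, finding every span element is too broad: it matches far more than you need. Let's limit the search to span elements whose id attribute value starts with thread: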

threads = page_soup.select('span[id^=thread]')

where ^= means "starts with".
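
To see what the selector does and does not match, here is a small self-contained sketch. The HTML in sample_html is a trimmed, hypothetical stand-in modelled on the output in the question (the link texts are placeholders), not the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed sample modelled on the output shown in the question.
sample_html = """
<span class="frameswitch"><a href="http://club.baby.sina.com.cn/index.php">index</a></span>
<span class="postbtn" id="newspecial"></span>
<span id="thread_992833"><a href="thread-992833-1-1.html" target="_blank">title A</a></span>
<span id="thread_992832"><a href="thread-992832-1-1.html" target="_blank">title B</a></span>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Every span on the page, including navigation and button spans.
all_spans = sample_soup.find_all("span")

# Only spans whose id starts with "thread"; id="newspecial" is excluded.
thread_spans = sample_soup.select("span[id^=thread]")

print(len(all_spans))                    # 4
print([s["id"] for s in thread_spans])   # ['thread_992833', 'thread_992832']
```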

Then, if you want the thread titles, just get the text of the inner a element:

threads = page_soup.select('span[id^=thread]')
for thread in threads:
    print(thread.a.get_text())

Prints:

**关于育儿专家答疑论坛的一些问题**
亲子专家张思莱免费讲座通知
轮滑是幼儿时期的一项最好的运动
...
剖蓓舒要火,因为一份爱!
小儿肺热有那些症状
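
The question asked for the href of each thread rather than its title. The same selector can be reused for that; the sketch below is one possible way, not the only one. The helper name extract_thread_links and the sample_html snippet are assumptions made for illustration. Because the hrefs in the output are relative (for example thread-992833-1-1.html), they are resolved against the forum page URL with urljoin from the standard library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# URL of the forum page from the question; relative hrefs are resolved against it.
base_url = "http://club.baby.sina.com.cn/forum-112-1.html"

# Hypothetical, trimmed sample modelled on the output shown in the question.
sample_html = """
<span class="postbtn" id="newspecial"></span>
<span id="thread_992833"><a href="thread-992833-1-1.html" target="_blank">title A</a></span>
<span id="thread_992832"><a href="thread-992832-1-1.html" target="_blank">title B</a></span>
"""


def extract_thread_links(html, base):
    """Return (title, absolute_url) pairs for spans whose id starts with 'thread'.

    Hypothetical helper, written for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for span in soup.select("span[id^=thread]"):
        anchor = span.a
        # Skip spans that have no link or a link without an href.
        if anchor is None or not anchor.get("href"):
            continue
        links.append((anchor.get_text(strip=True), urljoin(base, anchor["href"])))
    return links


thread_links = extract_thread_links(sample_html, base_url)
for title, url in thread_links:
    print(title, url)
```

With the live page, page_html from the question's code can be passed in place of sample_html; each resulting URL can then be fetched in turn to scrape the first post of that thread.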