Question

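I have been trying to scrape the information inside a specific set of p tags on a website and am running into a lot of trouble.

My code looks like this: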

import urllib   
import re

def scrape():
        url = "https://www.theWebsite.com"

        statusText = re.compile('<div id="holdsThePtagsIwant">(.+?)</div>')
        htmlfile = urllib.urlopen(url)
        htmltext = htmlfile.read()

        status = re.findall(statusText,htmltext)

        print("Status: " + str(status))
scrape()

Unfortunately, this only returns: "Status: []"

That said, I don't understand what I'm doing wrong, because when I test against the same website using the code

statusText = re.compile('<a href="/about">(.+?)</a>')

instead, I get exactly what I want: "Status: ['About', 'About']"

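Does anyone know what I can do to get the information inside the div tag? Or, more specifically, the single set of p tags that the div contains? I have tried plugging in every value I could think of and gotten nowhere. After searching Google, YouTube, and SO, I'm out of ideas.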

Answer 1

I use BeautifulSoup to extract information between HTML tags. Suppose you want to extract a div like this: <div class='article_body' itemprop='articleBody'>...</div>. Then you can use BeautifulSoup to pull out that div as follows:

soup = BeautifulSoup(htmltext, 'html.parser')  # create the BeautifulSoup object from the page source
ans = soup.find('div', {'class': 'article_body', 'itemprop': 'articleBody'})

See also the official bs4 documentation.
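
Since the question asks specifically about the p tags inside one div, here is a minimal sketch of that step. The HTML string is made up for illustration (only the div id comes from the question), and `extract_paragraphs` is a hypothetical helper name, not part of bs4:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page; only the div id
# "holdsThePtagsIwant" is taken from the question.
SAMPLE_HTML = """
<html><body>
<div id="holdsThePtagsIwant">
  <p>First status line</p>
  <p>Second status line</p>
</div>
<div id="other"><p>Unrelated</p></div>
</body></html>
"""


def extract_paragraphs(html, div_id):
    """Return the text of every <p> inside the div with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id=div_id)
    if container is None:
        # No such div in the document
        return []
    return [p.get_text(strip=True) for p in container.find_all("p")]


paragraphs = extract_paragraphs(SAMPLE_HTML, "holdsThePtagsIwant")
print(paragraphs)
```

On the sample markup this prints the two paragraphs inside the target div and skips the one in the other div. With a real page you would pass the downloaded HTML in place of `SAMPLE_HTML`.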

As an example, I have edited your code to extract that div from a Bloomberg article; you can adapt it to your own site:

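import urllib  # Python 2; on Python 3 use urllib.request.urlopen instead
from bs4 import BeautifulSoup

def scrape():
    url = 'http://www.bloomberg.com/news/2014-02-20/chinese-group-considers-south-africa-platinum-bids-amid-strikes.html'
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(htmltext, 'html.parser')
    # Find the first div whose class and itemprop attributes both match
    ans = soup.find('div', {'class': 'article_body', 'itemprop': 'articleBody'})
    print(ans)

scrape()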

You can get BeautifulSoup from here.

P.S.: I use scrapy and BeautifulSoup for web scraping, and I'm happy with both.
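
As for why the original regex returns an empty list: I can't see the actual page, but one likely cause is that the div's contents span several lines. By default `.` does not match a newline, so `(.+?)` can never reach `</div>`, while the `<a href="/about">` link sits on a single line and matches fine. A small sketch on made-up markup (the HTML below is an assumption, not the real site):

```python
import re

# Hypothetical markup: the div's contents span several lines, the link does not.
html = '''<a href="/about">About</a>
<div id="holdsThePtagsIwant">
<p>one</p>
<p>two</p>
</div>'''

link = re.compile(r'<a href="/about">(.+?)</a>')
div_single_line = re.compile(r'<div id="holdsThePtagsIwant">(.+?)</div>')
div_dotall = re.compile(r'<div id="holdsThePtagsIwant">(.+?)</div>', re.DOTALL)

print(link.findall(html))             # the link is on one line, so it matches
print(div_single_line.findall(html))  # "." stops at newlines, so nothing matches
print(div_dotall.findall(html))       # with re.DOTALL, "." also matches newlines
```

Even with `re.DOTALL`, the non-greedy pattern stops at the first `</div>`, so it breaks as soon as the target div contains another div. That is the main reason I would still reach for an HTML parser here.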
