Scraping images and text with Python

Date: 2014-01-12 00:57:06

标签: python web-scraping beautifulsoup

Using Python, how would you scrape images and text from a website? For example, I want to scrape the images and text here. Which Python tools/libraries would I use? Any tutorials?

2 Answers:

Answer 0 (score: 1)

Please don't use regular expressions; parse the HTML instead.

Usually I use the following combination of tools:

  • the requests module
  • lxml.html
  • beautifulsoup4 to detect the website's encoding

One approach looks like this; I hope you get the idea (the code only illustrates the concept and is untested):

import lxml.html
import requests
from cssselect import HTMLTranslator, SelectorError
from bs4 import UnicodeDammit

# First do the HTTP request with the requests module
r = requests.get('http://example.com')
html = r.content  # raw bytes, so UnicodeDammit can detect the encoding

# Try to parse/decode the HTML result with lxml and beautifulsoup4
try:
    doc = UnicodeDammit(html, is_html=True)
    parser = lxml.html.HTMLParser(encoding=doc.declared_html_encoding)
    dom = lxml.html.document_fromstring(html, parser=parser)
    dom.resolve_base_href()
except Exception as e:
    print('Some error occurred while lxml tried to parse: {}'.format(e))
    raise

# Try to extract all the data we are interested in with CSS selectors!
try:
    results = dom.xpath(HTMLTranslator().css_to_xpath('some css selector to target the DOM'))
    for e in results:
        # access elements like
        print(e.get('href'))      # the href attribute
        print(e.text_content())   # the content as text
        # or process further
        found = e.xpath(HTMLTranslator().css_to_xpath('h3.r > a:first-child'))
except SelectorError as e:
    print('Invalid CSS selector: {}'.format(e))
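
Since the question also asks about images, below is a minimal sketch (not part of the original answer) of how the same requests/lxml tools could be used to download the image files themselves; the URL, the images/ output directory and the //img selector are illustrative assumptions.

import os
import lxml.html
import requests

# Hypothetical sketch: download every <img> found on a page.
r = requests.get('http://example.com')
dom = lxml.html.document_fromstring(r.content)
dom.make_links_absolute('http://example.com')  # turn relative src values into absolute URLs

os.makedirs('images', exist_ok=True)
for img in dom.xpath('//img[@src]'):
    src = img.get('src')
    name = os.path.basename(src.split('?')[0]) or 'image'
    data = requests.get(src).content  # fetch the raw image bytes
    with open(os.path.join('images', name), 'wb') as f:
        f.write(data)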

Answer 1 (score: 0)

requests, scrapy, BeautifulSoup

Scrapy is optional, but requests is becoming the unofficial standard, and I haven't seen a better parsing tool than BS.
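
For completeness, here is a minimal, hedged sketch of the requests + BeautifulSoup combination this answer recommends; the URL is a placeholder and the snippet only shows the basic pattern for pulling out text and image URLs.

import requests
from bs4 import BeautifulSoup

# Hypothetical sketch of the requests + BeautifulSoup combination.
r = requests.get('http://example.com')
soup = BeautifulSoup(r.content, 'html.parser')

# All visible text on the page
print(soup.get_text(separator='\n', strip=True))

# The source URL of every image on the page
for img in soup.find_all('img'):
    print(img.get('src'))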