Question

I'm looking for something like this:

data = '''
**this is some markdown**
blah blah blah
![image here](http://somewebsite.com/image1.jpg)
![another image here](http://anotherwebsite.com/image2.jpg)
'''

print(get_images_url_from_markdown(data))

which returns a list of the image URLs found in the text:

['http://somewebsite.com/image1.jpg', 'http://anotherwebsite.com/image2.jpg']

Is there something available for this, or do I have to scrape the Markdown myself with BeautifulSoup?

Answer 1

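Python-Markdown has an extensive Extension API. In fact, the Table of Contents extension does essentially what you want, only with headings rather than images, plus a bunch of other things you don't need (such as adding unique id attributes and building a nested list for the TOC).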

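After the document is parsed, it is held in an ElementTree object, and you can use a treeprocessor to extract the data you want before the tree is serialized to text. Be aware that any images included as raw HTML will not be found this way (in that case, you would need to parse the HTML output and extract them from there).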
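For the raw-HTML case mentioned above, here is a minimal sketch of pulling image URLs out of the rendered HTML using only the standard library's html.parser. The names ImgSrcParser and get_images_url_from_html are hypothetical and chosen for illustration; BeautifulSoup or another HTML parser would work just as well.

```python
from html.parser import HTMLParser


class ImgSrcParser(HTMLParser):
    """Hypothetical helper: collect the src of every <img> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        # Self-closing tags such as <img ... /> are routed here as well.
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.images.append(src)


def get_images_url_from_html(html):
    """Return the src attributes of all <img> tags in an HTML string."""
    parser = ImgSrcParser()
    parser.feed(html)
    parser.close()
    return parser.images
```

You would pass it the string returned by the Markdown conversion, so it sees both Markdown-syntax images and images written as raw HTML.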

Start by following this tutorial, except that you need to create a treeprocessor rather than an inline Pattern. You should end up with something like this:

import markdown
from markdown.treeprocessors import Treeprocessor
from markdown.extensions import Extension

# First create the treeprocessor

class ImgExtractor(Treeprocessor):
    def run(self, doc):
        "Find all images and append to md.images."
        # In Python-Markdown 3.x the Markdown instance is available as
        # self.md (it was self.markdown in 2.x).
        self.md.images = []
        for image in doc.findall('.//img'):
            self.md.images.append(image.get('src'))

# Then tell markdown about it

class ImgExtExtension(Extension):
    def extendMarkdown(self, md):
        img_ext = ImgExtractor(md)
        # Priority 15 runs after the built-in 'inline' treeprocessor (20).
        # In 2.x this was md.treeprocessors.add('imgext', img_ext, '>inline')
        # and extendMarkdown also took an md_globals argument.
        md.treeprocessors.register(img_ext, 'imgext', 15)

# Finally create an instance of the Markdown class with the new extension

md = markdown.Markdown(extensions=[ImgExtExtension()])

# Now let's test it out:

data = '''
**this is some markdown**
blah blah blah
![image here](http://somewebsite.com/image1.jpg)
![another image here](http://anotherwebsite.com/image2.jpg)
'''
html = md.convert(data)
print(md.images)

The above outputs:

['http://somewebsite.com/image1.jpg', 'http://anotherwebsite.com/image2.jpg']

If you really want a function that returns a list, just wrap all of that up in one and you're good to go.

How do I get a list of image URLs from a Markdown file in Python?
