Question

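I'm trying to extract specific strings from markup and save them (for more complex processing on this line later). Say, for example, I read a line from a file and the current line is: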

<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg"  WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">

But I want to store:

tempUrl = 'http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg'

tempWidth = 500

tempHeight = 375

tempAlt = 'Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road'

How would I go about doing this in Python?

Thanks

Answer 1

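While there are several ways to do this, I'd recommend using an HTML parser, which is extensible and copes with many of the quirks found in real-world HTML. Here's a working example with BeautifulSoup: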

>>> from bs4 import BeautifulSoup
>>> string = """<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg"  WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">"""
>>> soup = BeautifulSoup(string, 'html.parser')
>>> for attr in ['width', 'height', 'alt']:
...     print('temp{} = {}'.format(attr.title(), soup.img[attr]))
...
tempWidth = 500
tempHeight = 375
tempAlt = Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road
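
If installing BeautifulSoup isn't an option, roughly the same idea can be sketched with the standard library's `html.parser`. This is only a minimal sketch: the class name `ImgAttrParser` and the `img_attrs` attribute are made up for illustration, and it assumes you only care about the first `<img>` tag on the line.

```python
from html.parser import HTMLParser


class ImgAttrParser(HTMLParser):
    """Collects the attributes of the first <img> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.img_attrs = None

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so WIDTH arrives as 'width'.
        if tag == 'img' and self.img_attrs is None:
            self.img_attrs = dict(attrs)


line = '<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg"  WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">'

parser = ImgAttrParser()
parser.feed(line)
parser.close()

attrs = parser.img_attrs
tempUrl = attrs['src']
tempWidth = int(attrs['width'])    # attribute values are strings; convert explicitly
tempHeight = int(attrs['height'])
tempAlt = attrs['alt']
```

Unlike the loop above, this also stores `src` and converts the dimensions to integers, which is what the question asked for.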

Answer 2

Regular expression approach:

import re

string = "YOUR STRING"
matches = re.findall(r'src="(.*?)".*WIDTH="(.*?)".*HEIGHT="(.*?)".*alt="(.*?)"', string)[0]
tempUrl = matches[0]
tempWidth = matches[1]
tempHeight = matches[2]
tempAlt = matches[3]

All the values are strings, so cast them if you need to...
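
As a rough sketch of that conversion, assuming the attributes appear in the same order as in the question's line. The sample `line` and the names `width_text` and `height_text` are made up for illustration:

```python
import re

# Shortened, hypothetical sample line with the same attribute order as the question.
line = '<img src="http://example.com/a.jpg" WIDTH="500" HEIGHT="375" alt="A waterfall">'
pattern = r'src="(.*?)".*WIDTH="(.*?)".*HEIGHT="(.*?)".*alt="(.*?)"'

match = re.search(pattern, line)
if match is None:
    # Fail with a clear message instead of an IndexError on an empty findall() result.
    raise ValueError('line does not contain the expected img attributes')

tempUrl, width_text, height_text, tempAlt = match.groups()
tempWidth = int(width_text)     # '500' -> 500
tempHeight = int(height_text)   # '375' -> 375
```

`int()` raises `ValueError` if the attribute holds something like `"100%"`, so wrap the conversion in a `try` block if your input is not guaranteed to be numeric.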

And be aware that copy/pasting regular expressions is a bad idea. It's easy to get wrong.
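
To illustrate one way the pattern above can go wrong: it depends on the attribute order and on the exact upper/lower case of `WIDTH` and `HEIGHT`. A slightly less brittle sketch looks up each attribute on its own. The helper `get_attr` and the `reordered` sample line are hypothetical, and this is still no substitute for a real HTML parser (it only handles double-quoted values, for instance).

```python
import re


def get_attr(line, name):
    """Return the double-quoted value of attribute `name`, or None if it is absent."""
    pattern = r'\b' + re.escape(name) + r'\s*=\s*"([^"]*)"'
    match = re.search(pattern, line, re.IGNORECASE)
    return match.group(1) if match else None


# Same attributes as before, but in a different order and in lower case.
reordered = '<img alt="A waterfall" height="375" src="http://example.com/a.jpg" width="500">'

# The single combined pattern finds nothing on this line...
original_pattern = r'src="(.*?)".*WIDTH="(.*?)".*HEIGHT="(.*?)".*alt="(.*?)"'
combined_result = re.findall(original_pattern, reordered)

# ...while the per-attribute lookups still work.
tempUrl = get_attr(reordered, 'src')
tempWidth = get_attr(reordered, 'WIDTH')
tempHeight = get_attr(reordered, 'HEIGHT')
tempAlt = get_attr(reordered, 'alt')
```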

How to extract specific strings in Python
