How to extract specific strings in Python

Time: 2016-12-15 17:10:48

Tags: python

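I'm trying to extract specific strings from inside a tag and save them (for more complex processing on this line). For example, say I read a line from a file and the current line is: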

<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg"  WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">

But I want to store:

tempUrl = 'http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg'

tempWidth = 500

tempHeight = 375

tempAlt = 'Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road'

How would I do this in Python?

Thanks

2 answers:

Answer 0: (score: 3)

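Although there are several ways to do this, I'd recommend an HTML parser: it is extensible and copes with many of the problems found in real-world HTML. Here is a working example using BeautifulSoup: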

>>> from bs4 import BeautifulSoup
>>> string = """<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg"  WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">"""
>>> soup = BeautifulSoup(string, 'html.parser')
>>> for attr in ['width', 'height', 'alt']:
...     print('temp{} = {}'.format(attr.title(), soup.img[attr]))
...
tempWidth = 500
tempHeight = 375
tempAlt = Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road
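
If installing BeautifulSoup is not an option, the same idea can probably be sketched with the standard library's `html.parser` module alone. This is only a minimal sketch, not part of the answer above: the class name `ImgAttrParser` and the variable `line` are made up for illustration, and it assumes the line contains at least one `<img>` tag with all four attributes. It also stores `src` and converts the width and height to integers, as the question asks.

```python
from html.parser import HTMLParser


class ImgAttrParser(HTMLParser):
    """Hypothetical helper: collects the attributes of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, so WIDTH becomes 'width'.
        if tag == 'img':
            self.images.append(dict(attrs))


# The sample line from the question (assumed to hold one <img> tag).
line = ('<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg"'
        '  WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards '
        'Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center>'
        '<br clear="all"><br clear="all">')

parser = ImgAttrParser()
parser.feed(line)
parser.close()

img = parser.images[0]            # raises IndexError if no <img> was found
tempUrl = img['src']
tempWidth = int(img['width'])     # attribute values are strings, so convert
tempHeight = int(img['height'])
tempAlt = img['alt']
```

Keeping the attributes in a dict means a missing attribute shows up as a `KeyError` at the lookup, which is usually easier to debug than a silently wrong match.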

Answer 1: (score: 0)

A regular expression approach:

import re

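string = "YOUR STRING"  # the line read from the file
# findall returns one tuple of captured groups per match; [0] takes the first.
# The pattern assumes the attributes appear in exactly this order and case.
matches = re.findall(r'src="(.*?)".*WIDTH="(.*?)".*HEIGHT="(.*?)".*alt="(.*?)"', string)[0]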
tempUrl = matches[0]
tempWidth = matches[1]
tempHeight = matches[2]
tempAlt = matches[3]

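All the values are strings, so convert them (for example with int()) if you need numbers.

Also, be aware that copying and pasting regular expressions is a bad idea; it is easy to get them wrong.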
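
One way to make the regex approach a little less fragile is to look up each attribute separately, so the order and case of the attributes stop mattering. The following is a sketch under assumptions, not a general HTML solution: `get_attr` and `line` are hypothetical names, and it assumes every value is wrapped in double quotes and contains no double quote itself.

```python
import re


def get_attr(name, text):
    """Return the double-quoted value of attribute `name`, or None if absent."""
    # \b stops the name from matching the tail of a longer word;
    # IGNORECASE lets 'width' match WIDTH.
    pattern = r'\b{}="([^"]*)"'.format(re.escape(name))
    match = re.search(pattern, text, re.IGNORECASE)
    return match.group(1) if match else None


# The sample line from the question.
line = ('<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg"'
        '  WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards '
        'Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center>'
        '<br clear="all"><br clear="all">')

tempUrl = get_attr('src', line)
tempWidth = int(get_attr('width', line))    # convert the string values to int
tempHeight = int(get_attr('height', line))
tempAlt = get_attr('alt', line)
```

This still matches the first occurrence anywhere on the line, not specifically inside the `<img>` tag, so for anything beyond a known, fixed input format a real HTML parser remains the safer choice.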