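I'm trying to extract specific strings from markup and save them (for more complex processing on this line). For example, say I read a line from a file and the current line is: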
<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg" WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">
But I want to store:
tempUrl = 'http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg'
tempWidth = 500
tempHeight = 375
tempAlt = 'Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road'
How would I do this in Python?
Thanks.
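Answer 0 (score: 3)
While there are several ways to do this, I'd recommend using an HTML parser, which is extensible and can handle many of the quirks found in HTML. Here is a working example with BeautifulSoup: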
>>> from bs4 import BeautifulSoup
>>> string = """<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg" WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">"""
>>> soup = BeautifulSoup(string, 'html.parser')
>>> for attr in ['width', 'height', 'alt']:
...     print('temp{} = {}'.format(attr.title(), soup.img[attr]))
...
tempWidth = 500
tempHeight = 375
tempAlt = Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road
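The example above prints the values rather than storing them, and it skips src. As a rough sketch of the same parser-based idea that keeps the four values in variables, here is a variant that swaps BeautifulSoup for the standard library's html.parser module, so nothing needs to be installed. The class name ImgAttrParser and the variable line are made up for illustration, and the code assumes the line contains at least one img tag with all four attributes.

```python
from html.parser import HTMLParser


class ImgAttrParser(HTMLParser):
    """Hypothetical helper: remembers the attributes of the first <img> tag."""

    def __init__(self):
        super().__init__()
        self.img_attrs = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so WIDTH arrives as 'width'.
        if tag == 'img' and self.img_attrs is None:
            self.img_attrs = dict(attrs)


# The sample line from the question, split across literals for readability.
line = (
    '<center><img border="0" '
    'src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg" '
    'WIDTH="500" HEIGHT="375" '
    'alt="Looking up the Merced River Canyon towards Bridalveil Fall '
    'from the Big Oak Flat Road" ***PINIT***></center>'
    '<br clear="all"><br clear="all">'
)

parser = ImgAttrParser()
parser.feed(line)
parser.close()

attrs = parser.img_attrs  # stays None if the line has no <img> tag
tempUrl = attrs['src']
tempWidth = int(attrs['width'])    # attribute values are strings; convert explicitly
tempHeight = int(attrs['height'])
tempAlt = attrs['alt']
```

Stray tokens such as ***PINIT*** simply show up as an extra attribute with no value, so they do not get in the way of the lookups.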
Answer 1 (score: 0)
Regex approach:
import re
string = "YOUR STRING"
matches = re.findall(r'src="(.*?)".*WIDTH="(.*?)".*HEIGHT="(.*?)".*alt="(.*?)"', string)[0]
tempUrl = matches[0]
tempWidth = matches[1]
tempHeight = matches[2]
tempAlt = matches[3]
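All values are strings, so cast them if you need to (for example, with int()).
Also, be aware that copying and pasting regular expressions is a bad idea; it is easy to introduce mistakes.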
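The single pattern above depends on the attributes appearing in exactly that order and with exactly that capitalization. A slightly more forgiving sketch, still regex-based, looks up each attribute on its own and casts the numeric ones. The helper get_attr and the variable line are hypothetical names, and the code assumes double-quoted attribute values and that the line holds only one tag carrying these attributes.

```python
import re

# The sample line from the question, split across literals for readability.
line = (
    '<center><img border="0" '
    'src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg" '
    'WIDTH="500" HEIGHT="375" '
    'alt="Looking up the Merced River Canyon towards Bridalveil Fall '
    'from the Big Oak Flat Road" ***PINIT***></center>'
    '<br clear="all"><br clear="all">'
)


def get_attr(name, text):
    """Hypothetical helper: return the double-quoted value of `name`, or None."""
    # (?<![\w-]) keeps 'src' from matching inside names such as 'data-src'.
    pattern = r'(?<![\w-]){}="([^"]*)"'.format(re.escape(name))
    match = re.search(pattern, text, re.IGNORECASE)
    return match.group(1) if match else None


tempUrl = get_attr('src', line)
tempWidth = int(get_attr('width', line))    # int(None) raises TypeError if missing
tempHeight = int(get_attr('height', line))
tempAlt = get_attr('alt', line)
```

Because each lookup is independent, reordering the attributes or changing WIDTH to width does not break it, though a real HTML parser remains the safer choice for anything less regular than this one line.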