Question

I'm trying to find the value of an attribute in a string. Given <img src="invalidURL.com">, if the attribute/substring is src, I want to get back invalidURL.com.

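Violent Python uses the line imgSrc = imgTag['src'], which produces no compile error, and the script runs fine. (The full script can be found in this Github repo.) But when I try to write my own script, it produces a compile error.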

htmlImgTags = ['<img src="/images/icons/product/chrome-48.png"/>', '<img src="asdasd">']
for tag in htmlImgTags:
    print tag
    tagSrc = tag['src'] 
    print tagSrc

The error complains about using a string as an index instead of an int.

<img src="/images/icons/product/chrome-48.png"/>
Traceback (most recent call last):
  File "looking in an array.py", line 4, in <module>
    tagSrc = tag['src'] 
TypeError: string indices must be integers, not str

What exactly is wrong with my code that isn't wrong in the book's?

Answer 1

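The code you linked appears to use a library called Beautiful Soup to parse the HTML. Its loop runs over a list of tag objects created by Beautiful Soup, not over a list of the actual tag text.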

Here is an example using Beautiful Soup v3:

from BeautifulSoup import BeautifulSoup

html_doc = """
<img src="/images/icons/product/chrome-48.png"/>
<img src="/images/icons/product/chrome-49.png"/>
"""

soup = BeautifulSoup(html_doc)
html_img_tags = soup.findAll("img")

for tag in html_img_tags:
  print tag['src']

The output is:

/images/icons/product/chrome-48.png
/images/icons/product/chrome-49.png

Note that tag is not just a string; it is a BeautifulSoup Tag object:

>>> type(html_img_tags[0])
<class 'BeautifulSoup.Tag'>

If you print it, it shows up as a well-formed tag:

>>> print html_img_tags[0]
<img src="/images/icons/product/chrome-48.png" />

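But that is only because BeautifulSoup makes the object convert itself to that string for easy inspection. Indexing with ['src'] works on the Tag object, whereas a plain Python string only accepts integer indices, which is why your version raises the TypeError.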

Note: if you happen to have BS4 on your machine, the import line should be:

from bs4 import BeautifulSoup

... and the findAll() function is now find_all().

Answer 2

Try this:

import re
tag = '<img src="/images/icons/product/chrome-48.png"/>'
src = re.findall('src=(\".*?\")', tag)
print src # prints ['"/images/icons/product/chrome-48.png"']
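
Note that the pattern above keeps the surrounding double quotes in each match. A minimal sketch of a variant that captures only the value, assuming the attribute is always double-quoted; the helper name get_attr is hypothetical and not part of the original answer:

```python
import re

def get_attr(tag, name):
    # Hypothetical helper, not from the original answer.
    # Matches name="value" preceded by whitespace and captures only the
    # text between the double quotes.
    pattern = r'\s' + re.escape(name) + r'\s*=\s*"([^"]*)"'
    match = re.search(pattern, tag)
    return match.group(1) if match else None

tag = '<img src="/images/icons/product/chrome-48.png"/>'
print(get_attr(tag, 'src'))
```

It returns None when the attribute is missing, so the caller can tell "not found" apart from an empty value.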

Answer 3

This would be less error-prone:

for tag in htmlImgTags:
    if tag.startswith('<img src'):
        tag = tag.split('"')[1]
        print tag
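
The split above relies on src being the first quoted attribute in the tag. A rough sketch of a string-only variant that instead locates src=" with str.find and slices up to the closing quote; the function name src_by_index is made up for illustration, and it still assumes double-quoted values:

```python
def src_by_index(tag, marker='src="'):
    # Hypothetical helper: find where the marker starts, then slice from
    # the end of the marker up to the next double quote.
    start = tag.find(marker)
    if start == -1:
        return None  # attribute not present
    start += len(marker)
    end = tag.find('"', start)
    if end == -1:
        return None  # no closing quote
    return tag[start:end]

html_img_tags = ['<img src="/images/icons/product/chrome-48.png"/>',
                 '<img alt="logo" src="asdasd">']
for tag in html_img_tags:
    print(src_by_index(tag))
```

This is still plain substring matching: an attribute such as data-src=" would also be picked up, so it is only suitable for input whose shape is known in advance.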

Answer 4

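The source code you linked uses a library called BeautifulSoup to parse the HTML. You seem to be trying to do this by hand, I assume for educational purposes.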

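You have a few options.

Use an HTML parsing engine, as Violent Python does. This is the recommended approach.
Use regular expressions, which is not recommended for parsing XML.
Compute the position of the URL and index into the string. This works only if your input is always in exactly the form you showed.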
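
As a rough illustration of the first option without installing anything, the standard library's HTMLParser can collect attribute values. This is a sketch, not the book's code (which uses BeautifulSoup); the class name ImgSrcCollector is invented here, and the import fallback is only there so the same snippet runs on both Python 2 and Python 3:

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

class ImgSrcCollector(HTMLParser):
    # Hypothetical helper class: records the src of every <img> tag fed in.
    def __init__(self):
        HTMLParser.__init__(self)
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        # Self-closing tags such as <img .../> also end up here by default.
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    self.sources.append(value)

parser = ImgSrcCollector()
parser.feed('<img src="/images/icons/product/chrome-48.png"/><img src="asdasd">')
print(parser.sources)
```

Unlike the string-based approaches, this does not depend on attribute order or on the exact spacing inside the tag.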

How do I search for a substring value in a string?
