Question

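I posted this question yesterday, but everyone suggested that I use the BeautifulSoup library. I'm not allowed to use any external libraries in class, but I have gotten further since then. The code should open a given website and append whatever text sits between header tags to a list. This is an introductory course, so I understand I may be asking something very simple. How do I fix the errors? In particular, the problem is where my "findHeader" variable is declared.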

Edit:

Traceback (most recent call last):
  File "C:\Users\Cameron\Desktop\website header search.py", line 16, in <module>
    if (findHeader, headerEnd) in line:
TypeError: 'in <string>' requires string as left operand, not tuple

from urllib.request import urlopen
address = "http://www.hobo-web.co.uk/headers/"
webPage = urlopen (address)

list = []

encoding = "utf-8"
for line in webPage:
    line = str(line, encoding)
    findHeader = ('h1', 'h2', 'h3', 'h4', 'h5', 'h6')
    headerEnd = ('/h1', '/h2', '/h3', '/h4', '/h5', '/h6')
    if (findHeader, headerEnd) in line:
        start = line.index(findHeader, headerEnd) + len(findHeader, headerEnd)
        last = line.index('"', start)
        list.append(line[start : last])

webPage.close()

Answer 1

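if (findHeader, headerEnd) in line: As the traceback says, you can't check whether a tuple is in a string. I assume you are trying to check whether any of the tags is in line. This is a job for any.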

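# findHeader and headerEnd are themselves tuples of strings, so join them
# and test each tag string individually.
if any(tag in line for tag in findHeader + headerEnd):
    do_things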
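To try this check in isolation, here is a minimal sketch. The helper name `has_header_tag` and the two sample lines are assumptions made up for illustration; the tuples are the ones from the question, joined with `+` so that `any` tests one string at a time.

```python
findHeader = ('h1', 'h2', 'h3', 'h4', 'h5', 'h6')
headerEnd = ('/h1', '/h2', '/h3', '/h4', '/h5', '/h6')

def has_header_tag(line):
    """Return True if any opening or closing header tag name occurs in line."""
    # Concatenating the tuples gives one flat tuple of strings to test.
    return any(tag in line for tag in findHeader + headerEnd)

# Hypothetical sample lines, not taken from the real page.
print(has_header_tag('<h2>Page title</h2>'))     # True
print(has_header_tag('<p>plain paragraph</p>'))  # False
```

Note that this only tells you that some tag name is present; it does not tell you which one, so extracting the text still needs a per-tag loop.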

Answer 2

If your case is simple, I would suggest using a simple regular expression.

import re

line = 'I am a <h1>jedi</h1> and you are not'
regex = re.compile('<h[0-9]>(.*)</h[0-9]>')
match = regex.search(line)
if match:
    print(match.group(1))

This prints:

jedi
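The pattern above accepts any digit and does not require the closing tag to match the opening one. A slightly stricter variant, offered as a sketch rather than as part of the original answer, restricts the level to 1-6, uses a backreference so that `<h1>` only pairs with `</h1>`, and uses a non-greedy group so that several headers on one line are found separately. The name `header_re` and the sample line are made up for illustration.

```python
import re

# ([1-6]) captures the header level; \1 requires the same level in the closing tag.
# (.*?) is non-greedy so each header on the line is matched on its own.
header_re = re.compile(r'<h([1-6])>(.*?)</h\1>')

line = '<h1>Jedi</h1> some text <h2>Sith</h2>'
contents = [match.group(2) for match in header_re.finditer(line)]
print(contents)  # ['Jedi', 'Sith']
```

Like the original pattern, this assumes tags without attributes and a header that opens and closes on the same line.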

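For completeness: you cannot look for a tuple in a string, just as you cannot look for a list in a string. If you really need to follow that approach, you have to check each element of the list against your line.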

Answer 3

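Here is another very simple solution to the problem. I believe you want to search for matching headers (e.g. <h1> and </h1>) on the same line. This is a very basic solution that does not use any external libraries: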

findHeader = ('<h1>', '<h2>', '<h3>', '<h4>', '<h5>', '<h6>')

line = 'This is the <h1>header content</h1> and this is not'
for startHeader in findHeader:
    endHeader = '</'+startHeader[1:]
    if (startHeader in line) and (endHeader in line):
        content = line.split(startHeader)[1].split(endHeader)[0]
        print(content)

This prints:

header content

Putting it into your code:

from urllib.request import urlopen

address = "http://www.w3schools.com/html/html_head.asp"
webPage = urlopen (address)

encoding = "utf-8"

for line in webPage:
    findHeader = ('<h1>', '<h2>', '<h3>', '<h4>', '<h5>', '<h6>')
    line = str(line, encoding)

    for startHeader in findHeader:        
        endHeader = '</'+startHeader[1:]
        if (startHeader in line) and (endHeader in line):
            content = line.split(startHeader)[1].split(endHeader)[0]
            print (content)



webPage.close()
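The same loop can be wrapped in a function so it can be tried without a network connection. This is a sketch under assumptions: `extract_headers` and `sample_page` are hypothetical names, and the list of bytes stands in for the lines that iterating over the `urlopen` result would yield.

```python
def extract_headers(lines, encoding="utf-8"):
    """Collect the text between matching <hN> and </hN> tags found on a single line."""
    start_tags = ('<h1>', '<h2>', '<h3>', '<h4>', '<h5>', '<h6>')
    found = []
    for raw in lines:
        line = str(raw, encoding)  # each item is assumed to be bytes, as from urlopen
        for start_tag in start_tags:
            end_tag = '</' + start_tag[1:]
            if start_tag in line and end_tag in line:
                found.append(line.split(start_tag)[1].split(end_tag)[0])
    return found

# Hypothetical stand-in for the lines of a downloaded page.
sample_page = [
    b'<html><body>\n',
    b'  <h1>Main title</h1>\n',
    b'  <p>no header here</p>\n',
    b'  <h3>Sub section</h3>\n',
]
print(extract_headers(sample_page))  # ['Main title', 'Sub section']
```

As with the loop above, tags that carry attributes (for example a class) or headers that span more than one line are not found, and only the first header of each level on a line is returned.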

Answer 4

As your error message says:

TypeError: 'in <string>' requires string as left operand, not tuple

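When you say "look for something in this string", Python expects the thing you are looking for to be a string as well. If you want to check whether a string (line) contains at least one of several possible options (findHeader and/or headerEnd), you should iterate over the options and check each one to see whether it is present.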
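The difference can be reproduced in a few lines. This is a minimal sketch; the sample `line` and the shortened `options` tuple are assumptions chosen for illustration.

```python
line = 'Some <h1>title</h1> text'
options = ('h1', 'h2')

# A tuple on the left-hand side of "in <str>" raises TypeError.
raised = False
try:
    options in line
except TypeError as error:
    raised = True
    print('TypeError:', error)

# Testing each option separately works, because each one is a string.
results = {option: option in line for option in options}
print(results)  # {'h1': True, 'h2': False}
```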

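There are many ways to handle this kind of iteration and checking in Python. Some are one-liners, others take several lines. In my opinion, the most readable way is to actually write a loop that iterates over the list of possibilities (findHeader) and checks whether each value is present. If one of the values is present, break out of the loop and check whether the corresponding closing tag (headerEnd) is present too.

Here is a revision of your code that performs this kind of check in a very readable way: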

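for line in webPage.split("\n"):
    # webPage is assumed to be an already-decoded str here (see the notes below),
    # so the str(line, encoding) conversion from the original loop is not needed.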
    findHeader = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']
    headerEnd = ['/h1', '/h2', '/h3', '/h4', '/h5', '/h6']
    headerIndexNumber = -1
    for i in range(len(findHeader)):
        # Attempt to find the start of a header in the line
        if(( '<' + findHeader[i]) in line):
            # The line contains what appears to be the start of a header
            headerIndexNumber = i
            break
        # End if
    # End for

    # Check if the for loop above found a header index
    if(headerIndexNumber >= 0):
        # Great, we found a header index number in the line above
        # Now let's check for a respective closing tag.
        if(('<' + headerEnd[headerIndexNumber]) in line):
            # Cool, the line also appears to contain a closing tag for
            # the same type of header.

            ## ... <YOUR CODE HERE FOR DOING SOMETHING EITHER BETWEEN
            ## ...  OR WITH THE HTML HEADER TAGS> ...
            pass  # placeholder so the if block has a body and the snippet parses

        # End if(header closing tag was found in line)
    # End if(header start tag was found in line)
# End foreach loop (line in webPage)

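Obviously, this is only a modified chunk of code. If you decide to go with this solution, you will need to fit it into your existing code, and you will still need to write the inner logic (that is, the code that handles what to do once a header tag has been found in the line).

Still, I tried to write this code in a very readable and easy-to-follow way, with comments explaining what each line does. If anything in the code above does not make sense, leave a comment and I will try to explain it.

A few notes about the code provided: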

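I am doing for line in webPage.split("\n"). I tested this locally with webPage set to a string containing the raw HTML source of the web page. The page source therefore has to be split into individual lines; otherwise the for loop would iterate over every character of the page's HTML rather than over whole lines. If this is not relevant to your code, just remove the split call.
When I check the line for a header tag, I prepend < to the value I am looking for, because HTML tags always start with <. This prevents false-positive matches when the line merely has "h1" written in it as plain text.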
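Both notes can be checked with a short sketch. The `page_source` string below is a made-up example, not the real page.

```python
# Hypothetical page source held in a single already-decoded string.
page_source = "<p>See the h1 tag docs</p>\n<h1>Real header</h1>\n"

# Iterating over the string itself yields single characters, not lines.
first_items = list(page_source)[:3]
print(first_items)  # ['<', 'p', '>']

for line in page_source.split("\n"):
    bare_match = 'h1' in line   # also matches the word "h1" in plain text
    tag_match = '<h1' in line   # only matches the start of an actual tag
    print(repr(line), bare_match, tag_match)
```

The first line contains "h1" only as ordinary text, so the bare check matches it while the check with the leading < does not.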

Edit: In reference to your comment, OP, here is a simple way to print the strings between the h-tags, based on the code you posted to Pastebin:

from urllib.request import urlopen
address = "http://www.w3schools.com/html/html_head.asp"
webPage = urlopen (address)

encoding = "utf-8"

headers = []

for line in webPage:
    findHeader = ('<h1>', '<h2>', '<h3>', '<h4>', '<h5>', '<h6>')
    line = str(line, encoding)
    for startHeader in findHeader:
        endHeader = '</'+startHeader[1:]
        if (startHeader in line) and (endHeader in line):
            content = line.split(startHeader)[1].split(endHeader)[0]
            # Store the extracted text rather than the whole line,
            # so no fixed-width slicing is needed when printing.
            headers.append(content)


for h in headers:
    print(h.strip())

webPage.close()

How do I make my if statement work with this script?
