Question

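I'm using Python 2.7, and I've been assigned (self-directed; I wrote these instructions myself) to write a small static HTML generator. I'd like help finding beginner-oriented Python resources on reading a file one part at a time. If someone provides a code answer, that's great, but I want to understand why and how Python works the way it does. I can buy books, but not expensive ones: I can afford to put thirty or forty dollars into this particular piece of research.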

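The program works with a template.html file, a message.txt file, an image file, an archive.html file, and an output.html file. That's more information than you need, but my basic idea is "read back and forth between the template and the message, put their contents into the output, then record in the archive that the output exists". I'm not there yet, though, and I'm not asking you to solve the whole thing. What I'm asking about is detailed below: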

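The program reads the HTML from template.html, stops at the opening <title> tag, then reads the content of the page title from message.txt. That's where I am right now. It works! I was so happy... until a few hours ago, when I realized that wasn't the final boss.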

#doctype to title
copyLine = False
for line in template.readlines():
    if not '<title>' in line:
        copyLine = True
        if copyLine:
            outputhtml.write(line)
            copyLine = False
    else:
        templateSeek = template.tell()
        break

#read name of message
titleOut = message.readline()
print titleOut, " is the title of the new page"
#--------
##5. Put the title from the message file in the head>title tag of the output file
#--------
titleOut = str(titleOut)
titleTag = "<title>"+titleOut+"|Circuit Salsa</title>"
outputhtml.write(titleTag)

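My problem is this: I don't understand regular expressions, and when I try various forms of for ... in my code, I get all of the template, none of the template, or some combination of template parts I don't want. So how do I read back and forth between these files and pick up where I left off? Any help finding easier-to-understand resources is much appreciated. I've spent about five hours researching this and I have a headache, because I keep finding resources aimed at a more advanced audience and I don't understand them.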

Here are the last two approaches I tried (without success):

block = ""
found = False
print "0"
for line in template:
    if found:
        print "1"
        block += line
        if line.strip() == "<h1>": break
else:
    if line.strip() == "</title>":
        print "2"
        found = True
        block = "</title>"

print block + "3"

Only points 0 and 3 were printed. I put the numbered print statements in because I couldn't figure out why my output file wasn't changing.

template.seek(templateSeek)
copyLine = False
for line in template.readlines():
    if not '<a>' in line:
        copyLine = True
        if copyLine:
            outputhtml.write(line)
            copyLine = False
    else:
        templateSeek = template.tell()
        break

As for the other one, I'm pretty sure I'm just doing it wrong.

Answer 1

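I would use BeautifulSoup. An alternative is to use regular expressions, which are worth knowing anyway. I know they look scary, but they're actually not hard to learn (it took me an hour or so). For example, to get all the link tags, you could do something like this: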

from re import findall, DOTALL

html = '''
<!DOCTYPE html>
<html>

<head>
    <title>My awesome web page!</title>
</head>

<body>
    <h2>Sites I like</h2>
    <ul>
        <li><a href="https://www.google.com/">Google</a></li>
        <li><a href="https://www.facebook.com">Facebook</a></li>
        <li><a href="http://www.amazon.com">Amazon</a></li>
    </ul>

    <h2>My favorite foods</h2>
    <ol>
        <li>Pizza</li>
        <li>French Fries</li>
    </ol>
</body>

</html>
'''

def find_tag(src, tag):
    return findall(r'<{0}.*?>.*?</{0}>'.format(tag), src, DOTALL)

print find_tag(html, 'a')
# ['<a href="https://www.google.com/">Google</a>', '<a href="https://www.facebook.com">Facebook</a>', '<a href="http://www.amazon.com">Amazon</a>']
print find_tag(html, 'li')
# ['<li><a href="https://www.google.com/">Google</a></li>', '<li><a href="https://www.facebook.com">Facebook</a></li>', '<li><a href="http://www.amazon.com">Amazon</a></li>', '<li>Pizza</li>', '<li>French Fries</li>']
print find_tag(html, 'body')
# ['<body>\n    <h2>Sites I like</h2>\n    <ul>\n        <li><a href="https://www.google.com/">Google</a></li>\n        <li><a href="https://www.facebook.com">Facebook</a></li>\n        <li><a href="http://www.amazon.com">Amazon</a></li>\n    </ul>\n\n    <h2>My favorite foods</h2>\n    <ol>\n        <li>Pizza</li>\n        <li>French Fries</li>\n    </ol>\n</body>']

I hope you found at least some of this useful. If you have any follow-up questions, please comment on my answer. Good luck!
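
As a small follow-up in the same spirit, here is a rough sketch of how a capturing group can return just the text inside a tag rather than the whole tag, which may be closer to what a page title needs. The helper name tag_text and the sample string are made up for illustration, and a regex like this is only a reasonable fit for simple, well-formed templates.

```python
import re

# Hypothetical sample input, not the asker's real template.
sample = "<html><head>\n<title>My awesome web page!</title>\n</head></html>"

def tag_text(src, tag):
    """Return the text inside the first <tag>...</tag> pair, or None."""
    # \b stops 'a' from also matching tags such as <abbr>;
    # the parentheses capture only what sits between the tags.
    pattern = r'<{0}\b[^>]*>(.*?)</{0}>'.format(re.escape(tag))
    match = re.search(pattern, src, re.DOTALL)
    return match.group(1) if match else None

title = tag_text(sample, 'title')   # text of the first <title> element
missing = tag_text(sample, 'h1')    # no <h1> in the sample, so None
```

re.search stops at the first match, whereas findall in the example above collects every match.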

Answer 2

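In your first attempt, you have an indentation problem. The else clause is at the same indentation level as the for statement, so the two together form a single for: ... else: compound statement. New Python programmers are often confused by this. The else: clause of a for loop runs only once, after the loop has run to completion without hitting a break statement; it does not run on each pass. In your case found starts out False, so nothing inside if found: (including the break) ever runs, and because the else: clause sits outside the loop body, found is never set to True while the loop is still going. I think that if you indent the else: clause so it lines up with if found:, you will like the result. I also think you can drop the calls to strip() and use a substring test instead, for example if '</title>' in line:.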
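
To make the for/else behaviour concrete, here is a minimal sketch. The function names first_match and block_between and the sample lines are invented for the illustration. The second function is one possible corrected version of the loop, with else: paired with the if instead of the for; it starts the block with the whole matching line, which differs slightly from the original.

```python
def first_match(lines, needle):
    """Return (matching_line, else_ran) to show when for/else fires."""
    else_ran = False
    found_line = None
    for line in lines:
        if needle in line:
            found_line = line
            break            # leaving via break skips the else: clause
    else:
        else_ran = True      # runs once, only if the loop never hit break
    return found_line, else_ran


def block_between(lines, start, stop):
    """Collect lines from the one containing start through the one containing stop."""
    block = ""
    found = False
    for line in lines:
        if found:
            block += line
            if stop in line:
                break
        else:                # paired with `if found:`, so it is checked per line
            if start in line:
                found = True
                block = line
    return block


# Hypothetical template lines standing in for an open file.
lines = ["<head>\n", "<title>Old</title>\n", "</head>\n",
         "<body>\n", "<h1>Hi</h1>\n", "<p>x</p>\n"]

hit = first_match(lines, "</title>")
miss = first_match(lines, "<table>")
block = block_between(lines, "</title>", "<h1>")
```

The same loops work unchanged on an open file object, since iterating over a file yields one line at a time.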

I suspect you're right about the second function. It didn't work for me at all.
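
A possible explanation for the second attempt, offered as a sketch rather than a diagnosis of your exact files: readlines() reads the whole file before the loop body runs, so template.tell() inside the loop reports the end of the file, and a later seek() to that position leaves nothing to read. Reading one line at a time with readline() keeps the file position in step with the loop. The helper name copy_until and the sample template below are made up for illustration.

```python
import os
import tempfile

# Hypothetical template contents, written to a temp file for the demo.
TEMPLATE = (
    "<!DOCTYPE html>\n"
    "<html>\n"
    "<head>\n"
    "<title>placeholder</title>\n"
    "</head>\n"
    "<body>\n"
    "<h1>placeholder</h1>\n"
    "</body>\n"
    "</html>\n"
)


def copy_until(template, marker, out):
    """Append lines to out until one contains marker.

    Returns the marker line (which is not copied), or None at end of file.
    The file keeps its position, so the next call resumes right after it.
    """
    while True:
        line = template.readline()   # reads exactly one line
        if not line:                 # empty string means end of file
            return None
        if marker in line:
            return line
        out.append(line)


fd, path = tempfile.mkstemp(suffix=".html")
os.close(fd)
try:
    with open(path, "w") as f:
        f.write(TEMPLATE)

    output = []
    with open(path) as template:
        copy_until(template, "<title>", output)
        output.append("<title>New page | Circuit Salsa</title>\n")
        stopped_at = copy_until(template, "<h1>", output)
        resume_pos = template.tell()  # usable because only readline() was called
finally:
    os.remove(path)

result = "".join(output)
```

Because the same open file object is passed to both calls, the second call simply continues from where the first one stopped, with no seek() needed.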

Answer 3

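Late last night I came across a solution that works for what I'm trying to do. Learning regular expressions will be a very useful skill, and one I'll work on this summer, but regular expressions are a bit much for this particular application. I ended up using linecache to read specific lines, since the information I want from these files is separated by newlines.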
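
For anyone who finds this later, here is a minimal sketch of what the linecache approach might look like. It is not my actual program: the message file contents and the variable names are assumptions made up for the example.

```python
import linecache
import os
import tempfile

# Hypothetical message file: title on line 1, body text after it.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(path, "w") as f:
        f.write("My New Page\nFirst paragraph.\nSecond paragraph.\n")

    # linecache.getline uses 1-based line numbers and keeps the newline.
    title = linecache.getline(path, 1).rstrip("\n")
    third = linecache.getline(path, 3).rstrip("\n")
    # Asking for a line past the end returns '' instead of raising.
    past_end = linecache.getline(path, 99)

    title_tag = "<title>" + title + " | Circuit Salsa</title>"
finally:
    linecache.clearcache()   # drop the cached copy of the temp file
    os.remove(path)
```

This only helps when you know in advance which line number holds what, which happens to be true for a message file laid out one item per line.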

Reading part of a file, stopping and starting at certain words
