Question

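I have the following code, but I am getting an error. I am trying to extract the text between Tag1 and Tag2 from HTML files. Without the for loop the code works (for a single file), but when looping over a directory it does not.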

from bs4 import BeautifulSoup
from urllib import urlopen
import os
import bleach
import re
rootdir = mydirectory
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        url = file
        print url
        raw = urlopen(url).read()
        type(raw)
        Tag1 = raw.find("""<div class="song-text">""")
        Tag2 = raw.rfind("""<div style="text-align:center;padding-bottom:10px;">""")
        Cleaned = raw[Tag1+23:Tag2]
        print Cleaned

Error message:

Traceback (most recent call last):
  File "TestClean.py", line 12, in <module>
    raw = urlopen(url).read()
  File "/usr/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/usr/lib/python2.7/urllib.py", line 208, in open
    return getattr(self, name)(url)
  File "/usr/lib/python2.7/urllib.py", line 463, in open_file
    return self.open_local_file(url)
  File "/usr/lib/python2.7/urllib.py", line 477, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] No such file or directory: 'paroles-a-beautiful-lie.html'

Answer 1

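The error message means the file cannot be found. os.walk only yields the file names, not their full paths, so the bare name is resolved against the current working directory rather than the directory being walked.
1) Build the full path with path = os.path.join(subdir, file).
2) Read the file with open(path).read() instead of urlopen.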
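
The two steps above can be sketched as follows. This is only an illustration under a few assumptions: it is written for Python 3 (the question uses Python 2), the helper names `extract_between` and `clean_directory` are hypothetical, and the files are assumed to be readable as UTF-8. The two markers are copied from the question, and the offset is computed with `len(start_marker)` instead of the hard-coded 23.

```python
import os

# Markers copied from the question.
START_MARKER = '<div class="song-text">'
END_MARKER = '<div style="text-align:center;padding-bottom:10px;">'


def extract_between(raw, start_marker=START_MARKER, end_marker=END_MARKER):
    """Return the text between the first start_marker and the last end_marker.

    Returns None when either marker is missing or they are out of order.
    """
    start = raw.find(start_marker)
    end = raw.rfind(end_marker)
    if start == -1 or end == -1 or end < start + len(start_marker):
        return None
    return raw[start + len(start_marker):end]


def clean_directory(rootdir):
    """Map every file path under rootdir to its extracted text (hypothetical helper)."""
    results = {}
    for subdir, dirs, files in os.walk(rootdir):
        for name in files:
            # os.walk yields bare file names; join them with the directory
            # currently being walked to get a path that open() can find.
            path = os.path.join(subdir, name)
            # Assumption: the files are UTF-8; undecodable bytes are replaced.
            with open(path, encoding="utf-8", errors="replace") as handle:
                raw = handle.read()
            results[path] = extract_between(raw)
    return results
```

Calling `clean_directory(rootdir)` returns a dictionary keyed by full path, with `None` for any file that does not contain both markers, so a missing marker no longer produces a silently wrong slice.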

Answer 2

It is clear from the traceback that it cannot find the file 'paroles-a-beautiful-lie.html'. I suggest you go step by step.

Comment out the code below 'print url'.
Check whether you are getting the correct url.
Then move on to the next step: the find process.
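
The check suggested above can be sketched like this, assuming Python 3. The helper name `report_paths` is hypothetical. For each file, it records the bare name that os.walk yields, the joined path, and whether each one exists from the current working directory.

```python
import os


def report_paths(rootdir):
    """Return (name, joined_path, name_exists, joined_exists) for each file under rootdir."""
    rows = []
    for subdir, dirs, files in os.walk(rootdir):
        for name in files:
            joined = os.path.join(subdir, name)
            # os.path.exists(name) is resolved against the current working
            # directory, which is usually not the directory being walked.
            rows.append((name, joined, os.path.exists(name), os.path.exists(joined)))
    return rows
```

If the third element is False while the fourth is True for a given row, the bare file name is what triggers the IOError, and the joined path is the one to open.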

Cleaning HTML with Python
