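I am using os.walk to search a folder recursively for html files.
These html files contain strings. As os.walk builds the file list, I want to extract those strings with BeautifulSoup.
I tried the code below, but it does not work: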
import os
from bs4 import BeautifulSoup
for root, dirs, files in os.walk("mydir"):
    for file in files:
        if file.endswith(".html"):
            print(os.path.join(root, file))
            soup = BeautifulSoup(os.path.join(root, file), "html.parser")
            soup.find_all('a')
How can I use the list of files as input to BeautifulSoup (and print the output to a txt file)?
Answer 0 (score: 1)
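os.path.join returns the path to the file, not its contents, so BeautifulSoup ends up parsing the path string itself as markup. You need to open() the file and pass what you read from it.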
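import os
from bs4 import BeautifulSoup

# Walk "mydir" recursively and parse every .html file found.
for root, dirs, files in os.walk("mydir"):
    for file in files:
        if file.endswith(".html"):
            currentFile = os.path.join(root, file)
            print(currentFile)
            # BeautifulSoup needs the file's contents, not its path.
            with open(currentFile, 'r') as html:
                soup = BeautifulSoup(html.read(), "html.parser")
            # href=True skips <a> tags without an href, so link['href'] cannot raise KeyError.
            links = soup.find_all('a', href=True)
            for link in links:
                print(link['href'])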
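The question also asks about getting the output into a txt file, which the code above does not cover. Here is a minimal sketch of one way to do it. The helper names collect_links and write_links, the tab-separated line format, and the utf-8 encoding are my own assumptions, not something from the question, so adjust them to your data.

```python
import os
from bs4 import BeautifulSoup


def collect_links(top_dir):
    """Return (file path, href) pairs for every <a href=...> under top_dir."""
    results = []
    for root, dirs, files in os.walk(top_dir):
        for name in sorted(files):
            if not name.endswith(".html"):
                continue
            path = os.path.join(root, name)
            # utf-8 is an assumption; change it if your files use another encoding.
            with open(path, "r", encoding="utf-8") as html:
                soup = BeautifulSoup(html.read(), "html.parser")
            # href=True keeps only anchors that actually carry an href attribute.
            for link in soup.find_all("a", href=True):
                results.append((path, link["href"]))
    return results


def write_links(top_dir, out_path):
    """Write one 'path<TAB>href' line per link to out_path."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path, href in collect_links(top_dir):
            out.write(f"{path}\t{href}\n")
```

Calling write_links("mydir", "links.txt") should then produce one line per link, with the source file first and the href after the tab. If you only want the hrefs, drop the path from the write call.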