Question

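I'm using os.walk to recursively find HTML files in a folder.
These HTML files contain strings. As os.walk builds the list, I want to extract those strings with BeautifulSoup.
I tried the code below, but it doesn't work: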

import os 
from bs4 import BeautifulSoup
for root, dirs, files in os.walk ("mydir"):
    for file in files:
        if file.endswith (".html"):
           print(os.path.join(root, file))
soup = BeautifulSoup(os.path.join(root, file), "html.parser")
soup.find_all('a')

How can I use the list of files as input to BeautifulSoup (and write the output to a txt file)?

Answer 1

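os.path.join returns the path to the file, not its contents. BeautifulSoup therefore parses the path string itself as markup; you need to open() the file and pass in what it contains.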

import os
from bs4 import BeautifulSoup

for root, dirs, files in os.walk("mydir"):
    for file in files:
        if file.endswith(".html"):
            currentFile = os.path.join(root, file)
            print(currentFile)
            # Open the file and hand its contents, not its path, to BeautifulSoup.
            with open(currentFile, 'r') as html:
                soup = BeautifulSoup(html.read(), "html.parser")
            links = soup.find_all('a')
            for link in links:
                # link['href'] raises KeyError for <a> tags without an href.
                href = link.get('href')
                if href is not None:
                    print(href)
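
The question also asks about putting the output in a txt file. Here is a minimal sketch of one way to do that, not the only one. The function name collect_links and the output file name links.txt are hypothetical, and the sketch assumes the HTML files are UTF-8 encoded and that the strings you want are the href values of the <a> tags.

```python
import os
from bs4 import BeautifulSoup


def collect_links(root_dir, output_path):
    """Write the href of every <a> tag in .html files under root_dir to output_path.

    collect_links is a hypothetical helper name used only for this sketch.
    """
    hrefs = []
    for root, dirs, files in os.walk(root_dir):
        for name in files:
            if name.endswith(".html"):
                path = os.path.join(root, name)
                # Assumption: the files are UTF-8 encoded.
                with open(path, "r", encoding="utf-8") as html:
                    soup = BeautifulSoup(html.read(), "html.parser")
                for link in soup.find_all("a"):
                    href = link.get("href")  # None when the anchor has no href
                    if href is not None:
                        hrefs.append(href)
    # Write one link per line to the text file.
    with open(output_path, "w", encoding="utf-8") as out:
        for href in hrefs:
            out.write(href + "\n")
    return hrefs
```

You would call it as collect_links("mydir", "links.txt"). Collecting the links in a list first keeps the output file open only for the final write; if the folder is very large, you could open the output file once before the loop and write each link as it is found instead.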

Python - Reusing a list of files as input
