Question

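I used a Python tool called "pdfminer" to convert a PDF file to an HTML file, and I want to scrape useful information from it. How can I use XPath and Beautiful Soup on an arbitrary local HTML file? I know how to use XPath and Beautiful Soup on a web page when given a URL, like this: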

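import requests
from bs4 import BeautifulSoup
from lxml import html

# get tree: fetch the page and parse it into an lxml tree for XPath queries
def get_tree(url):
    r = requests.get(url)
    tree = html.fromstring(r.content)
    return tree

# get soup: fetch the page and parse it with Beautiful Soup
def get_soup(url):
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly
    return soup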

If I only have an HTML file, can anyone give me some examples of how to use XPath and Beautiful Soup on it? Thanks.

Answer 1

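In the end, I found a solution by digging into the API and searching Google. With the HTML file as input, you can build the soup or the tree like this before querying it with Beautiful Soup or XPath: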

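from bs4 import BeautifulSoup
from lxml import etree

doc = open("output.html", "r").read()  # read the converted file once
soup = BeautifulSoup(doc, "html.parser")  # for Beautiful Soup queries
tree = etree.HTML(doc)  # for XPath queries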

Then you can work with the soup or the tree to scrape the content you need from the HTML file.
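As a rough sketch of what that scraping step might look like, here is a small self-contained example. The sample markup, the helper name `load_soup_and_tree`, and the XPath expressions are all assumptions made for illustration; real pdfminer output will have a different structure, so the queries would need adjusting.

```python
import os
import tempfile

from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical stand-in for a pdfminer-generated file; real output will differ.
SAMPLE_HTML = """<html><body>
<div><span style="font-size:12px">Invoice No: 1234</span></div>
<div><span style="font-size:12px">Total: 56.78</span></div>
<div><a href="https://example.com/terms">Terms</a></div>
</body></html>"""


def load_soup_and_tree(path):
    """Parse a local HTML file with both Beautiful Soup and lxml."""
    with open(path, "r", encoding="utf-8") as f:
        doc = f.read()
    return BeautifulSoup(doc, "html.parser"), etree.HTML(doc)


# Write the sample to a temporary output.html so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(SAMPLE_HTML)
    soup, tree = load_soup_and_tree(path)

# Beautiful Soup: collect the text of every <span>.
span_texts = [s.get_text(strip=True) for s in soup.find_all("span")]

# XPath: collect link targets, and the span whose text starts with "Total".
links = tree.xpath("//a/@href")
totals = tree.xpath("//span[starts-with(text(), 'Total')]/text()")

print(span_texts)
print(links)
print(totals)
```

Reading the file once and handing the same string to both parsers keeps the two views consistent; which of the two you query is mostly a matter of taste.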

Using XPath and Beautiful Soup on an HTML file converted from PDF
