Question

The content of index.html served by Apache is:

<html>

<head>
<title>Title</title><meta charset="utf8">
</head>

<body>
<p id="show_p">{ "Java": "ad5aedf87c4f591aa51e02daaea31717ee0798cf-40", "Python": "b6525442fc002ca1ea255e90286ab57afd1c952a-12", "Shell": "12d6180f298ab6419c34d6543aca593d81ec446e-10", "JavaScript": "b6525442fc002ca1ea255e90286ab57afd1c952a-13", "C": "6ad83ed9f599a8c9c967ef2f7168127f8dee28f6-229" }</p>
<pre id="out_pre"></pre>

</body>

<script type="text/javascript">

var text = document.getElementById('show_p').innerText;

document.getElementById('show_p').innerText = ''

var result = JSON.stringify(JSON.parse(text), null, 2);

document.getElementById('out_pre').innerText= result ;

</script>

</html>

The data inside index.html is currently in JSON format.

My Python script looks like this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2
import json
import os

req = urllib2.Request('http://127.0.0.1')
response = urllib2.urlopen(req)
the_page = response.read()
print (the_page)
dictionfo = json.loads(the_page)

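print(the_page) prints the entire content of index.html, so json.loads(the_page) is handed the whole HTML document rather than just the JSON.

For now I only want to get the content of the body in index.html. How should I write the Python script?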

Answer 1

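You can use the Beautiful Soup library.

Add these lines before the json.loads call. Note that soup.p is the first <p> element in the page, which here is the one holding the JSON.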

from bs4 import BeautifulSoup
soup = BeautifulSoup(the_page, 'html.parser')
the_page = soup.p.text
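
If installing Beautiful Soup is not an option, roughly the same idea can be sketched with the standard library's html.parser module instead. This is not the Beautiful Soup approach above, only a stand-in for it, written for Python 3. The names TextById and extract_json are made up for this example, and PAGE is a trimmed copy of the index.html from the question. It selects the element by its id (show_p) rather than taking the first <p>.

```python
import json
from html.parser import HTMLParser

# Trimmed copy of the index.html from the question (two of the five entries).
PAGE = """<html>
<head><title>Title</title><meta charset="utf8"></head>
<body>
<p id="show_p">{ "Java": "ad5aedf87c4f591aa51e02daaea31717ee0798cf-40", "Python": "b6525442fc002ca1ea255e90286ab57afd1c952a-12" }</p>
<pre id="out_pre"></pre>
</body>
</html>"""


class TextById(HTMLParser):
    """Collect the text inside the first element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.target_tag = None  # tag name of the matched element, once found
        self.depth = 0          # open tags of that name inside the match
        self.done = False       # True once the matched element has closed
        self.chunks = []        # text pieces seen inside the match

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.target_tag is None:
            if dict(attrs).get("id") == self.target_id:
                self.target_tag = tag
                self.depth = 1
        elif tag == self.target_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.target_tag is None or self.done or tag != self.target_tag:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.target_tag is not None and not self.done:
            self.chunks.append(data)


def extract_json(page, element_id="show_p"):
    """Return the JSON stored as text in the element with the given id."""
    parser = TextById(element_id)
    parser.feed(page)
    parser.close()
    return json.loads("".join(parser.chunks))


info = extract_json(PAGE)
print(info["Python"])
```

With a real response in Python 3, response.read() returns bytes, so decode it first, for example extract_json(the_page.decode('utf-8')). If you do stay with Beautiful Soup, soup.find(id='show_p').text selects by id in the same way.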

Answer 2

A suggestion:

from lxml import html, etree

# page loading ...

doc = html.fromstring(the_page) # parse the page to html object
print(etree.tostring(doc.body)) # printing the body

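With this, you can access different parts of the page as attributes, or select target elements with XPath. For example:

doc.xpath('./body/div')

returns a list of all div elements that are direct children of the body element. To read an attribute of one of these objects, use get('<attributeName>').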
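
To make the attribute access, the XPath selection, and get() concrete, here is a small sketch that runs on its own, assuming lxml is installed and Python 3. PAGE is a made-up sample page, not the one from the question, and the class names in it are only for illustration.

```python
from lxml import etree, html

# Hypothetical sample page, used only to demonstrate the calls.
PAGE = """<html>
<head><title>Title</title></head>
<body>
<div class="first">one</div>
<div class="second"><div class="nested">two</div></div>
<p id="show_p">text</p>
</body>
</html>"""

doc = html.fromstring(PAGE)

# The body element serialized back to markup (bytes in Python 3).
body_markup = etree.tostring(doc.body)

# Only divs that are direct children of body; the nested div is not matched.
top_level_divs = doc.xpath('./body/div')
classes = [div.get('class') for div in top_level_divs]
print(classes)

# Select a single element by its id attribute and read its text.
paragraph = doc.xpath('//p[@id="show_p"]')[0]
print(paragraph.get('id'), paragraph.text_content())
```

For the page in the question, the same //p[@id="show_p"] lookup followed by text_content() should give the string to pass to json.loads.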

How to get Apache's home page data using a Python script?
