Web scraping with Python and Beautiful Soup

Time: 2018-06-18 07:42:25

Tags: python-3.x web-scraping beautifulsoup

This is the HTML code:

<html>
<head></head>
<body>
<pre style="word-wrap: break-word; white-space: pre-wrap;">
"{"Title":"One Indian Girl","Year":"2016","Author":"Chetan Bhagat"}"
</pre>
</body>
</html>

I need to scrape only the data I need from it. For example, I only need the author's name.

3 answers:

Answer 0: (score: 0)

Strip the tags and convert the JSON string to a Python dict:

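import json
from bs4 import BeautifulSoup

# html is the markup from the question
soup = BeautifulSoup(html, "html.parser")
# Drop surrounding whitespace, then the double quotes wrapping the JSON object
text = soup.get_text().strip().strip('"')
d = json.loads(text)
print(d['Author'])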
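
The snippet above calls get_text() on the whole document, so any other text on the page would end up in the string passed to json.loads. A minimal self-contained sketch that reads only the <pre> element is shown below; the names HTML and extract_record are illustrative and not part of the original answer.

```python
import json
from bs4 import BeautifulSoup

# The markup from the question, inlined so the sketch runs on its own.
HTML = """<html>
<head></head>
<body>
<pre style="word-wrap: break-word; white-space: pre-wrap;">
"{"Title":"One Indian Girl","Year":"2016","Author":"Chetan Bhagat"}"
</pre>
</body>
</html>"""


def extract_record(html):
    """Return the JSON object embedded in the first <pre> tag as a dict."""
    soup = BeautifulSoup(html, "html.parser")
    pre = soup.find("pre")
    if pre is None:
        raise ValueError("no <pre> element found")
    # Drop surrounding whitespace first, then the wrapping double quotes.
    raw = pre.get_text().strip().strip('"')
    return json.loads(raw)


record = extract_record(HTML)
print(record["Author"])
```

Stripping whitespace and quotes is less brittle than slicing at fixed offsets, because it does not depend on exactly one newline surrounding the text.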

Answer 1: (score: 0)

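@vijay print json.loads(soup.find("pre").string[2:-2])["Author"] will do the job. Take a look at the following code, executed in the Python interactive terminal. (The session was recorded under Python 2, hence the u'' prefixes and <type 'dict'>; in Python 3, write print(...) with parentheses.)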

>>> import json
>>> import requests
>>> from bs4 import BeautifulSoup
>>>
>>> html_text = """<html>
... <head></head>
... <body>
... <pre style="word-wrap: break-word; white-space: pre-wrap;">
... "{"Title":"One Indian Girl","Year":"2016","Author":"Chetan Bhagat"}"
... </pre>
... </body>
... </html>"""
>>>
>>> soup = BeautifulSoup(html_text, "html.parser")
>>> print(soup.prettify())
<html>
 <head>
 </head>
 <body>
  <pre style="word-wrap: break-word; white-space: pre-wrap;">
"{"Title":"One Indian Girl","Year":"2016","Author":"Chetan Bhagat"}"
</pre>
 </body>
</html>
>>>
>>> print(soup.find("pre"))
<pre style="word-wrap: break-word; white-space: pre-wrap;">
"{"Title":"One Indian Girl","Year":"2016","Author":"Chetan Bhagat"}"
</pre>
>>>
>>> print(soup.find("pre").string)

"{"Title":"One Indian Girl","Year":"2016","Author":"Chetan Bhagat"}"

>>> print(soup.find("pre").string[2:-2])
{"Title":"One Indian Girl","Year":"2016","Author":"Chetan Bhagat"}
>>>
>>> d = json.loads(soup.find("pre").string[2:-2])
>>> type(d)
<type 'dict'>
>>>
>>> d
{u'Author': u'Chetan Bhagat', u'Year': u'2016', u'Title': u'One Indian Girl'}
>>>
>>> d["Author"]
u'Chetan Bhagat'
>>>
>>> d["Year"]
u'2016'
>>>
>>> d["Title"]
u'One Indian Girl'
>>>
>>> # Place all in the list
...
>>> l = [d["Title"], d["Year"], d["Author"]]
>>> l
[u'One Indian Girl', u'2016', u'Chetan Bhagat']
>>>

» Get the data in a list without referring to the dictionary keys above.

>>> final_data = [str(a.strip().split(":")[1])  for  a in soup.find("pre").string[2:-3].replace('\"', '').split(",")]
>>>
>>> final_data
['One Indian Girl', '2016', 'Chetan Bhagat']
>>>

Let's break the one-liner above down and get the data step by step (update).

>>> data = soup.find("pre").string[2:-3]
>>> data
u'{"Title":"One Indian Girl","Year":"2016","Author":"Chetan Bhagat"'
>>>
>>> data = data.replace('\"', '')
>>> data
u'{Title:One Indian Girl,Year:2016,Author:Chetan Bhagat'
>>>
>>> arr = data.split(",")
>>> arr
[u'{Title:One Indian Girl', u'Year:2016', u'Author:Chetan Bhagat']
>>>
>>> final_data = [str(a.strip().split(":")[1])  for  a in arr]
>>> final_data
['One Indian Girl', '2016', 'Chetan Bhagat']
>>>
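
The split-based extraction above works for this record, but it assumes no value contains a comma or a colon. The sketch below, written for Python 3 and using a hypothetical record (the title and author are made up for illustration), compares it with parsing the string as JSON.

```python
import json

# Hypothetical record whose title contains a colon.
raw = '{"Title":"Python: Tips","Year":"2016","Author":"Jane Doe"}'

# Split-based approach from the session above: drop the closing brace,
# remove the quotes, split on commas, keep what follows the first colon.
naive = [
    part.strip().split(":")[1]
    for part in raw[:-1].replace('"', "").split(",")
]

# Parsing as JSON keeps every value intact; dicts preserve insertion
# order in Python 3.7+, so the values come back in document order.
parsed = list(json.loads(raw).values())

print(naive)   # the title is cut off at its colon
print(parsed)  # the title is complete
```

A comma inside a value is worse for the split-based version: it produces a fragment with no colon, so indexing [1] raises IndexError.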

Answer 2: (score: 0)

This is what I wanted.

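import json
from bs4 import BeautifulSoup as soup  # alias assumed from the call below

# page_html holds the HTML shown in the question
exampleSoup = soup(page_html, 'html.parser')
text = exampleSoup.get_text().strip().strip('"')
elems = json.loads(text)
Details = list(elems.values())
for i in Details:
    print(i)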

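elems gives us a dictionary.

I collected the values of the dictionary's key-value pairs into Details.

The for loop is used to get each element separately.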
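
If the field names are wanted next to the values, the same dictionary can be walked with items() instead of values(). This is a small sketch assuming the JSON text has already been extracted as in the code above; the variable lines is illustrative.

```python
import json

# The JSON text as it looks after stripping the tags and outer quotes.
text = '{"Title":"One Indian Girl","Year":"2016","Author":"Chetan Bhagat"}'
elems = json.loads(text)

# Pair each key with its value so every printed line is labelled.
lines = [f"{key}: {value}" for key, value in elems.items()]
for line in lines:
    print(line)
```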