Question

I have a JSON value from which I need to strip all HTML tags.

After using the following function:

import json
from urllib.request import urlopen

def payloaded():
    # Placeholder URL from the question; urlopen needs a scheme (e.g. "https://") on a real URL.
    with urlopen("www.example.com/payload.json") as r:
        data = json.loads(r.read().decode(r.headers.get_content_charset("utf-8")))
    text = data["body"]["und"][0]["value"]
    return text

This is what is returned (text):

&lt;div class=&quot;blah&quot;&gt;'<p>This is the text.</p>\r\n'

This is the original (text):

<div class="blah"><p>This is the text.</p>

Note: I need to strip all HTML tags, and there are no real guidelines on which tags I will be getting.

This is what I want (text):

This is the text.

This is the post function I'm using:

import json
import requests

def add_node_basic(text):
    url = "www.example.com"  # placeholder; requests needs a scheme on a real URL
    headers = {"content-type": "application/json"}
    payload = {
        "auth_token": x,  # x and y are assumed to be defined elsewhere
        "docs": {
            "id": y,
            "fields": [
                {"name": "body", "value": text, "type": "text"},
            ],
        },
    }

    r = requests.post(url, data=json.dumps(payload), headers=headers)

Any suggestions on how to accomplish this would be greatly appreciated!

Answer 1

You can try slicing the string together with the find method, like this:

>>> print(text[text.find('<p>') + len('<p>'):text.find('</p>')])
This is the text.
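
One caveat with the slicing approach: str.find returns -1 when the tag is absent, which silently produces a wrong slice. Below is a minimal sketch of a guarded version. The helper name extract_first_paragraph is hypothetical, and it assumes a single bare <p>...</p> pair with no attributes.

```python
def extract_first_paragraph(text):
    """Return the text between the first <p> and the next </p>, or None."""
    start = text.find("<p>")
    if start == -1:
        return None  # no opening tag found
    start += len("<p>")
    end = text.find("</p>", start)
    if end == -1:
        return None  # no closing tag after the opening one
    return text[start:end].strip()


print(extract_first_paragraph('<div class="blah"><p>This is the text.</p>\r\n'))
```

Since the question says the incoming tags are unpredictable, this is probably only useful when the markup is known to be this simple.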

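If you are trying to extract only the text from HTML source, you can use Python's HTMLParser class (in the html.parser module in Python 3). For example: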

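from html.parser import HTMLParser  # Python 3; Python 2 used "from HTMLParser import HTMLParser"

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()  # sets up parser state (calls reset() internally)
        self.fed = []
    def handle_data(self, d):
        # Called for each run of text found between tags
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()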
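
Note that the value returned in the question has some of its tags entity-encoded (&lt;div ...&gt;), so a parser would see them as plain text rather than tags. Here is a minimal Python 3 sketch, assuming it is acceptable to decode the entities first with html.unescape and then strip the tags. The names TagStripper and clean_html are hypothetical, not from the question.

```python
from html import unescape
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text content, discarding every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def get_text(self):
        return "".join(self.parts)


def clean_html(raw):
    """Decode entities such as &lt; and &quot;, then strip all tags."""
    stripper = TagStripper()
    stripper.feed(unescape(raw))
    stripper.close()  # flush any buffered trailing text
    return stripper.get_text().strip()


raw = '&lt;div class=&quot;blah&quot;&gt;<p>This is the text.</p>\r\n'
print(clean_html(raw))  # prints: This is the text.
```

The cleaned string could then be passed along as add_node_basic(clean_html(payloaded())). One trade-off: decoding first means a literal "&lt;" that was meant to stay as a less-than sign in the text will also be treated as the start of a tag.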

Get a JSON payload, then strip out the HTML with Python
