Using the urllib module, I am trying to scrape the text content of a web page. I followed the guide by "SentDex" on YouTube (https://www.youtube.com/watch?v=GEshegZzt3M) together with the documentation on the official Python site to piece together a quick solution. The data that comes back contains a lot of HTML markup and special characters that I want to remove. My end result works, but it feels like a hard-coded solution that is only useful for this one scenario.
The code is as follows:
import re
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

url = "http://someUrl.com/dir/doc.html" #Target URL
values = {'s': 'basics',
          'submit': 'search'} #Parameters to send with the request
data = urllib.parse.urlencode(values) #Serialize the dict as a query string
data = data.encode('utf-8') #urlopen expects POST data as bytes, so encode as UTF-8
req = urllib.request.Request(url, data) #Build a POST request carrying those parameters
resp = urllib.request.urlopen(req) #Send the request and get the response
respData = resp.read() #Read the response body into a variable
#BS4 method
soup = BeautifulSoup(respData, 'html.parser')
text = soup.find_all("p") #Returns a list of Tag objects, not plain strings
#end BS4
#re method
text = re.findall(r"<p>(.*?)</p>", str(respData)) #Capture the contents of every <p> tag
text = str(text) #Flatten the list into a single string
#end re
conds = ["<b>","</b>","<i>","</i>","\\","[","]","\'"] #things to remove from text
for case in conds: #for each of those things
    text = text.replace(case, "") #remove the string, i.e. replace it with nothing
Is there a more efficient way to achieve the end goal of removing all "markup" from a string, rather than explicitly defining every condition?
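For comparison, one generic approach I have come across (a sketch, not necessarily the best answer): since the code already builds a BeautifulSoup object, soup.get_text() strips every tag in one call. The same idea using only the standard library's html.parser would look roughly like this:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes of a document, ignoring every tag."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called only for text between tags, never for the tags themselves
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)

extractor = TextExtractor()
extractor.feed("<p>Hello <b>bold</b> and <i>italic</i> text.</p>")
print(extractor.text())  # Hello bold and italic text.
```

This removes any tag without a hand-maintained list of conditions, which seems closer to what I am after.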