Using the urllib module, I am trying to scrape the text content of a web page. I followed the guide by "SentDex" on YouTube (https://www.youtube.com/watch?v=GEshegZzt3M) together with the documentation on the official Python site to piece together a quick solution. The data that comes back contains a lot of HTML markup and special characters that I want to remove. My end result works, but it feels like a hard-coded solution that is only useful for this one scenario.
The code is as follows:
import re
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

url = "http://someUrl.com/dir/doc.html" #Target URL
values = {'s': 'basics',
          'submit': 'search'} #Parameters to send with the request
data = urllib.parse.urlencode(values) #Serialize the dict as a query string
data = data.encode('utf-8') #urlopen expects POST data as bytes, so encode as UTF-8
req = urllib.request.Request(url, data) #Build a POST request carrying those parameters
resp = urllib.request.urlopen(req) #Send the request and get the response
respData = resp.read() #Read the response body into a variable
#BS4 method
soup = BeautifulSoup(respData, 'html.parser')
text = soup.find_all("p") #Returns a list of Tag objects, not plain strings
#end BS4
#re method
text = re.findall(r"<p>(.*?)</p>", str(respData)) #Capture the contents of every <p> tag
text = str(text) #Flatten the list into a single string
#end re
conds = ["<b>","</b>","<i>","</i>","\\","[","]","\'"] #things to remove from text
for case in conds: #for each of those things
    text = text.replace(case, "") #remove the string, i.e. replace it with nothing
Is there a more efficient way to achieve the end goal of removing all "markup" from a string, rather than explicitly defining every condition?
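For comparison, one generic approach I have come across (a sketch, not necessarily the best answer): since the code already builds a BeautifulSoup object, soup.get_text() strips every tag in one call. The same idea using only the standard library's html.parser would look roughly like this:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes of a document, ignoring every tag."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called only for text between tags, never for the tags themselves
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)

extractor = TextExtractor()
extractor.feed("<p>Hello <b>bold</b> and <i>italic</i> text.</p>")
print(extractor.text())  # Hello bold and italic text.
```

This removes any tag without a hand-maintained list of conditions, which seems closer to what I am after.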