Question

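I'm using urllib with Python 3 to scrape some data from a music lyrics site. The scraping itself works, but the output also contains characters I don't want, and I'd like to get rid of them. I'm using HTMLParser, by the way.

I've already tried regular expressions, but they didn't do what I wanted. I'm fairly sure the problem is in the parser class I wrote.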

from urllib.request import urlopen
from html.parser import HTMLParser
link = urlopen("https://www.azlyrics.com/lyrics/lilboom/fucktaylorswift.html").read()
link = str(link)

class MyHTMLParser(HTMLParser): 
    def __init__(self):
        super().__init__()
        self.p=False
        self.pbuf=[]
    def handle_starttag(self, tag, attrs): 
        if(tag=="div"):
            self.p=True
            self.pbuf=[]
    def handle_endtag(self, tag): 
        if(tag=="div"):
            self.p=False
            print("".join(self.pbuf),flush=1)
    def handle_data(self, data): 
        if(self.p):
            data=data.replace("\\n","\n")
            data=data.replace("\\","")
            self.pbuf.append(data)


parser = MyHTMLParser()
parser.feed(link)

The expected output should not contain those unnecessary characters at the start.

Answer 1

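I made a few small changes to the parser:

Only parse <div> tags that have no attributes.
In handle_endtag(), added an extra check so that it prints only when self.p is set, and then resets self.pbuf.
Use a regular expression to remove the stray r characters.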
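The stray r characters come from calling str() on the bytes returned by read(): that produces the bytes repr, in which a carriage return appears as the two literal characters backslash and r. A minimal sketch of the effect, using a made-up byte string in place of the real page (the sample content and the UTF-8 encoding are assumptions):

```python
# Hypothetical sample standing in for urlopen(...).read()
raw = b"Yeah man\r\nSecond line\r\n"

# str() on bytes gives the repr, so escapes become literal text
as_repr = str(raw)

# The question's cleanup: literal "\n" becomes a real newline ...
step1 = as_repr.replace("\\n", "\n")
# ... and dropping the remaining backslashes leaves a bare "r" behind
step2 = step1.replace("\\", "")

# Decoding instead yields real text with real line endings (UTF-8 assumed)
decoded = raw.decode("utf-8")

print(repr(step2))
print(decoded.splitlines())
```

This is why the regular expression is needed only as a cleanup step after str(); text obtained with decode() would not contain the leftover r in the first place.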

The script:

import re
from urllib.request import urlopen
from html.parser import HTMLParser
link = urlopen("https://www.azlyrics.com/lyrics/lilboom/fucktaylorswift.html").read()
link = str(link)

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.p=False
        self.pbuf=[]
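    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Start collecting only inside a <div> that has no attributes.
        if tag=="div" and not attrs:
            self.p=True
            self.pbuf=[]
    def handle_endtag(self, tag):
        # Print only if a matching <div> was open, then reset the buffer.
        if tag=="div" and self.p:
            self.p=False
            print("\n".join(self.pbuf),flush=True)
            self.pbuf =[]
    def handle_data(self, data):
        if(self.p):
            # str(bytes) leaves escapes as literal text: turn the literal
            # backslash-n into a newline, drop the other backslashes, then
            # remove the lone "r" left over from each carriage return.
            data=data.replace("\\n","\n")
            data=data.replace("\\","")
            data = re.sub(r'\br\b', '', data)
            self.pbuf.append(data.strip())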


parser = MyHTMLParser()
parser.feed(link)

Prints:

Yeah man
...and so on.
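As a possible variation rather than part of the script above, the escape cleanup can be avoided by decoding the response bytes before feeding the parser. The sketch below keeps the same attribute-free <div> rule but collects blocks in a list instead of printing them. The class name LyricsParser, the sample HTML, and the UTF-8 encoding are all assumptions for illustration; it also does not handle nested <div> elements.

```python
from html.parser import HTMLParser


class LyricsParser(HTMLParser):  # hypothetical name
    def __init__(self):
        super().__init__()
        self.in_div = False   # True while inside an attribute-free <div>
        self.lines = []       # text lines of the current <div>
        self.blocks = []      # one joined string per finished <div>

    def handle_starttag(self, tag, attrs):
        # Same rule as the answer: only <div> tags without attributes
        if tag == "div" and not attrs:
            self.in_div = True
            self.lines = []

    def handle_endtag(self, tag):
        if tag == "div" and self.in_div:
            self.in_div = False
            self.blocks.append("\n".join(self.lines))

    def handle_data(self, data):
        if self.in_div:
            text = data.strip()  # also removes real \r and \n characters
            if text:
                self.lines.append(text)


# Made-up bytes standing in for urlopen(...).read()
sample = (b'<div class="ad">skip me</div>\r\n'
          b'<div>\r\nYeah man<br>\r\nSecond line\r\n</div>')

parser = LyricsParser()
parser.feed(sample.decode("utf-8"))  # decode instead of str(); UTF-8 assumed
print(parser.blocks[0])
```

With real data, the bytes from urlopen(...).read() would take the place of sample, and the encoding should be checked against the page rather than assumed.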
