Python regex is not returning what I'm looking for

Asked: 2016-02-06 15:19:15

Tags: python regex urllib

I am scraping a website and want to get the content inside a specific tag. The tag whose content I want is: <pre class="js-tab-content"></pre>

Here is my code:

import re
import urllib.request

request = urllib.request.Request(url=url)
response = urllib.request.urlopen(request)
content = response.read().decode()

tab = re.findall(r'<pre class="js-tab-content">(.*?)</pre>', content)

print(tab)

When I print tab, I get an empty list: []

Here is the content I am searching:

.... <pre class="js-tab-content"><i></i><span>Em</span>              <span>D</span>              <span>Em</span>             <span>D</span>

Lift Mac Cahir Og your face, brooding o'er the old disgrace 

     <span>Em</span>                  <span>D</span>                       <span>G</span>-<span>D</span>-<span>Em</span>     

That black Fitzwilliam stormed your place and drove you to the Fern.

<span>Em</span>              <span>D</span>           <span>Em</span>                         <span>D</span>

Gray said victory was sure, soon the firebrand he'd secure

<span>Em</span>                <span>D</span>          <span>G</span>-<span>D</span>-<span>Em</span>

Until he met at Glenmalure, Feach Mac Hugh O'Byrne 



Chorus:

<span>G</span>                                <span>D</span>

Curse and swear, Lord Kildare, Feach will do what Feach will dare

<span>G</span>                               <span>G</span>-<span>D</span>-<span>Em</span>

Now Fitzwilliam have a care, fallen is your star low

<span>G</span>                                       <span>D</span> 

Up with halbert, out with sword, on we go for by the Lord

<span>G</span>                               <span>G</span>-<span>D</span>-<span>Em</span>

Feach Mac Hugh has given his word: Follow me up to Carlow 



From Tassagart ____to Clonmore flows a stream of Saxon Gore

Great is Rory Og O'More at sending loons to Hades.

White is sick and Lane is fled, now for black Fitzwilliams head

We'll send it over, dripping red, to Liza and her ladies



See the swords of Glen Imayle flashing o'er the English Pale

See all the children of the Gael, beneath O'Byrne's banners

Rooster of the fighting stock, would you let an Saxon cock

Crow out upon an Irish rock, fly up and teach him manners

</pre> ....

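I don't understand why this returns an empty list instead of a list containing the content as a string.

I searched online for about half an hour and couldn't find anything that helped.

Sorry if this is a silly question with an obvious answer!

Anyway, thanks in advance!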

2 answers:

Answer 0 (score: 5)

OK, to add to the comments, here is how you can use the BeautifulSoup HTML parser to extract the text of the pre element in this case:

from bs4 import BeautifulSoup

soup = BeautifulSoup(content, "html.parser")
print(soup.find("pre", class_="js-tab-content").get_text())
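If installing BeautifulSoup is not an option, roughly the same extraction can be sketched with the standard library's html.parser module. This is only a minimal sketch, not part of the answer above: the class name PreTextExtractor and the short sample string are made up for illustration, and the real page would be passed in through content instead.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Hypothetical helper: collects the text inside <pre class="js-tab-content">."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside the target <pre>
        self.chunks = []  # text fragments found inside it

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested <pre> tags so the matching close is found.
            if tag == "pre":
                self.depth += 1
        elif tag == "pre":
            classes = (dict(attrs).get("class") or "").split()
            if "js-tab-content" in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "pre":
            self.depth -= 1

    def handle_data(self, data):
        # Nested tags such as <span> are dropped; only their text is kept.
        if self.depth:
            self.chunks.append(data)


# Made-up sample standing in for the downloaded page.
sample = '<div><pre class="js-tab-content"><i></i><span>Em</span>   <span>D</span>\nLift Mac Cahir Og your face\n</pre></div>'

parser = PreTextExtractor()
parser.feed(sample)
parser.close()
pre_text = "".join(parser.chunks)
print(pre_text)
```

Like get_text(), this keeps the chord names and lyrics but discards the span markup, and it is not affected by newlines inside the element.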

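Answer 1 (score: 2)

Use

tab = re.findall(r'<pre class="js-tab-content">(.*?)</pre>', content, re.S)

You need re.S so that . also matches newline characters. Without it, . stops at the first line break, so the pattern can never reach </pre> when the content spans several lines.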
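To see the effect of re.S in isolation, here is a small self-contained comparison. The sample string is a shortened, made-up stand-in for the page content; the pattern is the one from the question.

```python
import re

# Shortened stand-in for the page: the <pre> body spans several lines.
sample = '<pre class="js-tab-content"><span>Em</span>\nLift Mac Cahir Og your face\n</pre>'
pattern = r'<pre class="js-tab-content">(.*?)</pre>'

# Default behaviour: "." does not match "\n", so nothing is found.
without_flag = re.findall(pattern, sample)

# With re.S (alias of re.DOTALL), "." matches newlines as well.
with_flag = re.findall(pattern, sample, re.S)

print(without_flag)  # []
print(with_flag)
```

Note that the captured string still contains the span tags, so a regex-based approach would need a further clean-up step if only the text is wanted.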