BeautifulSoup: text outside of tags

Time: 2016-06-16 03:20:26

Tags: python parsing beautifulsoup screen-scraping

I'm trying to grab all of Kramer's lines from every episode of Seinfeld on this site:

http://www.imsdb.com/TV/Seinfeld.html

I've put the list of episode names into a file I named episode-list.txt.

I'm now trying to parse the lines that follow KRAMER, but they seem to sit outside of any tag, and that is where I'm stuck. See here -> http://www.imsdb.com/transcripts/Seinfeld-Good-News,-Bad-News.html

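Below is the code I'm trying to run with BeautifulSoup. Any pointers would be much appreciated. Unsolicited advice is hereby solicited too, haha: if anything I'm doing strikes you as clumsy or barbaric coding, I'd love the feedback.

Cheers!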

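from bs4 import BeautifulSoup  # bs4 is the maintained successor to the old BeautifulSoup 3 package
import requests

# episode-list.txt holds one episode name per line
with open("episode-list.txt", "r") as episodes:
    for line in episodes:
        # Build the transcript URL: spaces in the episode name become hyphens
        url = "http://www.imsdb.com/transcripts/Seinfeld-" + line.strip().replace(" ", "-") + ".html"
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "html.parser")
        # '???' is the part I can't work out: Kramer's lines are not inside a tag
        print(soup.find_all('???'))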

1 answer:

Answer 0 (score: 2)

Here is a code snippet you can use as a reference to get started...

import re
from bs4 import BeautifulSoup

html = """
<b>                             KRAMER
</b>               (enters) Are you up?

<b>               
</b><b>                             JERRY
</b>               (To Kramer) Yeah...(in the phone) Yeah, 
               people do move! Have you ever seen the 
               big trucks out on the street? Yeah, 
               no problem (hangs up the phone).
<b> 
</b><b>               
</b><b>                             KRAMER
</b>               Boy, the Mets blew it tonight, huh?
"""

soup = BeautifulSoup(html, 'html.parser')
# The dialogue is the bare text node that follows each <b>KRAMER</b> tag
for kramer in soup.find_all('b', string=re.compile(r"\s+KRAMER\s+")):
    print(kramer.next_sibling.strip())

The output will be:

(enters) Are you up?
Boy, the Mets blew it tonight, huh?
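
If it helps, the same idea can be wrapped in a small function. This is only a sketch under a few assumptions: `lines_for` and `sample` are names made up for illustration, the sample markup is a trimmed copy of the excerpt above, and a real transcript page may be laid out differently. It also assumes a reasonably recent bs4, where `find_all` accepts the `string` argument. Compared with the snippet above, it matches the speaker cue exactly, checks that the next sibling really is a text node, and collapses the hard line breaks in multi-line speeches.

```python
import re
from bs4 import BeautifulSoup, NavigableString


def lines_for(html, speaker):
    """Return the dialogue that follows each <b>SPEAKER</b> cue in html."""
    soup = BeautifulSoup(html, "html.parser")
    # Match the cue exactly, ignoring the padding whitespace around the name.
    cue = re.compile(r"^\s*" + re.escape(speaker) + r"\s*$")
    lines = []
    for tag in soup.find_all("b", string=cue):
        sibling = tag.next_sibling
        # The dialogue is a bare text node after </b>, not an element.
        if isinstance(sibling, NavigableString):
            # Collapse the transcript's line breaks and indentation.
            text = " ".join(sibling.split())
            if text:
                lines.append(text)
    return lines


# Hypothetical sample, trimmed from the excerpt above.
sample = """
<b>                             JERRY
</b>               (To Kramer) Yeah...(in the phone) Yeah, 
               people do move! Have you ever seen the 
               big trucks out on the street?
<b>               
</b><b>                             KRAMER
</b>               Boy, the Mets blew it tonight, huh?
"""

print(lines_for(sample, "KRAMER"))
print(lines_for(sample, "JERRY"))
```

In the loop from the question, something like `lines_for(r.content, "KRAMER")` could then stand in for the `'???'` lookup. Note that this only picks up the first text node after each cue, so a speech that the page splits with further tags would come back truncated.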