I'm trying to grab all of Kramer's lines from every Seinfeld episode on this site:
http://www.imsdb.com/TV/Seinfeld.html
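I've put the list of episodes into a file I named episode-list.txt.
I'm now trying to parse the lines that follow KRAMER, but they appear to sit outside the tags, which is the part I can't figure out. See here -> http://www.imsdb.com/transcripts/Seinfeld-Good-News,-Bad-News.html
Below is the code I'm trying to run with BeautifulSoup. Any leads would be much appreciated. Also, unsolicited advice is hereby solicited, haha. If anything I'm doing strikes you as clumsy or barbaric coding, I'd love the feedback.
Cheers!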
from BeautifulSoup import BeautifulSoup
import requests

text = open("episode-list.txt", "r")
for line in text.readlines():
    url = "http://www.imsdb.com/transcripts/Seinfeld-" + line.strip('\n').replace(" ", "-") + ".html"
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    for tag in soup:
        print soup.findAll('???')
Answer 0 (score: 2)
Here's a code snippet you can use as a reference to get started...
import re
from bs4 import BeautifulSoup
html = """
<b> KRAMER
</b> (enters) Are you up?
<b>
</b><b> JERRY
</b> (To Kramer) Yeah...(in the phone) Yeah,
people do move! Have you ever seen the
big trucks out on the street? Yeah,
no problem (hangs up the phone).
<b>
</b><b>
</b><b> KRAMER
</b> Boy, the Mets blew it tonight, huh?
"""
soup = BeautifulSoup(html, 'html.parser')
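# Find every <b> tag whose text is the speaker name KRAMER, then print
# the text node that follows it (the dialogue sits outside the tag).
for kramer in soup.find_all('b', text=re.compile(r"\s+KRAMER\s+")):
    print(kramer.next_sibling.strip())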
The output will be...
(enters) Are you up?
Boy, the Mets blew it tonight, huh?
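
To tie this back to the loop in the question, here is a rough sketch of how the same idea could be wrapped in reusable functions. It is written for Python 3 with bs4. The helper names episode_url and speaker_lines are made up for illustration, and it assumes every transcript page uses the same layout as the sample above (speaker name in a <b> tag, dialogue in the text node right after it), which I have not checked for every episode.

```python
import re
from bs4 import BeautifulSoup


def episode_url(title):
    """Build a transcript URL from an episode title, as the question's loop does."""
    return ("http://www.imsdb.com/transcripts/Seinfeld-"
            + title.strip().replace(" ", "-") + ".html")


def speaker_lines(html, speaker):
    """Return the speeches found right after each <b> tag naming `speaker`."""
    soup = BeautifulSoup(html, "html.parser")
    name = re.compile(r"^\s*" + re.escape(speaker) + r"\s*$")
    speeches = []
    for tag in soup.find_all("b"):
        if not name.match(tag.get_text()):
            continue
        sibling = tag.next_sibling
        # Dialogue is a plain text node; skip if a tag (or nothing) follows.
        if isinstance(sibling, str):
            # Collapse the line breaks inside a multi-line speech.
            speech = " ".join(sibling.split())
            if speech:
                speeches.append(speech)
    return speeches


# Small inline sample in the same shape as the transcript markup above.
sample = """
<b> KRAMER
</b> (enters) Are you up?
<b>
</b><b> JERRY
</b> (To Kramer) Yeah...(in the phone) Yeah,
people do move! Have you ever seen the
big trucks out on the street? Yeah,
no problem (hangs up the phone).
<b>
</b><b> KRAMER
</b> Boy, the Mets blew it tonight, huh?
"""

for speech in speaker_lines(sample, "KRAMER"):
    print(speech)
```

For the real pages you would loop over episode-list.txt, fetch episode_url(line) with requests.get, and pass r.content to speaker_lines. Opening the file with a with block also closes it for you.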