I want to extract a random poem from this book.
Using BeautifulSoup, I am able to find the title and the prose introduction.
print(soup.find('div', class_="pre_poem").text)
print(soup.find('table', class_="poem").text)
But I want to find all of the poems and pick one at random.
Should I use a regular expression and match everything between <h3> and </span></p>?
Answer 0 (score: 0)
Assuming you already have a suitable soup object to work with, the following might help you get started:
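import random

# Collect the href of every link in the table-of-contents lists.
poem_ids = []
for section in soup.find_all('ol', class_="TOC"):
    poem_ids.extend(li.find('a').get('href') for li in section.find_all('li'))

# Drop the last entry, skip empty hrefs, and strip the leading '#'.
poem_ids = [href[1:] for href in poem_ids[:-1] if href]
poem_id = random.choice(poem_ids)

# Walk forward from the poem's anchor, collecting text nodes until the next <h3>.
poem_start = soup.find('a', id=poem_id)
poem = poem_start.find_next()
poem_text = []

while True:
    poem = poem.next_element
    # Stop at the next heading, or at the end of the document.
    if poem is None or poem.name == 'h3':
        break
    if poem.name is None:
        poem_text.append(poem.string)

print('\n'.join(poem_text).replace('\n\n\n', '\n'))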
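This first extracts the list of poems from the table of contents at the top of the page. Each entry contains the unique ID of a poem. Next, a random ID is chosen, and the matching poem is extracted based on that ID.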
For example, if the first poem were selected, you would see the following output:
"The Arrow and the Song," by Longfellow (1807-82), is placed first in
this volume out of respect to a little girl of six years who used to
love to recite it to me. She knew many poems, but this was her
favourite.
I shot an arrow into the air,
It fell to earth, I knew not where;
For, so swiftly it flew, the sight
Could not follow it in its flight.
I breathed a song into the air,
It fell to earth, I knew not where;
For who has sight so keen and strong
That it can follow the flight of song?
Long, long afterward, in an oak
I found the arrow, still unbroke;
And the song, from beginning to end,
I found again in the heart of a friend.
Henry W. Longfellow.
This is done by using BeautifulSoup to extract all of the text from each element until the next <h3> tag is found, and then removing any extra line breaks.
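To try the approach without downloading the book, here is a minimal self-contained sketch of the same two steps (read the IDs from the table of contents, then collect text up to the next <h3>). The sample markup, the ID names, and the helper functions collect_poem_ids and text_until_next_heading are made up for illustration; the real page's structure may differ. Dropping the last table-of-contents entry mirrors the code above, presumably because that entry does not point at a poem.

```python
import random
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical miniature page: a TOC list followed by two poems and an index.
SAMPLE_HTML = (
    '<ol class="TOC">'
    '<li><a href="#poem_1">First Poem</a></li>'
    '<li><a href="#poem_2">Second Poem</a></li>'
    '<li><a href="#index">Index</a></li>'
    '</ol>'
    '<h3><a id="poem_1"></a>First Poem</h3>'
    '<p>Line one,<br/>line two.</p>'
    '<h3><a id="poem_2"></a>Second Poem</h3>'
    '<p>Another line.</p>'
    '<h3><a id="index"></a>Index</h3>'
)


def collect_poem_ids(soup):
    """Return the anchor ids listed in the TOC, without the leading '#'."""
    hrefs = []
    for section in soup.find_all('ol', class_='TOC'):
        for li in section.find_all('li'):
            link = li.find('a')
            if link is not None:
                hrefs.append(link.get('href'))
    # Drop the final entry and skip empty hrefs, as the answer's code does.
    return [href[1:] for href in hrefs[:-1] if href]


def text_until_next_heading(soup, anchor_id):
    """Collect non-empty text nodes after the anchor, up to the next <h3>."""
    node = soup.find('a', id=anchor_id).next_element
    pieces = []
    while node is not None:
        if isinstance(node, Tag) and node.name == 'h3':
            break
        if isinstance(node, NavigableString):
            text = str(node).strip()
            if text:
                pieces.append(text)
        node = node.next_element
    return pieces


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
poem_ids = collect_poem_ids(soup)
chosen = random.choice(poem_ids)
print('\n'.join(text_until_next_heading(soup, chosen)))
```

Because the anchor sits inside the heading in this sample, the title is collected along with the poem's lines. Checking for None in the loop also keeps the walk from failing when the last poem runs to the end of the document.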
Answer 1 (score: 0)
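Use an HTML document parser instead. It is safer in terms of unintended consequences.
The reason programmers generally discourage parsing HTML with regular expressions is that a page's HTML markup is not static, especially when the source is a live web page. Regular expressions are better suited to plain strings.
Use regular expressions at your own risk.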
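As a rough illustration of that risk, the sketch below compares a naive pattern with a parser on markup whose headings vary slightly. It uses Python's standard-library html.parser rather than BeautifulSoup only to stay dependency-free; the sample markup and the HeadingCollector class are invented for this example.

```python
import re
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text content of every <h3> element."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names and separates out attributes.
        if tag == 'h3':
            self.in_h3 = True
            self.headings.append('')

    def handle_endtag(self, tag):
        if tag == 'h3':
            self.in_h3 = False

    def handle_data(self, data):
        if self.in_h3:
            self.headings[-1] += data


# Hypothetical markup: the same heading written three slightly different ways.
markup = '<h3>First</h3><H3 class="title">Second</H3><h3 id="t3">Third</h3>'

# A literal pattern only matches the exact spelling it was written for.
naive = re.findall(r'<h3>(.*?)</h3>', markup)

parser = HeadingCollector()
parser.feed(markup)
parser.close()

print(naive)
print(parser.headings)
```

The pattern could be extended to handle case and attributes, but each new variation in the page needs another patch, whereas the parser already treats all three forms as the same element.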