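I'm working on a digital humanities project in which I'm trying to isolate image descriptions from a series of digitized prints. (I'm still fairly new to coding and programming in general, as I'm just a humble philosopher dabbling in DH.) So far, I've been able to retrieve the page source using Python and the urllib script shown below: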
import urllib.request
import urllib.parse
url = "http://pitts.emory.edu/dia/image_details.cfm?ID=17250"
f = urllib.request.urlopen(url)
print(f.read().decode('utf-8'))
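My problem, however, lies in the source code itself. The description sits in the same place as the other information, all of it broken up by P and b tags: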
</div>
<div class="col-sm-6">
<P>
<b>Book Title:</b>
<A HREF="book_detail.cfm?ID=2449">The Holy Bible containing the Old and New Testaments, according to the authorised version. With illustrations by Gustave Doré</a>
</p>
<P>
<b>Author:</b> Doré, Gustave, 1832-1883
</p>
<P>
<b>Image Title:</b> Baptism of Jesus
</p>
<P>
<b>Scripture Reference:</b><ul><li>John 1 (<a href='search.cfm?biblicalbook=John&biblicalbookchapter=1'>further images</a> / <a rel='shadowbox;height=500;width=600' href='http://www.commonenglishbible.com/explore/passage-lookup/?query=John+1'>scripture text</a>)</li></ul>
</p>
<P>
<b>Description:</b> John the Baptist baptizes Jesus in the Jordan River; the Holy Spirit appears overhead in the form of a dove. The artist, Gustave Doré (1832-1883), has placed his signature at the lower left of the woodcut, and the engraver’s signature, A. Ligny, is located at the lower right.
</P>
<P>
<A HREF="book_list.cfm?ID=2449">Click here
</a> for additional images available from this book.
</P>
<p>For information on licensing this image, please send an email, including a link to the image, to
<a href="mailto:dia@emory.edu?subject=Licensing%20Image%20From%20DIA - 17250">dia@emory.edu</a>
</p>
</div>
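How can I use BeautifulSoup to isolate just the description text from these tags? Everything I've found on Stack Overflow so far suggests that it's doable, but I haven't found anything that attempts this specific task.
Again, I want to extract only the description, "John the Baptist baptizes Jesus...", from the source code. How would I go about doing this?
Thanks! And again, apologies for my lack of knowledge.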
Answer 0 (score: 3)
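In this example, we can use CSS selectors. Assuming you are using BeautifulSoup 4.7+, CSS selector support is provided by the soupsieve library. We first use the :has() CSS level 4 selector to find <p> tags that have a direct child <b> tag, and then use soupsieve's non-standard :contains selector to ensure that the <b> tag contains Description:. Then we simply print the content of every element that meets these criteria, stripping leading and trailing whitespace and removing Description:. Keep in mind that there are multiple ways to do this; this is just the one I chose to illustrate: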
import bs4
markup = """
</div>
<div class="col-sm-6">
<P>
<b>Book Title:</b>
<A HREF="book_detail.cfm?ID=2449">The Holy Bible containing the Old and New Testaments, according to the authorised version. With illustrations by Gustave Doré</a>
</p>
<P>
<b>Author:</b> Doré, Gustave, 1832-1883
</p>
<P>
<b>Image Title:</b> Baptism of Jesus
</p>
<P>
<b>Scripture Reference:</b><ul><li>John 1 (<a href='search.cfm?biblicalbook=John&biblicalbookchapter=1'>further images</a> / <a rel='shadowbox;height=500;width=600' href='http://www.commonenglishbible.com/explore/passage-lookup/?query=John+1'>scripture text</a>)</li></ul>
</p>
<P>
<b>Description:</b> John the Baptist baptizes Jesus in the Jordan River; the Holy Spirit appears overhead in the form of a dove. The artist, Gustave Doré (1832-1883), has placed his signature at the lower left of the woodcut, and the engraver’s signature, A. Ligny, is located at the lower right.
</P>
<P>
<A HREF="book_list.cfm?ID=2449">Click here
</a> for additional images available from this book.
</P>
<p>For information on licensing this image, please send an email, including a link to the image, to
<a href="mailto:dia@emory.edu?subject=Licensing%20Image%20From%20DIA - 17250">dia@emory.edu</a>
</p>
</div>
"""
soup = bs4.BeautifulSoup(markup, "html.parser")
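# Select every <p> that has a direct child <b> containing "Description:".
for el in soup.select('p:has(> b:contains("Description:"))'):
    # Strip surrounding whitespace, then drop the leading label.
    print(el.get_text().strip().replace('Description: ', ''))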
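As a hedged variant of the selector approach above: newer soupsieve releases (2.1 and later) spell the non-standard selector :-soup-contains and treat the plain :contains form as deprecated. The sketch below assumes such a version is installed. The SAMPLE markup is a trimmed-down stand-in for the page, and extract_description is a hypothetical helper name, not something from the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down sample of the markup shown in the question.
SAMPLE = """
<div class="col-sm-6">
<P>
<b>Image Title:</b> Baptism of Jesus
</p>
<P>
<b>Description:</b> John the Baptist baptizes Jesus in the Jordan River.
</P>
</div>
"""


def extract_description(html):
    """Return the description text, or None if no matching <p> exists."""
    soup = BeautifulSoup(html, "html.parser")
    # :-soup-contains is the newer spelling of soupsieve's :contains
    # (assumes soupsieve 2.1 or later).
    el = soup.select_one('p:has(> b:-soup-contains("Description:"))')
    if el is None:
        return None
    text = el.get_text().strip()
    # Split once on the label so any later colons in the text are untouched.
    return text.split("Description:", 1)[1].strip()


print(extract_description(SAMPLE))
```

Returning None for pages without a description makes it easier to loop over many image IDs without the script stopping on the first page that lacks one.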
Answer 1 (score: 3)
With the following code, I was able to get almost what you want:
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
url = "http://pitts.emory.edu/dia/image_details.cfm?ID=17250"
f = urllib.request.urlopen(url)
soup = BeautifulSoup(f, 'html.parser')
parent = soup.find("b", text="Description:").parent
parent.find("b", text="Description:").decompose()
print(parent.text)
I added BeautifulSoup and removed the "Description:" label.
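A possible non-destructive variation on the same idea, offered only as a sketch: instead of decomposing the <b> tag, read the text node that follows it. It uses the string= argument, which recent BeautifulSoup versions prefer over text=. The SAMPLE markup and the description_after_label name are illustrative assumptions. Note that next_sibling returns only the first node after the label, so a description containing inline tags would be cut short.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down sample of the markup shown in the question.
SAMPLE = """
<P>
<b>Author:</b> Doré, Gustave, 1832-1883
</p>
<P>
<b>Description:</b> John the Baptist baptizes Jesus in the Jordan River.
</P>
"""


def description_after_label(html, label="Description:"):
    """Return the text node right after <b>label</b>, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # string= matches a <b> whose only content is exactly the label text.
    b = soup.find("b", string=label)
    if b is None or b.next_sibling is None:
        return None
    # next_sibling is the text that follows </b> inside the same <p>.
    return str(b.next_sibling).strip()


print(description_after_label(SAMPLE))
```

Because the parse tree is left unmodified, the same soup object could be reused to read the other labels, for example by passing label="Author:".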
Answer 2 (score: 0)
I used the <p> tags as an index and then picked index [4]. I'm just a novice, but it works.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://pitts.emory.edu/dia/image_details.cfm?ID=17250")
soup = BeautifulSoup(html, 'html.parser')
page = soup.find_all('p')[4].getText()
print(page)
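Picking a fixed index works for this page but depends on every page having the same number of <p> tags in the same order. As a rough alternative sketch, the labelled paragraphs can be collected into a dictionary keyed by the text of each <b> tag, so the description is looked up by name rather than position. The SAMPLE markup and the labelled_fields helper are hypothetical and assume each field follows the <p><b>Label:</b> value</p> pattern from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down sample of the markup shown in the question.
SAMPLE = """
<div class="col-sm-6">
<P>
<b>Image Title:</b> Baptism of Jesus
</p>
<P>
<b>Description:</b> John the Baptist baptizes Jesus in the Jordan River.
</P>
<P>
<A HREF="book_list.cfm?ID=2449">Click here
</a> for additional images available from this book.
</P>
</div>
"""


def labelled_fields(html):
    """Map each <b> label (without the colon) to the rest of its <p> text."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for p in soup.find_all("p"):
        # Only consider paragraphs that have a <b> label as a direct child.
        b = p.find("b", recursive=False)
        if b is None:
            continue
        label = b.get_text().strip()
        value = p.get_text().strip()
        # Remove the leading label from the paragraph text.
        if value.startswith(label):
            value = value[len(label):]
        # Collapse runs of whitespace and newlines into single spaces.
        fields[label.rstrip(":")] = " ".join(value.split())
    return fields


fields = labelled_fields(SAMPLE)
print(fields["Description"])
```

The same dictionary also exposes the other fields, such as fields["Image Title"], which may be useful when building a table of all prints.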