I am working with this URL (http://www.ancient-hebrew.org/m/dictionary/1000.html).
Here is my code.
from bs4 import BeautifulSoup
import re

raw_html = open('/Users/gansaikhanshur/TESTING/webScraping/1000.html').read()
# lxml is faster. If you don't have it, pip install lxml
html = BeautifulSoup(raw_html, 'lxml')

# outputs: "http://www.ancient-hebrew.org/files/heb-anc-sm-beyt.jpg"
# (the dot is escaped so that it matches a literal "." before "jpg")
images = html.find_all('img', src=re.compile(r'\.jpg$'))
for image in images:
    image = re.sub(
        r"\.\./\.\./", r"http://www.ancient-hebrew.org/", image['src'])
    # print(image)

# outputs: "unicode_hebrew_text"
fonts = html.find_all('font', face="arial", size="+1")
for f in fonts:
    f = f.string.strip()
    print(f)

# outputs: "http://www.ancient-hebrew.org/m/dictionary/audio/998.mp3"
mp3links = html.find_all('a', href=re.compile(r'\.mp3$'))
for mp3 in mp3links:
    mp3 = "http://www.ancient-hebrew.org/m/dictionary/" + \
        mp3['href'].replace("\t", '')
    # print(mp3)
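I am trying to find the image files, the text, and the audio files. Currently my code finds everything except the text that comes after </Font>. For example, I am trying to find e-leph and eym, but I am not sure how to do that.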
<A Name= 505 ></A> <IMG SRC="../../files/heb-anc-sm-pey.jpg"><IMG SRC="../../files/heb-anc-sm-lamed.jpg"><IMG SRC="../../files/heb-anc-sm-aleph.jpg"> <Font face="arial" size="+1"> אֶלֶף </Font> e-leph <BR> Thousand <BR> Ten times one hundred in amount or number. <BR>Strong's Number: 505 <BR><A HREF="audio/ 505 .mp3"><IMG SRC="../../files/icon_audio.gif" width="25" height="25" border="0"></A><BR> <A HREF=../ahlb/aleph.html#505><Font color=A50000><B>AHLB</B></Font></A> <HR>
<A Name= 517 ></A> <IMG SRC="../../files/heb-anc-sm-mem.jpg"><IMG SRC="../../files/heb-anc-sm-aleph.jpg"> <Font face="arial" size="+1"> אֵם </Font> eym <BR> Mother <BR> A female parent. Maternal tenderness or affection. One who fulfills the role of a mother. <BR>Strong's Number: 517 <BR><A HREF="audio/ 517 .mp3"><IMG SRC="../../files/icon_audio.gif" width="25" height="25" border="0"></A><BR> <A HREF=../ahlb/aleph.html#517><Font color=A50000><B>AHLB</B></Font></A> <HR>
So, in the end, I want to find all the words that come after the Unicode text, such as e-leph and eym.
Answer 0 (score: 1)
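If the desired outputs all look like the examples listed in the question, we can define a character class ([\w-]), add every character we want to collect to it, and then use <\/font> as the left boundary and <br> as the right boundary. We also add groups for optional whitespace, so the expression looks like:
<\/font>(\s+)?([\w-]+?)(\s+)?<
or
<\/font>(\s+)?([\w-]+?)(\s+)?<br>
with the i flag.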
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"<\/font>(\s+)?([\w-]+?)(\s+)?<"
test_str = ("<A Name= 505 ></A> <IMG SRC=\"../../files/heb-anc-sm-pey.jpg\"><IMG SRC=\"../../files/heb-anc-sm-lamed.jpg\"><IMG SRC=\"../../files/heb-anc-sm-aleph.jpg\"> <Font face=\"arial\" size=\"+1\"> אֶלֶף </Font> e-leph <BR> Thousand <BR> Ten times one hundred in amount or number. <BR>Strong's Number: 505 <BR><A HREF=\"audio/ 505 .mp3\"><IMG SRC=\"../../files/icon_audio.gif\" width=\"25\" height=\"25\" border=\"0\"></A><BR> <A HREF=../ahlb/aleph.html#505><Font color=A50000><B>AHLB</B></Font></A> <HR>\n"
" <A Name= 517 ></A> <IMG SRC=\"../../files/heb-anc-sm-mem.jpg\"><IMG SRC=\"../../files/heb-anc-sm-aleph.jpg\"> <Font face=\"arial\" size=\"+1\"> אֵם </Font> eym <BR> Mother <BR> A female parent. Maternal tenderness or affection. One who fulfills the role of a mother. <BR>Strong's Number: 517 <BR><A HREF=\"audio/ 517 .mp3\"><IMG SRC=\"../../files/icon_audio.gif\" width=\"25\" height=\"25\" border=\"0\"></A><BR> <A HREF=../ahlb/aleph.html#517><Font color=A50000><B>AHLB</B></Font></A> <HR>\n")
matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
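If only the list of words is needed, the match positions can be skipped. The following is a minimal sketch, not the code above: it assumes each transliteration is a single token of word characters and hyphens sitting directly between </Font> and <BR>, and it runs on a trimmed copy of the question's HTML. The names `html`, `pattern`, and `words` are illustrative.

```python
import re

# Two entries trimmed from the question's HTML (only the parts the pattern touches).
html = (
    '<Font face="arial" size="+1"> אֶלֶף </Font> e-leph <BR> Thousand <BR>\n'
    '<Font face="arial" size="+1"> אֵם </Font> eym <BR> Mother <BR>\n'
)

# Same boundaries as the expression above, but the optional whitespace is left
# uncaptured so that the transliteration is the only captured group.
pattern = re.compile(r"</font>\s*([\w-]+?)\s*<br>", re.IGNORECASE)

# With a single capturing group, findall returns the captured strings directly.
words = pattern.findall(html)
print(words)
```

A transliteration containing a space or an apostrophe would not match this character class, so the class would need to be widened if the page has such entries.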
If this expression is not what you want, or you wish to modify it, you can do so at regex101.com.
The expression can also be visualized at jex.im.
Answer 1 (score: 1)
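You don't need a regex. Use next_sibling together with the CSS selector shown below.
Each entry consists of the glyph images (img), then the font tag, then the text.
Use the adjacent sibling combinator + to select the font tag that immediately follows an img tag. Then next_sibling takes you to the word.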
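To see the sibling relationship without fetching the page, here is a minimal offline sketch run against a trimmed copy of the HTML from the question. It assumes Beautiful Soup is installed and uses the built-in `html.parser` instead of `lxml` only to avoid an extra dependency; `snippet` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Two entries trimmed from the question's HTML: glyph images, font tag, then the word.
snippet = '''
<IMG SRC="../../files/heb-anc-sm-aleph.jpg"> <Font face="arial" size="+1"> אֶלֶף </Font> e-leph <BR> Thousand <BR>
<IMG SRC="../../files/heb-anc-sm-mem.jpg"><IMG SRC="../../files/heb-anc-sm-aleph.jpg"> <Font face="arial" size="+1"> אֵם </Font> eym <BR> Mother <BR>
'''

soup = BeautifulSoup(snippet, 'html.parser')

# 'img + font' selects each font element whose previous element sibling is an img.
# The node right after that font element is the text node holding the word.
words = [font.next_sibling.strip() for font in soup.select('img + font')]
print(words)
```

The same selector applied to the live page: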
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://www.ancient-hebrew.org/m/dictionary/1000.html')
soup = bs(r.content, 'lxml')
words = [item.next_sibling.strip() for item in soup.select('img + font')]
Sample output: