I am processing HTML from a web crawler using BeautifulSoup. The HTML is run through filters that "simplify" it, stripping and replacing tags so that the document contains only <html>, <body>, <div>, and <a> tags and visible text.
I currently have a function that extracts URLs and anchor text from these pages. In addition to these, I would also like to extract the N "context words" preceding and following the <a> tag for each link. For instance, if I have the following document:
<html><body>
<div>This is <a href="www.example.com">a test</a>
<div>There was a big fluffy dog outside the <a href="www.petfood.com">pet food store</a> with such a sad face.</div>
</div>
</body></html>
Then if N=8, I want to get the following 8 "context words" for each link:
'www.example.com' --> ('This', 'is', 'There', 'was', 'a', 'big', 'fluffy', 'dog')
'www.petfood.com' --> ('fluffy', 'dog', 'outside', 'the', 'with', 'such', 'a', 'sad')
The first link (www.example.com) has only two words preceding it before hitting the beginning of the document, so those two words are returned, along with the 6 words following the <a> tag to make the total of N=8. Also note that the words returned cross the boundary of the <a> tag's containing <div>.
The second link (www.petfood.com) has N/2 = 4 words that precede it and 4 that follow it, so those are returned as context. That is, where possible, the N words are split evenly between those preceding and those following the <a> tag.
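In other words, the selection rule I'm after looks roughly like this (a hypothetical helper; it assumes the word lists on each side of the link have already been extracted somehow):

def take_context(before, after, n):
    # Ideally take n//2 words from each side of the link; if one side
    # runs short, borrow the remainder from the other side.
    n_side = n // 2
    if len(before) < n_side:
        return before + after[:n - len(before)]
    if len(after) < n_side:
        return before[-(n - len(after)):] + after
    return before[-n_side:] + after[:n_side]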
I know how to do this if the text is within the same <div> as the link, but I cannot figure out how to do it across <div> boundaries like this. Basically, for the purpose of extracting "context words", I want to treat the document as if it were a single block of visible text with links, ignoring the containing divs.
How can I extract the text surrounding <a> tags like this using BeautifulSoup? For the sake of simplicity, I would even be satisfied with an answer that just returns the N characters of visible text preceding/following the tag (and I can handle the tokenizing/splitting myself).
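For reference, here is a minimal version of the kind of extraction function I have today (a sketch rather than my exact code; names are illustrative):

from bs4 import BeautifulSoup

def get_links(html):
    # Collect one (href, anchor_text) pair per <a> tag.
    soup = BeautifulSoup(html, 'html.parser')
    return [(a.get('href'), a.get_text()) for a in soup.find_all('a')]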
Answer (score: 2)
Here is a function that takes the full HTML code and N as input and, for each occurrence of an <a> element, creates a tuple with the link URL as its first element and the list of N context words as its second element. It returns the tuples in a list.
from bs4 import BeautifulSoup

def getContext(html, n):
    output = []
    soup = BeautifulSoup(html, 'html.parser')
    for i in soup.findAll("a"):
        n_side = int(n / 2)
        # Flatten the whole document into a single line of visible text.
        text = soup.text.replace('\n', ' ')
        # Split the text on the link's anchor text: everything to the
        # left is "before" context, everything to the right is "after".
        context_before = text.split(i.text)[0]
        words_before = list(filter(bool, context_before.split(" ")))
        context_after = text.split(i.text)[1]
        words_after = list(filter(bool, context_after.split(" ")))
        # Take n/2 words from each side; if one side runs short,
        # borrow the remainder from the other side.
        if len(words_after) >= n_side:
            words_before = words_before[-n_side:]
            words_after = words_after[:(n - len(words_before))]
        else:
            words_after = words_after[:n_side]
            words_before = words_before[-(n - len(words_after)):]
        output.append((i["href"], words_before + words_after))
    return output
The function parses the HTML with BeautifulSoup and finds all <a> elements. For each result, only the text is retrieved (using soup.text), with all newlines stripped out. The full text is then split in two using the link text as the separator. Each side is parsed into a list of words, filtered to remove any empty strings, and then sliced so that at most N context words are extracted in total.
For example:
html = '''<html><body>
<div>This is <a href="www.example.com">a test</a>
<div>There was a big fluffy dog outside the <a href="www.petfood.com">pet food store</a> with such a sad face.</div>
</div>
</body></html>'''
print(*getContext(html,8))
Output:
('www.example.com', ['This', 'is', 'There', 'was', 'a', 'big', 'fluffy', 'dog'])
('www.petfood.com', ['fluffy', 'dog', 'outside', 'the', 'with', 'such', 'a', 'sad'])
Demo: https://repl.it/@glhr/55609756-link-context
Edit: note that one flaw of this implementation is that it uses the link text as the separator to distinguish before from after. This can be a problem if the link text is repeated somewhere in the HTML document before the link itself, e.g.
<div>This test is <a href="www.example.com">test</a>
A simple workaround is to add special characters to the link text to make it unique, for example:
def getContext(html, n):
    output = []
    soup = BeautifulSoup(html, 'html.parser')
    for i in soup.findAll("a"):
        # Wrap the anchor text in markers so the split is unambiguous.
        i.string.replace_with(f"[[[[{i.text}]]]]")
        # rest of code here
This turns <div>This test is <a href="www.example.com">test</a> into <div>This test is <a href="www.example.com">[[[[test]]]]</a>.
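An alternative that sidesteps the separator problem entirely is to walk the document's text nodes directly instead of splitting the flattened text. This is a minimal sketch of that idea (not part of the original answer), using BeautifulSoup's find_all_previous/find_all_next navigation:

from bs4 import BeautifulSoup

def getContextByTraversal(html, n):
    soup = BeautifulSoup(html, 'html.parser')
    output = []
    for a in soup.findAll("a"):
        # Text nodes strictly before the tag; find_all_previous walks
        # backwards, so reverse to restore document order.
        before = ' '.join(reversed(a.find_all_previous(string=True))).split()
        # Text nodes after the tag, skipping the link's own anchor text.
        after = ' '.join(s for s in a.find_all_next(string=True)
                         if a not in s.parents).split()
        n_side = int(n / 2)
        if len(before) < n_side:
            words = before + after[:n - len(before)]
        elif len(after) < n_side:
            words = before[-(n - len(after)):] + after
        else:
            words = before[-n_side:] + after[:n_side]
        output.append((a["href"], words))
    return output

Because this version never searches for the anchor string, repeated link text elsewhere in the document cannot confuse it, and no marker characters are needed.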