I am processing HTML from a web crawler using BeautifulSoup. The HTML is run through filters that "simplify" it, stripping and replacing tags so that the document contains only <html>, <body>, <div>, and <a> tags and visible text.
I currently have a function that extracts URLs and anchor text from these pages. In addition to these, I would also like to extract the N "context words" preceding and following the <a> tag for each link. For instance, if I have the following document:
<html><body>
<div>This is <a href="www.example.com">a test</a>
<div>There was a big fluffy dog outside the <a href="www.petfood.com">pet food store</a> with such a sad face.</div>
</div>
</body></html>
Then if N=8, I want to get the following 8 "context words" for each link:
'www.example.com' --> ('This', 'is', 'There', 'was', 'a', 'big', 'fluffy', 'dog')
'www.petfood.com' --> ('fluffy', 'dog', 'outside', 'the', 'with', 'such', 'a', 'sad')
The first link (www.example.com) has only two words preceding it before hitting the beginning of the document, so those two words are returned, along with the 6 words following the <a> tag to make the total of N=8. Also note that the words returned cross the boundary of the <a> tag's containing <div>.
The second link (www.petfood.com) has N/2 = 4 words that precede it and 4 that follow it, so those are returned as context. That is, where possible, the N words are split evenly between those preceding and those following the <a> tag.
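In other words, the selection rule I'm after looks roughly like this (a hypothetical helper; it assumes the word lists on each side of the link have already been extracted somehow):

def take_context(before, after, n):
    # Ideally take n//2 words from each side of the link; if one side
    # runs short, borrow the remainder from the other side.
    n_side = n // 2
    if len(before) < n_side:
        return before + after[:n - len(before)]
    if len(after) < n_side:
        return before[-(n - len(after)):] + after
    return before[-n_side:] + after[:n_side]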
I know how to do this if the text is within the same <div> as the link, but I cannot figure out how to do it across <div> boundaries like this. Basically, for the purpose of extracting "context words", I want to treat the document as if it were a single block of visible text with links, ignoring the containing divs.
How can I extract the text surrounding <a> tags like this using BeautifulSoup? For the sake of simplicity, I would even be satisfied with an answer that just returns the N characters of visible text preceding/following the tag (and I can handle the tokenizing/splitting myself).
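For reference, here is a minimal version of the kind of extraction function I have today (a sketch rather than my exact code; names are illustrative):

from bs4 import BeautifulSoup

def get_links(html):
    # Collect one (href, anchor_text) pair per <a> tag.
    soup = BeautifulSoup(html, 'html.parser')
    return [(a.get('href'), a.get_text()) for a in soup.find_all('a')]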
Answer (score: 2)
Here is a function that takes the full HTML code and N as input and, for each occurrence of an <a> element, creates a tuple with the link URL as its first element and the list of N context words as its second element. It returns the tuples in a list.
from bs4 import BeautifulSoup

def getContext(html, n):
    output = []
    soup = BeautifulSoup(html, 'html.parser')
    for i in soup.findAll("a"):
        n_side = int(n / 2)
        # Flatten the whole document into a single line of visible text.
        text = soup.text.replace('\n', ' ')
        # Split the text on the link's anchor text: everything to the
        # left is "before" context, everything to the right is "after".
        context_before = text.split(i.text)[0]
        words_before = list(filter(bool, context_before.split(" ")))
        context_after = text.split(i.text)[1]
        words_after = list(filter(bool, context_after.split(" ")))
        # Take n/2 words from each side; if one side runs short,
        # borrow the remainder from the other side.
        if len(words_after) >= n_side:
            words_before = words_before[-n_side:]
            words_after = words_after[:(n - len(words_before))]
        else:
            words_after = words_after[:n_side]
            words_before = words_before[-(n - len(words_after)):]
        output.append((i["href"], words_before + words_after))
    return output
The function parses the HTML with BeautifulSoup and finds all <a> elements. For each result, only the text is retrieved (using soup.text), with all newlines stripped out. The full text is then split in two using the link text as the separator. Each side is parsed into a list of words, filtered to remove any empty strings, and then sliced so that at most N context words are extracted in total.
For example:
html = '''<html><body>
<div>This is <a href="www.example.com">a test</a>
<div>There was a big fluffy dog outside the <a href="www.petfood.com">pet food store</a> with such a sad face.</div>
</div>
</body></html>'''
print(*getContext(html,8))
Output:
('www.example.com', ['This', 'is', 'There', 'was', 'a', 'big', 'fluffy', 'dog'])
('www.petfood.com', ['fluffy', 'dog', 'outside', 'the', 'with', 'such', 'a', 'sad'])
Demo: https://repl.it/@glhr/55609756-link-context
Edit: note that one flaw of this implementation is that it uses the link text as the separator to distinguish before from after. This can be a problem if the link text is repeated somewhere in the HTML document before the link itself, e.g.
<div>This test is <a href="www.example.com">test</a>
A simple workaround is to add special characters to the link text to make it unique, for example:
def getContext(html, n):
    output = []
    soup = BeautifulSoup(html, 'html.parser')
    for i in soup.findAll("a"):
        # Wrap the anchor text in markers so the split is unambiguous.
        i.string.replace_with(f"[[[[{i.text}]]]]")
        # rest of code here
This turns <div>This test is <a href="www.example.com">test</a> into <div>This test is <a href="www.example.com">[[[[test]]]]</a>.
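An alternative that sidesteps the separator problem entirely is to walk the document's text nodes directly instead of splitting the flattened text. This is a minimal sketch of that idea (not part of the original answer), using BeautifulSoup's find_all_previous/find_all_next navigation:

from bs4 import BeautifulSoup

def getContextByTraversal(html, n):
    soup = BeautifulSoup(html, 'html.parser')
    output = []
    for a in soup.findAll("a"):
        # Text nodes strictly before the tag; find_all_previous walks
        # backwards, so reverse to restore document order.
        before = ' '.join(reversed(a.find_all_previous(string=True))).split()
        # Text nodes after the tag, skipping the link's own anchor text.
        after = ' '.join(s for s in a.find_all_next(string=True)
                         if a not in s.parents).split()
        n_side = int(n / 2)
        if len(before) < n_side:
            words = before + after[:n - len(before)]
        elif len(after) < n_side:
            words = before[-(n - len(after)):] + after
        else:
            words = before[-n_side:] + after[:n_side]
        output.append((a["href"], words))
    return output

Because this version never searches for the anchor string, repeated link text elsewhere in the document cannot confuse it, and no marker characters are needed.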