Get all <div> siblings up to the next <div> with specific text, using lxml.html and XPath

Date: 2019-09-12 11:03:05

Tags: python html xpath lxml

I have an HTML file whose content looks like this:

<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>

So I need an XPath expression that gets all the text divs belonging to each file.

I wrote the following:

from lxml import html
h = '''
<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>'''
tree = html.fromstring(h)
files_div = tree.xpath(r"//div[contains(text(),'File:')]")
files = dict()
for file_div in files_div:
    files[file_div] = file_div.xpath(r"./following-sibling::div[not(contains(text(),'File')) and contains(text(),'Text')]")

However, the XPath expression above gives me the text divs of all files, while I only want the ones belonging to the matching file. What should the XPath expression look like?

Thanks

4 Answers:

Answer 0 (score: 0)

You can use

/*/div[contains(text(), 'File:')][1]/following-sibling::div[contains(text(), 'Text')  and count(preceding-sibling::div[contains(text(), 'File:')])=1]

This XPath selects all DIV elements that contain the word Text and come after the first DIV containing File:; the count() on the preceding File: siblings makes it stop before the next file.

For the second file, use

/*/div[contains(text(), 'File:')][2]/following-sibling::div[contains(text(), 'Text')  and count(preceding-sibling::div[contains(text(), 'File:')])=2]

and so on. So you loop over the number of elements containing File:.
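For reference, a rough Python sketch of that loop with lxml (the file_xpath, files, and name_div names are mine, and I use the parenthesised form (//div[...])[i] rather than /*/div[...][i] because lxml.html.fromstring wraps a bare fragment like this in an extra parent element):

from lxml import html

h = '''
<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>'''

tree = html.fromstring(h)

# all <div> elements whose text contains 'File:'
file_xpath = "//div[contains(text(), 'File:')]"
n_files = len(tree.xpath(file_xpath))

files = {}
for i in range(1, n_files + 1):  # XPath positions are 1-based
    name_div = tree.xpath(f"({file_xpath})[{i}]")[0]
    # 'Text' divs that follow the i-th 'File:' div and are still preceded
    # by exactly i 'File:' divs, i.e. everything up to the next file
    text_divs = tree.xpath(
        f"({file_xpath})[{i}]/following-sibling::div"
        "[contains(text(), 'Text') and "
        f"count(preceding-sibling::div[contains(text(), 'File:')]) = {i}]"
    )
    files[name_div.text] = [d.text for d in text_divs]

print(files)
# {'File: NameFile1': ['Text1: some text', 'Text2: another text', 'Text3: another text'],
#  'File: NameFile2': ['Text1: some text', 'Text2: another text', 'Text3: another text']}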

Answer 1 (score: 0)

For a problem like this I would suggest using BeautifulSoup.

One solution would be:

h = '''
<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(h, 'html.parser')

files = {}
x = soup.find('div')          # start at the first <div>
current_file = ''
while True:
    if 'File:' in x.text:
        # a new file starts here: open a new entry
        current_file = x.text
        files[current_file] = []
    else:
        # otherwise the div holds text belonging to the current file
        files[current_file].append(x.text)

    x = x.find_next_sibling('div')  # move to the next <div> sibling
    if x is None:
        break


Answer 2 (score: 0)

You can use BeautifulSoup together with str.split:

from bs4 import BeautifulSoup as soup
# d is the HTML string from the question
r = [b for _, b in map(lambda x: x.text.split(': '), soup(d, 'html.parser').find_all('div'))]

Output:

['NameFile1', 'some text', 'another text', 'another text', 'NameFile2', 'some text', 'another text', 'another text']
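
If the pairs need to be grouped per file, as the question asks, one possible extension of this (my own sketch, not part of the original answer) would be:

from bs4 import BeautifulSoup as soup

# d is the HTML from the question
d = '''<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>'''

files = {}
current = None
for key, value in (div.text.split(': ', 1) for div in soup(d, 'html.parser').find_all('div')):
    if key == 'File':
        current = value               # start a new entry for this file
        files[current] = []
    else:
        files[current].append(value)  # attach the text to the current file

print(files)
# {'NameFile1': ['some text', 'another text', 'another text'],
#  'NameFile2': ['some text', 'another text', 'another text']}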

Answer 3 (score: 0)

With bs4 4.7.1 this is very simple, as you can filter with :contains.

If you want the whole tags:

from bs4 import BeautifulSoup as bs

html = '''<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>'''

soup = bs(html, 'lxml')
search_term = 'File: '
files_div = [i.text.replace(search_term,'') for i in soup.select(f'div:contains("{search_term}")')]
files = dict()

for number, file_div in enumerate(files_div):
    if file_div != files_div[-1]:
        files[file_div] = soup.select(f'div:contains("{file_div}"), div:contains("{file_div}") ~ div:not(div:contains("' + files_div[number+1] + '"), div:contains("' + files_div[number+1] + '") ~ div)')
    else:
        files[file_div] = soup.select(f'div:contains("{file_div}"),div:contains("{file_div}") ~ div')

print(files) 

If you only need the .text of each tag:

for number, file_div in enumerate(files_div):
    if file_div != files_div[-1]:
        files[file_div] = [i.text for i in soup.select(f'div:contains("{file_div}"), div:contains("{file_div}") ~ div:not(div:contains("' + files_div[number+1] + '"), div:contains("' + files_div[number+1] + '") ~ div)')]
    else:
        files[file_div] = [i.text for i in soup.select(f'div:contains("{file_div}"),div:contains("{file_div}") ~ div')]