I have an HTML file with the following content:
<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
I need an XPath expression that returns, for each file, all of the text divs that belong to it.
This is what I wrote:
from lxml import html
h = '''
<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>'''
tree = html.fromstring(h)
files_div = tree.xpath(r"//div[contains(text(),'File:')]")
files = dict()
for file_div in files_div:
    # Problem: this matches every later Text div, not only this file's
    files[file_div] = file_div.xpath(r"following-sibling::div[not(contains(text(),'File')) and contains(text(),'Text')]")
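However, this XPath expression returns the text divs of all files, while I only want the ones that belong to the matching file. What should the XPath expression look like?
Thanks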
Answer 0 (score: 0)
You can use
/*/div[contains(text(), 'File:')][1]/following-sibling::div[contains(text(), 'Text') and count(preceding-sibling::div[contains(text(), 'File:')])=1]
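This XPath selects all DIV elements that contain the word Text, follow the first element containing File:, and have exactly one File: element among their preceding siblings.
For the second file, use
/*/div[contains(text(), 'File:')][2]/following-sibling::div[contains(text(), 'Text') and count(preceding-sibling::div[contains(text(), 'File:')])=2]
and so on.
So loop over the number of elements that contain File:, substituting the index each time.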
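The loop described above can be sketched with lxml as follows. This is a minimal sketch, not the answerer's own code: it assumes the markup is parsed with `lxml.html.fromstring`, which puts several top-level divs under one parent element, so the path is written relative to each File div rather than starting with `/*/div`. The names `texts_for_file`, `file_divs` and `name` are illustrative, and `$n` is an XPath variable that lxml fills in from the keyword argument `n`.

```python
from lxml import html

h = '''
<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>'''

tree = html.fromstring(h)

# Keep only the Text divs that have exactly n "File:" divs before them,
# i.e. the ones that sit between the n-th File div and the next one.
texts_for_file = (
    "following-sibling::div[contains(text(), 'Text') and "
    "count(preceding-sibling::div[contains(text(), 'File:')]) = $n]"
)

files = {}
file_divs = tree.xpath(".//div[contains(text(), 'File:')]")
for n, file_div in enumerate(file_divs, start=1):
    # "File: NameFile1" -> "NameFile1"
    name = file_div.text.split(': ', 1)[1]
    files[name] = [div.text for div in file_div.xpath(texts_for_file, n=n)]

print(files)
```

Passing the index as a variable avoids building a new XPath string for every file. Each key of `files` is a file name and each value is the list of Text lines that follow it up to the next File div.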
Answer 1 (score: 0)
For a problem like this, I would suggest using BeautifulSoup.
One solution would be:
h = '''
<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>'''
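from bs4 import BeautifulSoup
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(h, 'html.parser')
files = {}
x = soup.find('div')
current_file = ''
# Walk the sibling divs in document order
while x is not None:
    if 'File:' in x.text:
        # A "File:" div starts a new group
        current_file = x.text
        files[current_file] = []
    else:
        # Any other div belongs to the most recent file
        files[current_file].append(x.text)
    x = x.find_next_sibling('div')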
Answer 2 (score: 0)
You can use BeautifulSoup together with str.split:
from bs4 import BeautifulSoup as soup
# d is the HTML string from the question
r = [div.text.split(': ', 1)[1] for div in soup(d, 'html.parser').find_all('div')]
Output:
['NameFile1', 'some text', 'another text', 'another text', 'NameFile2', 'some text', 'another text', 'another text']
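The flat list above no longer shows which text belongs to which file. A minimal sketch of the same split-based idea that keeps the grouping might look like this; it assumes `d` holds the HTML string from the question, and the names `grouped` and `current` are illustrative only.

```python
from bs4 import BeautifulSoup

d = '''
<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>'''

grouped = {}
current = None
for div in BeautifulSoup(d, 'html.parser').find_all('div'):
    # "Text1: some text" -> label "Text1", value "some text"
    label, _, value = div.text.partition(': ')
    if label == 'File':
        # A File div opens a new group keyed by the file name
        current = value
        grouped[current] = []
    elif current is not None:
        # Every other div is attached to the most recent file
        grouped[current].append(value)

print(grouped)
```

`str.partition` is used instead of `str.split` so that a value which itself contains `': '` is not cut into more than two pieces.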
Answer 3 (score: 0)
With bs4 4.7.1 this is straightforward: you can filter with the :contains pseudo-class.
If you want the whole tags:
from bs4 import BeautifulSoup as bs
html = '''<div>File: NameFile1</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>
<div>File: NameFile2</div>
<div>Text1: some text</div>
<div>Text2: another text</div>
<div>Text3: another text</div>'''
soup = bs(html, 'lxml')
search_term = 'File: '
files_div = [i.text.replace(search_term, '') for i in soup.select(f'div:contains("{search_term}")')]
files = dict()
for number, file_div in enumerate(files_div):
    if number < len(files_div) - 1:
        # Not the last file: exclude the next file's div and everything after it
        next_file = files_div[number + 1]
        files[file_div] = soup.select(f'div:contains("{file_div}"), div:contains("{file_div}") ~ div:not(div:contains("{next_file}"), div:contains("{next_file}") ~ div)')
    else:
        # Last file: take its div and all following sibling divs
        files[file_div] = soup.select(f'div:contains("{file_div}"), div:contains("{file_div}") ~ div')
print(files)
If you only need the .text of each tag:
for number, file_div in enumerate(files_div):
    if number < len(files_div) - 1:
        # Not the last file: exclude the next file's div and everything after it
        next_file = files_div[number + 1]
        files[file_div] = [i.text for i in soup.select(f'div:contains("{file_div}"), div:contains("{file_div}") ~ div:not(div:contains("{next_file}"), div:contains("{next_file}") ~ div)')]
    else:
        # Last file: take its div and all following sibling divs
        files[file_div] = [i.text for i in soup.select(f'div:contains("{file_div}"), div:contains("{file_div}") ~ div')]