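Following siblings up to a condition in Selenium (Python)

Time: 2020-08-02 10:50:58

Tags: python selenium selenium-webdriver xpath

I am trying to collect the following siblings up to a certain sibling, but I still can't work out how to do it. I tried using the class name to look before and after the siblings, but the result is wrong.

My HTML is: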

<div class="MainClass">

        <div class="InfoClass">
            <div class="left-wrap">
              <span class="date">2 August 2020</span>
            </div>
        </div>

        <div class="DataClass">
            <em class="Code">
                <span>1</span>
            </em>
        </div>
        
        <div class="DataClass">
            <em class="Code">
                <span>2</span>
            </em>
        </div>
        
        <div class="DataClass">
            <em class="Code">
                <span>3</span>
            </em>
        </div>
        
        <div class="DataClass">
            <em class="Code">
                <span>4</span>
            </em>
        </div>
    
        <div class="InfoClass">
            <div class="left-wrap">
              <span class="date">15 August 2020</span>
            </div>
        </div>

        <div class="DataClass">
            <em class="Code">
                <span>5</span>
            </em>
        </div>

        <div class="DataClass">
            <em class="Code">
                <span>6</span>
            </em>
        </div>
</div>

This is my Python code:

mainClass = driver.find_elements_by_xpath("//div[@class='MainClass']//following-sibling::div[@class='InfoClass']")

for header in mainClass:
    kDate = header.find_element_by_xpath(".//span[@class='date']").text
    print(kDate)

    datarows = header.find_elements_by_xpath("following-sibling::div[@class='DataClass' and preceding-sibling::div[@class='DataClass']]")

    for datarow in datarows:
        mc = datarow.find_element_by_xpath(".//em[@class='Code']").text
        print("Code : "+mc)

The result I get:

2 August 2020
2
3
4
5
6
15 August 2020 
5
6

What I want instead is the "Code" class values grouped by date:

2 August 2020
1
2
3
4
15 August 2020 
5
6

4 answers:

Answer 0: (score: 2)

Given your expected output, why not extract the text from all the span elements, since they are already in document order? For example, with LXML (tree is the parsed document, built as in the longer example further down):

data=tree.xpath("//span/text()")
print(*data, sep="\n")

Output:

2 August 2020
1
2
3
4
15 August 2020
5
6

If you really want to use loops and build a dictionary, here is a suggestion. First, the data:

data = """<div class="MainClass">

        <div class="InfoClass">
            <div class="left-wrap">
              <span class="date">2 August 2020</span>
            </div>
        </div>

        <div class="DataClass">
            <em class="Code">
                <span>1</span>
            </em>
        </div>
        
        <div class="DataClass">
            <em class="Code">
                <span>2</span>
            </em>
        </div>
        
        <div class="DataClass">
            <em class="Code">
                <span>3</span>
            </em>
        </div>
        
        <div class="DataClass">
            <em class="Code">
                <span>4</span>
            </em>
        </div>
    
        <div class="InfoClass">
            <div class="left-wrap">
              <span class="date">15 August 2020</span>
            </div>
        </div>

        <div class="DataClass">
            <em class="Code">
                <span>5</span>
            </em>
        </div>

        <div class="DataClass">
            <em class="Code">
                <span>6</span>
            </em>
        </div>
</div>"""

Then the code:

import lxml.html
tree = lxml.html.fromstring(data)

dates = [el.text for el in tree.xpath("//span[@class='date']")]
print(dates)

dc=[]
for els in dates:
    lists=[el.text for el in tree.xpath("//div[span[text()='"+els+"']]/../following-sibling::div[@class='DataClass']//span[preceding::span[@class='date'][1][.='"+els+"']]")]
    dc.append(lists)

print(dc)

dictionary = dict(zip(dates,dc))
print(dictionary)

Comments:

First, the dates are extracted into a list. Then everything relies on the following XPath (the one you were looking for?) to get the corresponding DataClass values:

//div[span[text()='"+els+"']]/../following-sibling::div[@class='DataClass']//span[preceding::span[@class='date'][1][.='"+els+"']]

where els is a date fetched previously. The final predicate keeps only the spans whose nearest preceding date span equals els, which is what stops the match at the next InfoClass block.

Finally, the dictionary is constructed. The code is written for LXML. Just replace the LXML call with its Selenium equivalent (tree.xpath becomes driver.find_elements_by_xpath) to make it work.

Output (dates, DataClass values, dictionary):

['2 August 2020', '15 August 2020']
[['1', '2', '3', '4'], ['5', '6']]
{'2 August 2020': ['1', '2', '3', '4'], '15 August 2020': ['5', '6']}

Edit: if you need to print the dictionary, you can use:

for keys,values in dictionary.items():
    print(keys)
    print(*values,sep='\n')

Output as requested:

2 August 2020
1
2
3
4
15 August 2020
5
6
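
For comparison, the same grouping can also be done in a single pass over the direct children of MainClass, without building one XPath per date. This is only a minimal sketch, assuming the markup is well-formed enough to be parsed as XML by the standard library; group_codes and the shortened SAMPLE markup are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical copy of the question's markup (assumed well-formed XML).
SAMPLE = """<div class="MainClass">
  <div class="InfoClass"><div class="left-wrap"><span class="date">2 August 2020</span></div></div>
  <div class="DataClass"><em class="Code"><span>1</span></em></div>
  <div class="DataClass"><em class="Code"><span>2</span></em></div>
  <div class="InfoClass"><div class="left-wrap"><span class="date">15 August 2020</span></div></div>
  <div class="DataClass"><em class="Code"><span>5</span></em></div>
</div>"""


def group_codes(markup):
    """Group DataClass codes under the most recently seen InfoClass date."""
    root = ET.fromstring(markup)
    groups = {}
    current = None  # date of the InfoClass block seen most recently
    for child in root:  # direct children only, in document order
        cls = child.get("class")
        if cls == "InfoClass":
            current = child.find(".//span[@class='date']").text.strip()
            groups[current] = []
        elif cls == "DataClass" and current is not None:
            groups[current].append(child.find(".//span").text.strip())
    return groups


print(group_codes(SAMPLE))
```

The idea is that each InfoClass div opens a new group and every DataClass div that follows is appended to it until the next InfoClass appears, so no sibling axis is needed at all.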

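Answer 1: (score: 1)

Since all the divs containing the dates and the data are at the same level under the MainClass div, a single general XPath over all the spans that hold the dates and data gives the desired result.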

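from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://bilalzamel.htmlsave.net/")

# every span under MainClass, dates and codes alike, in document order
mainClass = driver.find_elements_by_xpath("//div[@class='MainClass']//span")
for mc in mainClass:
    kDate = mc.text
    print(kDate)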
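
The flat list prints the values in the right order but does not record which date each code belongs to. Here is a sketch of one way to rebuild the groups from that same list, assuming each element exposes .text and get_attribute("class") the way Selenium WebElements do; group_spans is a hypothetical helper, not part of Selenium.

```python
def group_spans(spans):
    """Split a flat, ordered list of span elements into {date: [codes]}."""
    groups = {}
    current = None  # text of the most recent span with class "date"
    for span in spans:
        if span.get_attribute("class") == "date":
            current = span.text
            groups[current] = []
        elif current is not None:
            groups[current].append(span.text)
    return groups


# Usage with the list from the answer (needs a live driver):
# print(group_spans(mainClass))
```

This relies on the date spans being the only ones that carry the class "date", as in the HTML from the question.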

Answer 2: (score: 1)

I found a way to display the desired text.

mainClassText = driver.find_element_by_xpath("//div[@class='MainClass']").text
print(mainClassText)

If you like, you can also convert it to a list.

mainClassTextList = mainClassText.split("\n")
for ele in mainClassTextList:
    print(ele)

Both cases display:

2 August 2020
1
2
3
4
15 August 2020
5
6
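
If the lines need to be grouped rather than just printed, the dates first have to be told apart from the codes. A rough sketch, assuming every date line matches the "2 August 2020" format with English month names; is_date and group_lines are hypothetical helpers introduced for illustration.

```python
from datetime import datetime


def is_date(line, fmt="%d %B %Y"):
    """Return True if the line parses as a date such as '2 August 2020'."""
    try:
        datetime.strptime(line.strip(), fmt)
        return True
    except ValueError:
        return False


def group_lines(lines):
    """Group the plain-text lines under the most recent date line."""
    groups = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if is_date(line):
            current = line
            groups[current] = []
        elif current is not None:
            groups[current].append(line)
    return groups


text = "2 August 2020\n1\n2\n3\n4\n15 August 2020\n5\n6"
print(group_lines(text.split("\n")))
```

Note that this depends on the text format, so it is more fragile than reading the class attributes from the DOM.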

Answer 3: (score: 1)

You can use the same simple code as in your previous question, but use a list to collect the values in case they are not unique. A list keeps every record even when a date such as 2 August 2020 or 15 August 2020, or a .Code value, appears more than once:

codes = list()
for e in driver.find_elements_by_class_name('Code'):
    code = e.text
    date = e.find_element_by_xpath("(./preceding::span[@class='date'])[last()]").text
    codes.append({"date": date, "code": code})

for c in codes:
    print(f'date: {c["date"]}, code: {c["code"]}')

Output:

date: 2 August 2020, code: 1
date: 2 August 2020, code: 2
date: 2 August 2020, code: 3
date: 2 August 2020, code: 4
date: 15 August 2020, code: 5
date: 15 August 2020, code: 6

If you want a dict with the date as the key and the list of codes as the value:

codes = dict()
for e in driver.find_elements_by_class_name('Code'):
    code = e.text
    date = e.find_element_by_xpath("(./preceding::span[@class='date'])[last()]").text
    if date in codes:
        codes[date].append(code)
    else:
        codes.update({date: [code]})

for k, v in codes.items():
    print(f'{k} : {v}')

With output:

2 August 2020 : ['1', '2', '3', '4']
15 August 2020 : ['5', '6']
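
The if/else used to build the dict can be shortened with collections.defaultdict. A small sketch, assuming the records are already in the list-of-dicts shape used earlier in this answer; group_records and the sample records are hypothetical and stand in for values read from the page.

```python
from collections import defaultdict


def group_records(records):
    """Turn [{'date': ..., 'code': ...}, ...] into {date: [codes]}."""
    grouped = defaultdict(list)  # missing dates start as an empty list
    for rec in records:
        grouped[rec["date"]].append(rec["code"])
    return dict(grouped)


# Hypothetical sample records in the same shape as the first loop produces.
records = [
    {"date": "2 August 2020", "code": "1"},
    {"date": "2 August 2020", "code": "2"},
    {"date": "15 August 2020", "code": "5"},
]

for date, codes in group_records(records).items():
    print(f"{date} : {codes}")
```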