Getting different tag classes with Selenium C# while preserving their order

Date: 2017-12-04 12:58:29

Tags: c# html selenium

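I am scraping an ePUB Bible (there is an HTML page for each chapter), and I want to preserve the order of some tags that are spread across the HTML page. All the Bible chapters are similar and are written like this, or as a variation of it: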

<div>
  <p class="p">
  <a class="v">1</a> This is a verse.
  <a class="v">2</a> This is a verse. 
  <a class="v">3</a> This is a verse.
  </p>
  <h3 class="s1">This is a pericope</h3>
  <h4 class="r">This is a reference for this pericope.</h4>
  <p class="p">
  <a class="v">4</a> This is a verse.
  <a class="v">5</a> This is a verse. 
  </p>
  <p class="p">
  <a class="v">6</a> This is a verse with a quote:
  </p>
  <p class="q">"This is the content</p>
  <p class="q">of a quote;</p>
  <p class="q">or a spoken word."</p>
  <h3 class="s1">This is another pericope</h3>
  ...
</div>

In essence:

<class="v"> is an "a" element with the number of the verse;
<class="p"> is a "p" element with a verse or a collection of verses;
<class="q"> is a blockquote;
<class="s1"> is a pericope;
<class="r"> is a reference for a pericope;

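Getting all the p elements gives me good results, as described in another related question on SO, but is it possible to scrape the page while preserving the order in which the elements appear?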

For the example above, I would like to get the elements in order, so that I can rewrite the chapter's content as text:

  

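1 This is a verse. 2 This is a verse. 3 This is a verse.
This is a pericope
This is a reference for this pericope.
4 This is a verse. 5 This is a verse. 6 This is a verse with a quote:
"This is the content
of a quote;
or a spoken word."
This is another pericope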

     

...

I can't seem to find a way to do this with Selenium, but if there is one, what would it be?

1 Answer:

Answer 0: (score: 0)

Try this code block and let me know how it goes:

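// Select every p, h3 and h4 element; the XPath result comes back in document order.
IList<IWebElement> elements = driver.FindElements(By.XPath("//*[self::p or self::h3 or self::h4]"));
foreach (IWebElement element in elements)
{
    // innerHTML includes nested markup such as the <a class="v"> tags;
    // element.Text returns only the visible text.
    string my_text = element.GetAttribute("innerHTML");
    Console.WriteLine(my_text);
}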

The output on my console is as follows:

1 This is a verse.
2 This is a verse. 
3 This is a verse.
This is a pericope
This is a reference for this pericope.
4 This is a verse.
5 This is a verse.
6 This is a verse with a quote:
"This is the content
of a quote;
or a spoken word."
This is another pericope
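
Once the elements come back in document order, each block's class attribute can decide how it is written out. The following is only a rough sketch of that post-processing step, kept free of Selenium so it runs on its own. `ChapterFormatter` and `Render` are hypothetical names, and the layout choices (a blank line before a pericope heading, a two-space indent for quote lines) are arbitrary assumptions, not something the question specifies. With Selenium, the (class, text) pairs would presumably come from calls like `element.GetAttribute("class")` and `element.Text` on each element found.

```csharp
using System;
using System.Collections.Generic;

public static class ChapterFormatter
{
    // Hypothetical helper: turns (class, text) pairs that are already in
    // document order into plain text. Class names follow the question's legend.
    public static string Render(IEnumerable<(string CssClass, string Text)> blocks)
    {
        var lines = new List<string>();
        foreach (var (cssClass, text) in blocks)
        {
            switch (cssClass)
            {
                case "s1": // pericope heading: separate it with a blank line
                    if (lines.Count > 0) lines.Add("");
                    lines.Add(text);
                    break;
                case "q":  // quote line: indent by two spaces (arbitrary choice)
                    lines.Add("  " + text);
                    break;
                default:   // "p", "r" and anything unrecognized: plain line
                    lines.Add(text);
                    break;
            }
        }
        return string.Join("\n", lines);
    }

    public static void Main()
    {
        // Sample data standing in for what the driver would return.
        var blocks = new List<(string CssClass, string Text)>
        {
            ("p", "1 This is a verse. 2 This is a verse. 3 This is a verse."),
            ("s1", "This is a pericope"),
            ("r", "This is a reference for this pericope."),
            ("p", "6 This is a verse with a quote:"),
            ("q", "\"This is the content"),
            ("q", "of a quote;"),
        };
        Console.WriteLine(Render(blocks));
    }
}
```

Keeping the formatting separate from the lookup means the same function can be reused if the selector changes, for example to cover a variation in a chapter's markup.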

Update:

As you can see, the XPath is correct and returns the right results. Here is a snapshot:

[Image: Bible]