Question

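I am trying to scrape data from a dynamic table on a website built with AngularJS. I am using Selenium to scrape the site.

My current problem is that I cannot identify the dynamic table, because it has no usable tags. On top of that, the row IDs are dynamically named strings, which complicates things further. Any help is appreciated.

I tried searching by ID / XPath and appending the results to a list of elements, without success.

The information I want (the collection dates) is contained in a large table that is generated dynamically from several parameters: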

<tctable id="tweb_EPVisitNumber_List_1">

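There are multiple rows, each containing several other parameters. Below is an example of the one small column I am interested in. I want to get all the dates from all the elements in the generated table.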

<tccol layout-xs="column" class="layout-xs-column">
<div>
<span id="web_EPVisitNumber_List_1-row-0-item-CollectionDate-label" class="componentTableItemLabel hide-gt-xs ng-binding ng-scope">Collection Date
</span>
<span class="componentTableItem ng-scope">
<span id="web_EPVisitNumber_List_1-row-0-item-CollectionDate" class="ng-binding">17/01/2019
</span>
</span>
</div>
</tccol>

As the table goes on, the IDs change as dynamic strings. For example, the matching elements in the following rows would be:

id="web_EPVisitNumber_List_1-row-1-item-CollectionDate" 
id="web_EPVisitNumber_List_1-row-2-item-CollectionDate"
id="web_EPVisitNumber_List_1-row-3-item-CollectionDate"

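and so on.

My problem is that, first, I cannot locate the specific element within the larger table, and second, because the ID string changes dynamically, I cannot iterate over the elements either.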

Answer 1

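You will have to find some attribute the elements have in common and construct a locator based on it. For example, in the given sample, all the interesting spans have "CollectionDate" in their id but do not have "-label" (which the column header has).
Such an XPath would be:

//span[contains(@id, "CollectionDate") and not(contains(@id, "-label"))]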

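Another observation: all the "interesting" elements are spans with an id, inside a div inside a tccol, and the first such span is the column header, so it is skipped:

//tccol/div/span[@id and not(position()=1)]

Note that in the sample HTML the date span is nested one level deeper, inside the span with class componentTableItem, so against that exact markup the path would probably need an extra step, for example //tccol/div/span/span[@id].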
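
Here is a rough sketch of how the first XPath could be used from Python. The helper names column_xpath and collect_column are made up for illustration, driver is assumed to be an already-initialised Selenium WebDriver with the page loaded, and the "-item-" part of the pattern is taken from the sample ids in the question.

```python
from typing import List


def column_xpath(column: str) -> str:
    """Build an XPath for every value cell of one column.

    Matches spans whose id contains "-item-<column>" and skips the
    header spans, whose ids end in "-label".
    """
    return (
        f'//span[contains(@id, "-item-{column}") '
        f'and not(contains(@id, "-label"))]'
    )


def collect_column(driver, column: str = "CollectionDate") -> List[str]:
    """Return the visible text of every matching cell, in document order."""
    # "xpath" is the string value of Selenium's By.XPATH constant;
    # By.XPATH can be passed here instead once selenium is imported.
    elements = driver.find_elements("xpath", column_xpath(column))
    return [element.text.strip() for element in elements]
```

Calling collect_column(driver) should then give one string per row, regardless of the row number embedded in each id. Because the page is rendered by Angular, the cells may not exist yet right after the page loads, so an explicit wait before the call is likely needed.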

Answer 2

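To get all the dates from all the elements in the generated table, since these are Angular elements, you need to induce WebDriverWait for all the elements to become visible. You can use the following solution:

Using XPath: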

dates = []
date_elements = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//tctable[starts-with(@id, 'tweb_EPVisitNumber_List_')]//span[contains(@class,'componentTableItemLabel') and normalize-space()='Collection Date']//following::span[1]/span[starts-with(@id, 'web_EPVisitNumber_List_')]")))
for date_element in date_elements:
    dates.append(date_element.text)

The same thing as a one-liner:

dates = [date_element.text for date_element in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//tctable[starts-with(@id, 'tweb_EPVisitNumber_List_')]//span[contains(@class,'componentTableItemLabel') and normalize-space()='Collection Date']//following::span[1]/span[starts-with(@id, 'web_EPVisitNumber_List_')]")))]

Note: you have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
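
Once the date strings are collected, they usually need to be parsed and saved somewhere. The sketch below is only one way to do that with the standard library. It assumes the texts are in day/month/year form, as in the sample value 17/01/2019, and the function names and the collection_date header are invented for the example.

```python
import csv
from datetime import date, datetime
from typing import Iterable, List, TextIO


def parse_dates(texts: Iterable[str], fmt: str = "%d/%m/%Y") -> List[date]:
    """Convert scraped strings such as '17/01/2019' to date objects.

    Surrounding whitespace is stripped and empty strings are skipped;
    a string that does not match fmt raises ValueError.
    """
    return [datetime.strptime(t.strip(), fmt).date() for t in texts if t.strip()]


def write_dates_csv(dates: Iterable[date], out: TextIO) -> None:
    """Write one ISO-formatted date per row under a single header."""
    writer = csv.writer(out)
    writer.writerow(["collection_date"])  # hypothetical column name
    for d in dates:
        writer.writerow([d.isoformat()])
```

With the dates list from above, this could be used as parse_dates(dates) followed by write_dates_csv(...) on a file opened with open("dates.csv", "w", newline="").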

How to iterate over and save information from a dynamic table on an AngularJS website using Selenium in Python
