Handling uneven data

Time: 2017-12-31 12:04:52

Tags: python selenium web-scraping

How do I handle problematic web pages where the data does not scrape correctly, similar to this?

I tried doing something similar below with no luck, since the structure of the page is not that simple. I am not sure how to handle the unequal data, because the web page becomes uneven as its data changes at random.

Desired

 Azam FC v Mwenge    1.8    https://www.bet365.com.au/#/AC/B1/C1/D13/E104/F16/S1/
 Western Sydney Wanderers v Melbourne City    2.87    https://www.bet365.com.au/#/AC/B1/C1/D13/E101/F16/S1/
 Sydney FC v Newcastle Jets    1.53    https://www.bet365.com.au/#/AC/B1/C1/D13/E101/F16/S1/

The output looks like

 Azam FC v Mwenge    1.8    https://www.bet365.com.au/#/AC/B1/C1/D13/E104/F16/S1/
 Western Sydney Wanderers v Melbourne City    1.53    https://www.bet365.com.au/#/AC/B1/C1/D13/E101/F16/S1/

The 1.53 should not be paired with Western Sydney Wanderers; it belongs to Sydney FC.
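
The rows drift out of alignment because each field is collected independently: when a row has no odds, the next row's price gets paired with the wrong match. Below is a minimal sketch of one way to keep the fields paired, scoping every lookup to a single row with a relative `.//` XPath so a missing cell becomes `None` instead of shifting the column. It reuses `driver` and the `groups` row XPath from Script.py below; the class names are copied from that script and are assumptions about the live page:

 # Minimal sketch: pair name and odds per row so a missing cell
 # yields None instead of shifting every later row.
 from selenium.common.exceptions import NoSuchElementException

 def scrape_row(row):
     # The leading ".//" keeps the search inside this row element; a bare
     # "//" searches the whole document and always returns the first match.
     try:
         name = row.find_element_by_xpath(
             ".//div[contains(@class, 'sl-CouponParticipantWithBookCloses_Name')]").text
     except NoSuchElementException:
         name = None
     try:
         odds = row.find_element_by_xpath(
             ".//div[contains(@class, 'gl-ParticipantOddsOnly')]").text
     except NoSuchElementException:
         odds = None
     return [name, odds]

 data = [scrape_row(row) for row in driver.find_elements_by_xpath(groups)]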

Script.py

 import collections
 import csv
 import time

 from selenium import webdriver
 from selenium.common.exceptions import TimeoutException, NoSuchElementException
 from selenium.webdriver.common.by import By
 from selenium.webdriver.support import expected_conditions as EC
 from selenium.webdriver.support.ui import WebDriverWait

 driver = webdriver.Chrome()
 driver.maximize_window()


 driver.get('https://www.bet365.com.au/#/AS/B1/')


 def page_counter():
     for x in range(1000):
         yield x

 count = page_counter()

 # wait until the coupon links under "Main Lists" are clickable
 WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//div[div/div/text()="Main Lists"]//div[starts-with(@class, "sm-CouponLink_Label") and normalize-space()]')))
 coupon_labels = [x.text for x in driver.find_elements_by_xpath('//div[div/div/text()="Main Lists"]//div[starts-with(@class, "sm-CouponLink_Label") and normalize-space()]')]

 links = dict((next(count) + 1, e) for e in coupon_labels)
 desc_links = collections.OrderedDict(sorted(links.items(), reverse=True))
 for key, label in desc_links.items():
     driver.get('https://www.bet365.com.au/#/AS/B1/')
     WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//div[div/div/text()="Main Lists"]//div[starts-with(@class, "sm-CouponLink_Label") and normalize-space()]')))
     driver.find_element_by_xpath(f'//div[contains(text(), "{label}")]').click()

     groups = '/html/body/div[1]/div/div[2]/div[1]/div/div[2]/div[2]/div/div/div[2]/div'
     xp_match_link = "//div//div[contains(@class, 'sl-CouponParticipantWithBookCloses_Name ')]"
     xp_bp1 = "//div[contains(@class, 'gl-Market_HasLabels')]/following-sibling::div[contains(@class, 'gl-Market_PWidth-12-3333')][1]//div[contains(@class, 'gl-ParticipantOddsOnly')]"

     try:
         # wait for the data to populate the tables
         WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.XPATH, xp_bp1)))
         time.sleep(2)

         data = []
         for elem in driver.find_elements_by_xpath(groups):
             try:
                 match_link = elem.find_element_by_xpath(xp_match_link) \
                     .get_attribute('href')
             except NoSuchElementException:
                 match_link = None

             try:
                 bp1 = elem.find_element_by_xpath(xp_bp1).text
             except NoSuchElementException:
                 bp1 = None

             data.append([bp1, match_link])
             # data.append([match_link, bp1, ba1, bp3, ba3])
         print(data)
         url1 = driver.current_url

         with open('C:\\daw.csv', 'a', newline='',
                   encoding="utf-8") as outfile:
             writer = csv.writer(outfile)
             for row in data:
                 writer.writerow(row)

     except TimeoutException as ex:
         pass
     except NoSuchElementException as ex:
         print(ex)
         break

 driver.close()

1 Answer:

Answer 0 (score: 0)

It should work if you change the following xpath:

xp_match_link = "//div//div[contains(@class, 'sl-CouponParticipantWithBookCloses_NameContainer ')]"
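
For context, here is how that change slots into the question's loop. This is a sketch, not tested against the live page: it assumes, as the answer implies, that the `_NameContainer` element is the one carrying the href, while the `_Name` div holds only the visible text. The `'.'` prefix added to the lookup is an extra tweak beyond the answer itself, restricting the search to the current row:

 # Sketch of the suggested change in context. Assumption: the
 # "_NameContainer" element carries the href; "_Name" holds only text.
 xp_match_link = "//div//div[contains(@class, 'sl-CouponParticipantWithBookCloses_NameContainer ')]"

 for elem in driver.find_elements_by_xpath(groups):
     try:
         # "." + "//..." yields ".//...", which searches only within `elem`;
         # the original bare "//" XPath searched the whole document.
         match_link = elem.find_element_by_xpath('.' + xp_match_link) \
             .get_attribute('href')
     except NoSuchElementException:
         match_link = None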