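Scraper collects most of the data but misses a few fields

Time: 2017-08-24 14:48:15

Tags: python python-3.x selenium selenium-webdriver web-scraping

I have written a script in Python using Selenium to scrape the complete flight schedule from a web page. After running it, I can see that it works fine so far, except that a few fields are not parsed. I have inspected the elements that hold the data, but I see no difference between the elements that were scraped and the ones that are missing. How can I get the complete content? Thanks in advance.

Here is the script I tried: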

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.yvr.ca/en/passengers/flights/departing-flights")
wait = WebDriverWait(driver, 10)

item = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "table.yvr-flights__table")))
list_of_data = [[item.text for item in data.find_elements(By.CSS_SELECTOR, 'td')]
                    for data in item.find_elements(By.CSS_SELECTOR, 'tr')]
for tab_data in list_of_data:
    print(tab_data)

driver.quit()

Here is a partial screenshot of the data [one field missing, one scraped]: https://www.dropbox.com/s/xaqeiq97b6upj5j/flight_stuff.jpg?dl=0

Here are the td elements of one row:

<tr class="yvr-flights__row  yvr-flights__row--departed " id="226792377">
            <td>
                <time class="yvr-flights__label yvr-flights__scheduled-label yvr-flights__scheduled-label--departed notranslate" datetime="2017-08-24T06:20:00-07:00">
                    06:20
                </time>
            </td>
            <td class="yvr-flights__table-cell--revised notranslate">
                        <time class="yvr-flights__label yvr-flights__revised-label yvr-flights__revised-label--departed" datetime="2017-08-24T06:20:00-07:00">
                            06:19
                        </time>
            </td>
            <td class="yvr-table__cell yvr-flights__flightNumber notranslate">AC560</td>
            <td class="hidden-until--md yvr-table__cell yvr-table__cell--fade-out yvr-table__cell--nowrap notranslate">Air Canada</td>
            <td class="yvr-table__cell yvr-table__cell--fade-out yvr-table__cell--nowrap notranslate">San Francisco</td>
            <td class="hidden-until--md yvr-table__cell yvr-table__cell--nowrap notranslate">
Main                
            </td>
            <td class="hidden-until--md yvr-table__cell yvr-table__cell--nowrap notranslate">E87</td>

            <td class="yvr-flights__table-cell--status yvr-table__cell--nowrap">
                    <span class="yvr-flights__status yvr-flights__status--departed">Departed</span>
            </td>
            <td class="hidden-until--md yvr-table__cell yvr-table__cell--nowrap">
            </td>
            <td class="visible-until--md">
                <button class="yvr-flights__toggle-flight">Toggle flight</button>
            </td>
        </tr>

3 Answers:

Answer 0: (score: 1)

You should be able to get all the details simply by opening this URL:

http://www.yvr.ca/en/_api/Flights?%24filter=FlightScheduledTime%20gt%20DateTime%272017-08-24T00%3A00%3A00%27%20and%20FlightScheduledTime%20lt%20DateTime%272017-08-25T00%3A00%3A00%27%20and%20FlightType%20eq%20%27D%27&%24orderby=FlightScheduledTime%20asc

Unescaped (URL-decoded), it looks like this:

http://www.yvr.ca/en/_api/Flights?$filter=FlightScheduledTime gt DateTime'2017-08-24T00:00:00' and FlightScheduledTime lt DateTime'2017-08-25T00:00:00' and FlightType eq 'D'&$orderby=FlightScheduledTime asc

So you should parameterize this, replacing the dates based on the current date, to get all the data in JSON format:

{
odata.metadata: "http://www.yvr.ca/_api/$metadata#Flights",
value: [
{
FlightStatus: "Departed",
FlightRemarksAdjusted: "Departed",
FlightScheduledTime: "2017-08-24T06:15:00",
FlightEstimatedTime: "2017-08-24T06:10:00",
FlightNumber: "WS560",
FlightAirlineName: "WestJet",
FlightAircraftType: "73H",
FlightDeskTo: "",
FlightDeskFrom: "",
FlightCarousel: "",
FlightRange: "D",
FlightCarrier: "WS",
FlightCity: "Calgary",
FlightType: "D",
FlightAirportCode: "YYC",
FlightGate: "B14",
FlightRemarks: "Departed",
FlightID: 226790614,
FlightQuickConnect: ""
},
{
FlightStatus: "Departed",
FlightRemarksAdjusted: "Departed",
FlightScheduledTime: "2017-08-24T06:20:00",
FlightEstimatedTime: "2017-08-24T06:19:00",
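
The parameterization suggested in this answer could be sketched roughly as follows. The endpoint, query syntax, and field names are copied from the URL and JSON quoted in this answer; the function names `build_flights_url` and `rows_from_payload` are hypothetical helpers introduced only for illustration, and whether the endpoint still responds in this form is an assumption.

```python
from datetime import date, timedelta
from urllib.parse import quote, urlencode

# Endpoint taken from the URL quoted in this answer.
API_URL = "http://www.yvr.ca/en/_api/Flights"

# Field names as they appear in the JSON sample in this answer.
FIELDS = (
    "FlightScheduledTime",
    "FlightEstimatedTime",
    "FlightNumber",
    "FlightAirlineName",
    "FlightCity",
    "FlightGate",
    "FlightStatus",
)


def build_flights_url(day, flight_type="D"):
    """Build the query URL for all flights of one type on the given day."""
    start = day.strftime("%Y-%m-%dT00:00:00")
    end = (day + timedelta(days=1)).strftime("%Y-%m-%dT00:00:00")
    params = {
        "$filter": (
            f"FlightScheduledTime gt DateTime'{start}' and "
            f"FlightScheduledTime lt DateTime'{end}' and "
            f"FlightType eq '{flight_type}'"
        ),
        "$orderby": "FlightScheduledTime asc",
    }
    # quote_via=quote encodes spaces as %20 rather than +, as in the URL above.
    return API_URL + "?" + urlencode(params, quote_via=quote)


def rows_from_payload(payload):
    """Flatten a parsed JSON response into one list of values per flight."""
    return [
        [record.get(name, "") for name in FIELDS]
        for record in payload.get("value", [])
    ]


if __name__ == "__main__":
    print(build_flights_url(date.today()))
```

Fetching the URL and passing the decoded JSON to `rows_from_payload` would then give one row per flight without driving a browser at all.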

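Answer 1: (score: 1)

Since you want to fix your script rather than just get the data, here are the issues I found in it.

You scan all tr nodes, but the tr nodes you are interested in have the yvr-flights__row class. Some of those are hidden and contain no data; they carry yvr-flights__row--hidden, so you do not want them.

Also, the second column of the table does not always have data. When it does, it looks like this: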

<td class="yvr-flights__table-cell--revised notranslate">
                        <time class="yvr-flights__label yvr-flights__revised-label yvr-flights__revised-label--early" datetime="2017-08-25T06:30:00-07:00">
                            06:20
                        </time>
            </td>

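So when you use .text on the td, the node itself has no text of its own, but it contains a time node that holds the text. There are several ways to handle this; I used JavaScript to read the content of such nodes: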

driver.execute_script("return arguments[0].textContent.trim();", item)

Putting all of the above together, as below, gets the job done:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.yvr.ca/en/passengers/flights/departing-flights")
wait = WebDriverWait(driver, 10)

item = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "table.yvr-flights__table")))
list_of_data = [
    [
        item.text if item.text else driver.execute_script("return arguments[0].textContent.trim();", item).strip()
        for item in data.find_elements(By.CSS_SELECTOR, 'td')
    ]
    for data in item.find_elements(By.CSS_SELECTOR, 'tr.yvr-flights__row:not(.yvr-flights__row--hidden)')
]

for tab_data in list_of_data:
    print(tab_data)

It gives me the following output:

['02:00', '02:20', 'CX889', 'Cathay Pacific', 'Hong Kong', 'Main', 'D64', 'Departed', '', 'Toggle flight']
['05:15', '', 'PR127', 'Philippine Airlines', 'Manila', 'Main', 'D70', 'Departed', '', 'Toggle flight']
['06:00', '', 'AS964', 'Alaska Airlines', 'Seattle', 'Main', 'E73', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:00', '', 'DL4805', 'Delta Air Lines', 'Seattle', 'Main', 'E90', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:00', '', 'WS3114', 'WestJet', 'Kelowna', 'Main', 'A9', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:00', '', 'AA6045', 'American Airlines', 'Los Angeles', 'Main', 'E86', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:00', '', 'AC100', 'Air Canada', 'Toronto', 'Main', 'C45', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:01', '', 'UA618', 'United Airlines', 'San Francisco', 'Main', 'E76', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:10', '', 'AC8606', 'Air Canada', 'Winnipeg', 'Main', 'C39', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:10', '', 'AC8190', 'Air Canada', 'Kamloops', 'Main', 'C34', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:10', '', 'AC200', 'Air Canada', 'Calgary', 'Main', 'C29', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:15', '', 'WS560', 'WestJet', 'Calgary', 'Main', 'B13', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:20', '', 'AC560', 'Air Canada', 'San Francisco', 'Main', 'E87', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:30', '06:20', 'DL2555', 'Delta Air Lines', 'Minneapolis', 'Main', 'E88', 'Early', 'NOTIFY ME', 'Toggle flight']
['06:30', '', 'WS700', 'WestJet', 'Toronto', 'Main', 'B15', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:30', '', 'UA664', 'United Airlines', 'Chicago', 'Main', 'E75', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:40', '', 'AM695', 'AeroMexico', 'Mexico City', 'Main', 'D53', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:40', '', 'WS6110', 'WestJet', 'Mexico City', 'Main', 'D53', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:45', '06:45', 'AC8055', 'Air Canada', 'Victoria', 'Main', '', 
...
['23:25', '', 'AC8269', 'Air Canada', 'Nanaimo', 'Main', '', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:25', '', 'AM697', 'AeroMexico', 'Mexico City', 'Main', 'D54', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:25', '', 'WS6108', 'WestJet', 'Mexico City', 'Main', 'D54', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:25', '', 'AC8083', 'Air Canada', 'Victoria', 'Main', 'C38', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:25', '', 'AC308', 'Air Canada', 'Montreal', 'Main', 'C29', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:26', '', 'WS564', 'WestJet', 'Montreal', 'Main', 'B13', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:30', '', 'AC128', 'Air Canada', 'Toronto', 'Main', 'C47', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:40', '', 'AC33', 'Air Canada', 'Sydney', 'Main', 'D52', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:45', '', 'AC35', 'Air Canada', 'Brisbane', 'Main', 'D65', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:45', '', 'AC344', 'Air Canada', 'Ottawa', 'Main', 'C49', 'On Time', 'NOTIFY ME', 'Toggle flight']

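Answer 2: (score: 0)

As Tarun Lalwani suggested, WebDriver really is the wrong tool for this task.

The problem is that WebDriver only returns the text of elements that are visible on screen, so to see the data in every row you would need to scroll down through the rows and collect the data one row at a time. This is discussed in "WebElement getText() is an empty string in Firefox if element is not physically visible on the screen". It would be very slow.

I guess you could also read textContent instead of item.text. In Java: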

item.getAttribute("textContent");

I'm sure Python has an equivalent.
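
A possible Python counterpart is sketched below. It relies on the Selenium Python `WebElement.get_attribute` method, which can also return DOM properties such as `textContent`; the helper name `cell_text` is hypothetical and only meant as an illustration.

```python
def cell_text(element):
    """Return an element's visible text, falling back to its textContent.

    `element` is expected to behave like a Selenium WebElement, i.e. to
    offer a `.text` attribute and a `.get_attribute(name)` method.
    """
    visible = (element.text or "").strip()
    if visible:
        return visible
    # .text is empty for content that is not displayed; textContent
    # still includes the text of hidden descendants such as <time>.
    return (element.get_attribute("textContent") or "").strip()
```

In the question's script this would replace `item.text` with `cell_text(item)` inside the list comprehension.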

jsoup is an alternative that can fetch the data in one go and is much faster.