Selenium won't scrape Twitter IDs

Time: 2017-11-12 16:13:13

Tags: python selenium twitter

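Imagine you are on this Twitter page and need to collect all of its tweet IDs: https://twitter.com/search?l=fr&q=%23metoo%20since%3A2017-11-06%20until%3A2017-11-09&src=typd

I'm using Selenium to scroll down until nothing more loads, and then saving all the IDs in a list.

I'm afraid my for loop isn't saving them. What am I doing wrong?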


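1 answer:

Answer 0 (score: 0):

I think your problem is that the selector li.js-stream-item is a bit too broad and matches elements you don't want. This is what I get when selecting by the js-stream-item class: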

screenshot

As you can see, the first element doesn't contain any of the hrefs you're looking for. Fix this by narrowing the selector:

tweet_selector = 'li.js-stream-item:not(.AdaptiveStreamUserGallery)'
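
To check what the narrower selector actually excludes, one option is to count the matches of both selectors on the loaded page. This is a minimal sketch, not part of the original code: the helper names count_matches and extra_matches are hypothetical, and driver is assumed to be a WebDriver instance that already has the search page open. It uses the generic find_elements(by, value) call, which is available in both older and newer Selenium releases.

```python
# Locator strategy string; identical to Selenium's By.CSS_SELECTOR constant.
CSS_SELECTOR = "css selector"

BROAD = "li.js-stream-item"
NARROW = "li.js-stream-item:not(.AdaptiveStreamUserGallery)"


def count_matches(driver, selector):
    """Return how many elements on the current page match the CSS selector."""
    return len(driver.find_elements(CSS_SELECTOR, selector))


def extra_matches(driver):
    """Number of elements the broad selector matches but the narrow one excludes."""
    return count_matches(driver, BROAD) - count_matches(driver, NARROW)
```

If extra_matches(driver) returns a value above zero, the broad selector is picking up items that are not tweets.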

If I try that with your code, I get one ID per tweet:

scrolling down to load more tweets
23 tweets found, 0 total
INFO:root:    found 23 ids
INFO:root:    next id: 928321203617583104
INFO:root:    next id: 928317626031407104
INFO:root:    next id: 928268761890803712
INFO:root:    next id: 928262618195873793
INFO:root:    next id: 928239024682172416
INFO:root:    next id: 928220156123385856
INFO:root:    next id: 928191036681261057
INFO:root:    next id: 927958439153881088
INFO:root:    next id: 927957292418465793
INFO:root:    next id: 927898097203761153
INFO:root:    next id: 927804540031782912
INFO:root:    next id: 927799255699476481
INFO:root:    next id: 927779606429609984
INFO:root:    next id: 927648294016339970
INFO:root:    next id: 927590257297682432
INFO:root:    next id: 927536130827964416
INFO:root:    next id: 927523922534428672
INFO:root:    next id: 927521799063130113
INFO:root:    next id: 927331391091740672
INFO:root:    next id: 927330842304753664
INFO:root:    next id: 927330365982892033
INFO:root:    next id: 927325770925604865
INFO:root:    next id: 927324960175067137
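
The IDs above are the last path segment of each tweet's href. A slightly more defensive way to pull the ID out could look like the sketch below. It assumes the permalink has the form https://twitter.com/<user>/status/<id>, which is an assumption about the page, and tweet_id_from_href is a hypothetical helper name.

```python
from urllib.parse import urlparse


def tweet_id_from_href(href):
    """Return the numeric status ID from a tweet permalink, or None."""
    if not href:
        return None
    # Work on the path only, so a query string or fragment cannot interfere.
    parts = urlparse(href).path.rstrip("/").split("/")
    # Expect the path to end in .../status/<digits>.
    if len(parts) >= 2 and parts[-2] == "status" and parts[-1].isdigit():
        return parts[-1]
    return None
```

Returning None for anything that does not look like a status link makes it easy to skip stream items that are not tweets.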

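Another hint: your code will always collect only the tweets from the last lookup, because you overwrite the found_tweets list on every pass through the loop. You have to aggregate them: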

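import time

from selenium.common.exceptions import StaleElementReferenceException

# driver, delay, tweet_selector, id_selector and ids are defined in the question's code
found_tweets = driver.find_elements_by_css_selector(tweet_selector)
all_tweets = found_tweets[:]
increment = 0

while len(found_tweets) >= increment:
    print('scrolling down to load more tweets')
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(delay)
    found_tweets = driver.find_elements_by_css_selector(tweet_selector)
    # each lookup returns every matching element on the page, so add only the new ones
    all_tweets += [tweet for tweet in found_tweets if tweet not in all_tweets]

    print('{} tweets found, {} total'.format(len(found_tweets), len(all_tweets)))
    increment += 10


for tweet in all_tweets:
    try:
        # the tweet ID is the last path segment of the permalink
        tweet_id = tweet.find_element_by_css_selector(id_selector).get_attribute('href').split('/')[-1]
        ids.append(tweet_id)
    except StaleElementReferenceException:
        print('lost element reference', tweet)

print(ids)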
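
An alternative that sidesteps stale references is to read the IDs while scrolling and keep them in a set, so nothing has to be looked up again later. The sketch below is one possible variant rather than the code from the question: collect_ids and max_idle_rounds are hypothetical names, the stop condition (no new IDs for a couple of rounds) is swapped in for the increment counter, and the generic find_elements(by, value) form is used because the find_elements_by_* helpers were removed in newer Selenium 4 releases.

```python
import time

try:
    from selenium.common.exceptions import WebDriverException
except ImportError:  # lets the sketch be read and tested without Selenium installed
    WebDriverException = Exception

CSS_SELECTOR = "css selector"  # same value as Selenium's By.CSS_SELECTOR


def collect_ids(driver, tweet_selector, id_selector, delay=2.0, max_idle_rounds=2):
    """Scroll until no new IDs show up; return the IDs in first-seen order."""
    ids = []
    seen = set()
    idle_rounds = 0
    while idle_rounds < max_idle_rounds:
        new_this_round = 0
        for tweet in driver.find_elements(CSS_SELECTOR, tweet_selector):
            try:
                link = tweet.find_element(CSS_SELECTOR, id_selector)
                href = link.get_attribute("href") or ""
            except WebDriverException:
                # Stale or missing element: skip it, it may reappear next round.
                continue
            tweet_id = href.rstrip("/").split("/")[-1]
            if tweet_id.isdigit() and tweet_id not in seen:
                seen.add(tweet_id)
                ids.append(tweet_id)
                new_this_round += 1
        # Reset the idle counter whenever this round produced something new.
        idle_rounds = 0 if new_this_round else idle_rounds + 1
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(delay)
    return ids
```

A call might look like ids = collect_ids(driver, tweet_selector, id_selector, delay=delay), reusing the names from the code above. The delay and the number of idle rounds are guesses that would need tuning against how quickly the page loads more tweets.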