Imagine you are on this Twitter page and you have to grab all of its IDs! https://twitter.com/search?l=fr&q=%23metoo%20since%3A2017-11-06%20until%3A2017-11-09&src=typd
I am using Selenium to scroll down until there is nothing left to load, then save all the IDs in a list.
I am afraid my for loop is not saving them. What am I doing wrong?
Answer 0 (score: 0)
I think your problem is the selector li.js-stream-item: it is a bit too broad and matches unwanted elements. This is what I get when selecting on the js-stream-item class:
As you can see, the first element does not contain any of the hrefs you are looking for. Fix this by narrowing the selector:
tweet_selector = 'li.js-stream-item:not(.AdaptiveStreamUserGallery)'
If I try that with your code, I get exactly one ID per tweet:
scrolling down to load more tweets
23 tweets found, 0 total
INFO:root: found 23 ids
INFO:root: next id: 928321203617583104
INFO:root: next id: 928317626031407104
INFO:root: next id: 928268761890803712
INFO:root: next id: 928262618195873793
INFO:root: next id: 928239024682172416
INFO:root: next id: 928220156123385856
INFO:root: next id: 928191036681261057
INFO:root: next id: 927958439153881088
INFO:root: next id: 927957292418465793
INFO:root: next id: 927898097203761153
INFO:root: next id: 927804540031782912
INFO:root: next id: 927799255699476481
INFO:root: next id: 927779606429609984
INFO:root: next id: 927648294016339970
INFO:root: next id: 927590257297682432
INFO:root: next id: 927536130827964416
INFO:root: next id: 927523922534428672
INFO:root: next id: 927521799063130113
INFO:root: next id: 927331391091740672
INFO:root: next id: 927330842304753664
INFO:root: next id: 927330365982892033
INFO:root: next id: 927325770925604865
INFO:root: next id: 927324960175067137
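For reference, each of those IDs is just the last path segment of a tweet's permalink, which is what the .get_attribute('href').split('/')[-1] step in the code below extracts. A minimal sketch of that step in isolation (the URL here is a made-up example):

```python
# A tweet permalink has the form https://twitter.com/<user>/status/<id>;
# splitting on '/' and taking the last segment yields the numeric status ID.
href = 'https://twitter.com/someuser/status/928321203617583104'
tweet_id = href.split('/')[-1]
print(tweet_id)  # 928321203617583104
```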
Another hint: your code will always collect only the last n tweets (the n in ?q=%n), because you keep overwriting the found_tweets list inside the loop. You have to aggregate them:
found_tweets = driver.find_elements_by_css_selector(tweet_selector)
all_tweets = found_tweets[:]
increment = 0
while len(found_tweets) >= increment:
    print('scrolling down to load more tweets')
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(delay)
    found_tweets = driver.find_elements_by_css_selector(tweet_selector)
    all_tweets += found_tweets[:]
    print('{} tweets found, {} total'.format(len(found_tweets), len(ids)))
    increment += 10

for tweet in all_tweets:
    try:
        id = tweet.find_element_by_css_selector(id_selector).get_attribute('href').split('/')[-1]
        ids.append(id)
    except StaleElementReferenceException as e:
        print('lost element reference', tweet)

print(ids)
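One caveat with aggregating like this: each scroll pass re-finds elements that were already collected, so all_tweets (and therefore ids) can contain duplicates. A minimal sketch of de-duplicating the collected IDs while preserving their order (the ids list here is illustrative):

```python
# ids may contain repeats because each scroll pass re-collects visible tweets;
# a set tracks what we have seen, a list preserves first-occurrence order.
ids = ['928321203617583104', '928317626031407104', '928321203617583104']
seen = set()
unique_ids = []
for tweet_id in ids:
    if tweet_id not in seen:
        seen.add(tweet_id)
        unique_ids.append(tweet_id)
print(unique_ids)  # ['928321203617583104', '928317626031407104']
```

On Python 3.7+ the same result can be had with list(dict.fromkeys(ids)), since dicts preserve insertion order.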