Question

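For the past 7 hours I've been trying to scrape data for a project. Yes, it has to be done without an API. It's been a war of attrition, but this code keeps returning nan. Am I missing something simple? Near the bottom of the page are all the stories featured on the front page: small cards with an image, 3 article titles, and their corresponding links. It either grabs nothing or grabs the completely wrong thing. There should be about 35 cards with 3 links each, for 105 articles. So far I've gotten it to recognize 27 cards, with a lot of nans instead of strings, and none of the individual articles.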

import csv, requests, re, json
from bs4 import BeautifulSoup

handle = 'http://www.'
location = 'ny'
ping = handle + location + 'times.com'
pong = requests.get(ping, headers = {'User-agent': 'Gordon'})
soup = BeautifulSoup(pong.content, 'html.parser')

# upper cards attempt
for i in soup.find_all('div', {'class':'css-ki19g7 e1aa0s8g0'}):
    print(i.a.get('href'))
    print(i.a.text)
    print('')

# lower cards attempt
count = 0
for i in soup.find_all('div', {"class":"css-1ee8y2t assetWrapper"}):
    try:
        print(i.a.get('href'))
        count+=1
    except AttributeError:  # card has no <a> tag, so i.a is None
        pass
print('current card pickup: ', count)
print('the goal card pickup:', 35)

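Everything clickable uses "css-1ee8y2t assetWrapper", but when I run find_all on it I only get 27 of them. I wanted to start from css-guaa7h and work my way down, but it only returns nans. Other promising but fruitless divs are: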

div class="css-2imjyh" data-testid="block-Well" data-block-tracking-id="Well"
div class="css-a11566"
div class="css-guaa7h"
div class="css-zygc9n"
div data-testid="lazyimage-container" # for images

Current attempt:

h3 class="css-1d654v4">Politics

I'm running out of hope. Why is just landing a first job harder than the hard work itself?

Answer 1

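I checked their website, and it loads the articles with ajax as soon as you scroll down. You will probably have to use Selenium. Here is an answer that may help you get there: https://stackoverflow.com/a/21008335/7933710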
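
A rough sketch of what that could look like, not a tested solution for this particular site. It assumes Selenium with Chrome and a matching driver are installed. The helper names `render_page` and `extract_card_links`, the scroll count, and the pause length are made up for illustration, and the class name `assetWrapper` is simply taken from the question. Note also that `i.a` in the question only returns the first link inside each div, so this sketch uses `find_all('a', href=True)` to collect every link in a card.

```python
from bs4 import BeautifulSoup


def render_page(url, scrolls=5, pause=2.0):
    """Load url in a real browser, scroll so lazy content loads, return the HTML."""
    # Imported here so the parsing helper below works without Selenium installed.
    import time
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are available
    try:
        driver.get(url)
        for _ in range(scrolls):  # scroll count is a guess; tune it for the page
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(pause)  # crude fixed wait for the ajax requests to finish
        return driver.page_source
    finally:
        driver.quit()


def extract_card_links(html, card_class="assetWrapper"):
    """Return one list of (title, href) pairs per matching card div in html."""
    soup = BeautifulSoup(html, "html.parser")
    cards = []
    for card in soup.find_all("div", class_=card_class):
        # find_all collects every link in the card, not just the first one
        links = [(a.get_text(strip=True), a["href"])
                 for a in card.find_all("a", href=True)]
        cards.append(links)
    return cards
```

You would then call something like `extract_card_links(render_page('http://www.nytimes.com'))` and check whether the card count gets closer to the 35 you expect. Matching on the single class `assetWrapper` rather than the full `css-1ee8y2t assetWrapper` string is a deliberate choice here, since the `css-...` part looks auto-generated and may change.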

Can't get Beautiful Soup to return correct article titles, links, and imgs. Help debugging?
