Question

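I need to scrape a web page (https://www304.americanexpress.com/credit-card/compare), but I've run into a problem: the text I need on the front page is completely buried inside many different formatting tags.

I know how to scrape a regular page with Beautiful Soup, but that doesn't give me what I want (i.e. text is missing, some tags come through...)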

import requests
from bs4 import BeautifulSoup
from collections import Counter


urls = ['https://www304.americanexpress.com/credit-card/compare']

with open('thisisanew.txt', 'w', encoding='utf-8') as outfile:
    for url in urls:
        website = requests.get(url)
        soup = BeautifulSoup(website.content)
        text = [''.join(s.findAll(text=True))for s in soup.findAll('p')]
        for item in text:
            print (''.join([element.text for element in soup.body.find_all(lambda tag: tag != 'script', recursive=False)]))

Is there a special way to scrape this particular web page?

Answer 1

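This is just a regular web page. For example, <span class="card-offer-des"> contains the text "after you use your new Card to make $1,000 in purchases within the first 3 months." I also tried turning off JavaScript in the browser, and the text should still be there.

So I don't really see the problem. Also, I'd suggest learning lxml and XPath. Once you know how they work, it's actually easier to get the text you want.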
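As a rough illustration of the lxml and XPath suggestion, here is a minimal sketch. It assumes lxml is installed, and it runs against a small hypothetical HTML fragment modelled on the span quoted above rather than the live page, whose real markup may differ. The function name `offer_texts` and the `SAMPLE` fragment are made up for this example.

```python
from lxml import html

# Hypothetical fragment modelled on the span quoted in this answer;
# the real page is larger and its markup may differ.
SAMPLE = """
<html><body>
  <div class="card">
    <span class="card-offer-des">after you use your new Card to make
      <b>$1,000</b> in purchases within the first 3 months.</span>
  </div>
  <script>var ignored = 1;</script>
</body></html>
"""


def offer_texts(page_source):
    """Return the text of every span whose class is exactly card-offer-des."""
    tree = html.fromstring(page_source)
    spans = tree.xpath('//span[@class="card-offer-des"]')
    # text_content() joins the text of nested tags (here the <b> element);
    # split() followed by join() collapses runs of whitespace and newlines.
    return [" ".join(span.text_content().split()) for span in spans]


print(offer_texts(SAMPLE))
```

For the live page, the same function could be given `requests.get(url).content` instead of `SAMPLE`, provided the text really is present in the served HTML as this answer reports.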

Answer 2

The Python code you should try is:

# Inside the loop over the strings pulled from the site:
if "what-have-you" not in StringPulledFromSite:
    continue  # skip strings that do not contain the marker
# The marker is present here.
[your code to save to the filesystem]

The strings you should target look like this:

((<span class=\") && (/>))

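You should try to find both (and try to be specific, so that you can easily tell them apart). Once you have found both, save the string, test it, and save the text.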
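The marker-based idea can be sketched with plain string searching. This is only an illustration under some assumptions: the helper `text_between` and the sample string are hypothetical, and the closing marker is taken to be `</span>`, since that is how a span element normally ends. Nested tags inside the span would be left in the result, so a real HTML parser is usually more robust.

```python
def text_between(source, open_marker, close_marker):
    """Return every substring found between open_marker and close_marker."""
    results = []
    pos = 0
    while True:
        start = source.find(open_marker, pos)
        if start == -1:
            break  # no further opening marker
        start += len(open_marker)
        end = source.find(close_marker, start)
        if end == -1:
            break  # opening marker without a matching close
        results.append(source[start:end].strip())
        pos = end + len(close_marker)
    return results


# Hypothetical sample; the markers are chosen to be specific enough
# to tell the wanted spans apart from other markup.
SAMPLE = (
    '<p><span class="card-offer-des">Earn points.</span> '
    '<span class="card-offer-des">No annual fee.</span></p>'
)

print(text_between(SAMPLE, '<span class="card-offer-des">', '</span>'))
```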

Scrape all text on a web page hidden inside markup tags in Python 3
