For days I have been trying to scrape a particular page, with no luck. I'm a novice at both scraping and Python.
What I'm really after is the last big table on the page, but it has no ID I can rely on, so I tried scraping all the tables.
I came up with the following code:
import requests
from bs4 import BeautifulSoup

url = "https://www.freecell.net/f/c/personal.html?uname=Giampaolo44&submit=Go"
r = requests.get(url)
r.raise_for_status()
html_content = r.text

soup = BeautifulSoup(html_content, "html.parser")
tables = soup.find_all("table")
for table in tables:
    row_data = []
    for row in table.find_all('tr'):
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        row_data.append(cols)
    print(row_data)
With the code above I get a lot of junk in the printed output (*), and I have been stuck on this for two days.
(*) namely:
['12/155:27\xa0pm8x4\xa05309-6Streak4:07Won12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', '5:27\xa0pm8x4\xa05309-6Streak4:07Won12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', '8x4\xa05309-6Streak4:07Won12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', 'Streak4:07Won12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', '4:07Won12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', 'Won12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', '12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', '5:23\xa0pm8x4\xa013396-6Streak2:49Won', '8x4\xa013396-6Streak2:49Won', 'Streak2:49Won', '2:49Won', 'Won'], ['12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', '5:23\xa0pm8x4\xa013396-6Streak2:49Won', '8x4\xa013396-6Streak2:49Won', 'Streak2:49Won', '2:49Won', 'Won']]
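The repeated, concatenated text is typical of nested tables: when one `<table>` sits inside another, `find_all("td")` on the outer table also matches the inner cells, and `.text` on an outer cell concatenates everything inside it. A minimal sketch (with hypothetical markup, not the real page) showing the effect:

```python
from bs4 import BeautifulSoup

# Hypothetical nested markup: an inner table inside an outer cell.
html = """
<table>
  <tr><td>outer
    <table><tr><td>inner</td></tr></table>
  </td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# The outer cell's .text swallows the inner table's text too.
outer_td = soup.find("td")
print(outer_td.text)  # contains both "outer" and "inner"

# Picking only the innermost (here: last) table avoids the duplication.
inner = soup.find_all("table")[-1]
print(inner.td.text.strip())  # just "inner"
```

This is why scraping every table on a page with nested layout tables produces overlapping rows.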
Answer 0 (score: 2)
If you only want the last table, you can index into the list of table tags:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

url = 'https://www.freecell.net/f/c/personal.html?uname=Giampaolo44&submit=Go'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = bs(r.content, 'lxml')

table = soup.select('table')[-1]  # last table on the page
rows = table.find_all('tr')
output = []
for row in rows:
    cols = row.find_all('td')
    cols = [item.text.strip() for item in cols]
    output.append([item for item in cols if item])

df = pd.DataFrame(output, columns=['Date', 'Time', 'Game', 'Mode', 'Elapsed', 'Won/Lost'])
df = df.iloc[1:]  # drop the empty first row (the header cells are <th>, not <td>)
print(df)
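As an aside, pandas can parse HTML tables directly with `pandas.read_html`, which returns one DataFrame per `<table>` in document order, so the same "take the last one" trick applies. A small self-contained sketch with made-up markup (the real page would be fetched with `requests` first; `read_html` needs `lxml` or `html5lib` installed):

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the fetched page: two tables, the last one
# being the stats table we want, with a proper <th> header row.
html = """
<table><tr><td>skip me</td></tr></table>
<table>
  <tr><th>Date</th><th>Elapsed</th><th>Won/Lost</th></tr>
  <tr><td>12/15</td><td>4:07</td><td>Won</td></tr>
</table>
"""

# read_html parses every <table> into a DataFrame, in document order.
tables = pd.read_html(StringIO(html))
df = tables[-1]  # the last table, as in the answer above
print(df)
```

This skips the manual row/cell loop entirely; the `<th>` row becomes the column header automatically.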