For days I have been trying to scrape a particular page, with no luck. I'm a novice at both scraping and Python.
What I'm really after is the last big table on the page, but it has no ID I can rely on, so I tried scraping all the tables.
I came up with the following code:
import requests
from bs4 import BeautifulSoup

url = "https://www.freecell.net/f/c/personal.html?uname=Giampaolo44&submit=Go"
r = requests.get(url)
r.raise_for_status()
html_content = r.text

soup = BeautifulSoup(html_content, "html.parser")
tables = soup.find_all("table")
for table in tables:
    row_data = []
    for row in table.find_all('tr'):
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        row_data.append(cols)
    print(row_data)
With the code above I get a lot of junk in the printed output (*), and I have been stuck on this for two days.
(*) namely:
['12/155:27\xa0pm8x4\xa05309-6Streak4:07Won12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', '5:27\xa0pm8x4\xa05309-6Streak4:07Won12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', '8x4\xa05309-6Streak4:07Won12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', 'Streak4:07Won12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', '4:07Won12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', 'Won12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', '12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', '5:23\xa0pm8x4\xa013396-6Streak2:49Won', '8x4\xa013396-6Streak2:49Won', 'Streak2:49Won', '2:49Won', 'Won'], ['12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', '5:23\xa0pm8x4\xa013396-6Streak2:49Won', '8x4\xa013396-6Streak2:49Won', 'Streak2:49Won', '2:49Won', 'Won']]
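The repeated, concatenated text is typical of nested tables: when one `<table>` sits inside another, `find_all("td")` on the outer table also matches the inner cells, and `.text` on an outer cell concatenates everything inside it. A minimal sketch (with hypothetical markup, not the real page) showing the effect:

```python
from bs4 import BeautifulSoup

# Hypothetical nested markup: an inner table inside an outer cell.
html = """
<table>
  <tr><td>outer
    <table><tr><td>inner</td></tr></table>
  </td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# The outer cell's .text swallows the inner table's text too.
outer_td = soup.find("td")
print(outer_td.text)  # contains both "outer" and "inner"

# Picking only the innermost (here: last) table avoids the duplication.
inner = soup.find_all("table")[-1]
print(inner.td.text.strip())  # just "inner"
```

This is why scraping every table on a page with nested layout tables produces overlapping rows.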
Answer 0 (score: 2)
If you only want the last table, you can index into the list of table tags:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

url = 'https://www.freecell.net/f/c/personal.html?uname=Giampaolo44&submit=Go'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = bs(r.content, 'lxml')

table = soup.select('table')[-1]  # last table on the page
rows = table.find_all('tr')
output = []
for row in rows:
    cols = row.find_all('td')
    cols = [item.text.strip() for item in cols]
    output.append([item for item in cols if item])

df = pd.DataFrame(output, columns=['Date', 'Time', 'Game', 'Mode', 'Elapsed', 'Won/Lost'])
df = df.iloc[1:]  # drop the empty first row (the header cells are <th>, not <td>)
print(df)
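As an aside, pandas can parse HTML tables directly with `pandas.read_html`, which returns one DataFrame per `<table>` in document order, so the same "take the last one" trick applies. A small self-contained sketch with made-up markup (the real page would be fetched with `requests` first; `read_html` needs `lxml` or `html5lib` installed):

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the fetched page: two tables, the last one
# being the stats table we want, with a proper <th> header row.
html = """
<table><tr><td>skip me</td></tr></table>
<table>
  <tr><th>Date</th><th>Elapsed</th><th>Won/Lost</th></tr>
  <tr><td>12/15</td><td>4:07</td><td>Won</td></tr>
</table>
"""

# read_html parses every <table> into a DataFrame, in document order.
tables = pd.read_html(StringIO(html))
df = tables[-1]  # the last table, as in the answer above
print(df)
```

This skips the manual row/cell loop entirely; the `<th>` row becomes the column header automatically.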