There is a webpage containing a table I want to scrape. The table is a few rows long (this can change from time to time) and always 6 columns wide. The last column ["LASTLOT"] holds a series of numbers that, after scraping (and filtering), I can append to the end of a link string to generate a new column of links ["LINK"]. Each of those newly generated links leads to a page with another table that I also want to scrape. How can I loop through my original list of links to carry on scraping?
I'm new to Python, so I've tried a lot of things, but I don't even know whether my syntax is correct...
import pandas as pd
import numpy as np
import requests
import matplotlib.pyplot as plt
import seaborn as sns
import re
%matplotlib inline
from urllib.request import urlopen  # Python 3 location of urlopen
from bs4 import BeautifulSoup
url = "[single link inserted here]" #deleted link for post
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)
table = soup.find("table",{"class":"simple_table st highlight_rows"})
list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll(["td"]):
        text = cell.text
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)
col_labels = soup.find_all('th')
all_header = []
col_str = str(col_labels)
cleantext2 = BeautifulSoup(col_str, "lxml").get_text()
all_header.append(cleantext2)
df = pd.DataFrame(list_of_rows)
df1 = pd.DataFrame(all_header)
df2 = df1[0].str.split(',', expand=True)
frames = [df2, df]
df3 = pd.concat(frames)
df4 = df3.rename(columns=df3.iloc[0])
df5 = df4.drop(df4.index[0])
df5.rename(columns={'[HEX': 'HEX'},inplace=True)
df5.rename(columns={' LASTLOT]': 'LASTLOT'},inplace=True)
df5.iloc[:,2] = pd.to_numeric(df5.iloc[:,2])
df6 = df5[(df5.iloc[:,2] < 53)].copy() #Filtered; .copy() avoids SettingWithCopyWarning when adding LINK below
df6['LINK'] = ("insert generic link string here" + df6['LASTLOT']) #deleted link for post
df6.head(10)
I just want to generate more dataframes from the links produced by the initial scrape.
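Something along these lines is roughly what I'm aiming for. This is only a sketch reusing the imports above, and it assumes each linked page has a table with the same class name as the first page, which may not be true:

# Rough sketch: loop over the generated links and scrape a table from each page.
# Assumptions: each linked page has a <table> with the same class as the first page,
# and rows of interest contain <td> cells.
scraped_tables = {}  # keyed by LASTLOT so I know which page each frame came from

for lastlot, link in zip(df6['LASTLOT'], df6['LINK']):
    page = urlopen(link)
    page_soup = BeautifulSoup(page, 'lxml')
    page_table = page_soup.find("table", {"class": "simple_table st highlight_rows"})
    if page_table is None:
        continue  # skip pages where no matching table is found
    rows = []
    for row in page_table.findAll('tr'):
        cells = [cell.text for cell in row.findAll('td')]
        if cells:  # skip header-only rows that have no <td> cells
            rows.append(cells)
    scraped_tables[lastlot] = pd.DataFrame(rows)

Is looping over df6['LINK'] like this the right idea, or is there a better way to chain the second round of scraping onto the first?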