Python web-scraping question: crawling scraped links for further scraping

Date: 2019-08-06 20:53:50

Tags: python dataframe web-scraping beautifulsoup screen-scraping

There is a web page containing a table I want to scrape. The table is several rows long (this can change from time to time) and six columns wide (always). The last column, ["LASTLOT"], is a series of numbers which, after scraping (and filtering), I can append to the end of a link string to generate a new column of links, ["LINK"]. Each of these newly generated links leads to a page with a table that I also want to scrape. How do I iterate over my original list of links to do this further scraping?

I'm new to Python, so I've tried a lot of things, but I don't even know if my syntax is correct...

import pandas as pd
from urllib.request import urlopen  # Python 3 (replaces urllib2 / urllib.urlopen)
from bs4 import BeautifulSoup

url = "[single link inserted here]"  # deleted link for post
soup = BeautifulSoup(urlopen(url), 'lxml')

table = soup.find("table", {"class": "simple_table st highlight_rows"})

# Column headers come straight from the <th> cells.
headers = [th.get_text(strip=True) for th in table.find_all('th')]

# One list of cell texts per row; the header row has no <td> cells
# and would come back empty, so it is skipped.
list_of_rows = []
for row in table.find_all('tr'):
    cells = [cell.get_text(strip=True) for cell in row.find_all('td')]
    if cells:
        list_of_rows.append(cells)

df = pd.DataFrame(list_of_rows, columns=headers)

# The third column is numeric; convert it so it can be filtered.
df.iloc[:, 2] = pd.to_numeric(df.iloc[:, 2])

df6 = df[df.iloc[:, 2] < 53].copy()  # filtered

df6['LINK'] = "insert generic link string here" + df6['LASTLOT']  # deleted link for post

df6.head(10)

I just want to generate more DataFrames from the links produced by the initial scrape.
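One way to approach the follow-up scrape is to loop over the generated LINK column, fetch each page, and parse its table into its own DataFrame, collecting the results in a dict keyed by URL. A minimal sketch follows; the `fetch` function and its inline HTML are hypothetical stand-ins (the real links were deleted from the post), and in practice `fetch` would be `requests.get(url).text` or `urlopen(url).read()`:

```python
import pandas as pd
from bs4 import BeautifulSoup

def scrape_table(html):
    """Parse the first <table> in an HTML document into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in table.find_all("tr")]
    rows = [r for r in rows if r]  # header rows have no <td> cells
    return pd.DataFrame(rows, columns=headers)

# Hypothetical stand-in for a real HTTP fetch, returning dummy HTML;
# with the real site this would download each generated link.
def fetch(url):
    return ("<table><tr><th>A</th><th>B</th></tr>"
            "<tr><td>1</td><td>2</td></tr></table>")

links = ["https://example.com/lot/51", "https://example.com/lot/52"]
frames = {url: scrape_table(fetch(url)) for url in links}
```

With the real data, `links` would be `df6['LINK'].tolist()`, and `frames[url]` then holds one DataFrame per scraped page.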

0 Answers:

There are no answers.