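I built a web scraper, and when I print the results from an instance variable, the "td" elements are not stripped out. How do I remove them? I tried
cols = [item.replace("'<td>", "") for item in cols]
but that did not work.
The code is below: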
def __init__(self):
    pages = range(1, 3000, 1)
    self.url = 'https://marknadssok.fi.se/publiceringsklient?Page={}'.format(pages)

def scrape_site(self):
    # All columns
    self.datum = []
    # Establish connection
    r = requests.get(self.url)
    html = BeautifulSoup(r.content, "html.parser")
    # Append each column to its attribute
    table_body = html.find('tbody')
    rows = table_body.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [x.text.strip() for x in cols]
        self.datum.append(row('td')[0:1])
    print(self.datum)
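I have a few more instance variables, but I have not included them here. The way I append items was inspired by a post here in which someone used a similar approach when scraping from twitter.api.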
Answer 0: (score: 0)
Something like this?
from bs4 import BeautifulSoup
import requests


class Test(object):
    def __init__(self):
        pages = range(1, 3, 1)
        # Note: format(pages) inserts the repr of the whole range
        # (e.g. "range(1, 3)" on Python 3), not a single page number.
        self.url = 'https://marknadssok.fi.se/publiceringsklient?Page={}'.format(pages)
        print(pages)

    def scrape_site(self):
        # All columns
        self.datum = []
        # Establish connection
        r = requests.get(self.url)
        html = BeautifulSoup(r.content, "html.parser")
        # Append each column to its attribute
        table_body = html.find('tbody')
        rows = table_body.find_all('tr')
        # print('Row:', rows)
        for row in rows:
            # print("ROW: ", row)
            cols = row.find_all('td')
            # for td in cols:
            #     print('COLS:', td.text)
            cols = [x.text.strip() for x in cols]
            # print("COLS2:", cols)
            # Append the stripped text, not the Tag objects from row('td')
            self.datum.append(cols[0:1])
        print(self.datum)


def main():
    t = Test()
    t.scrape_site()


if __name__ == "__main__":
    main()
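
To see why the question's row('td')[0:1] still prints the markup, here is a small offline sketch. It assumes a made-up SAMPLE_HTML string in place of the real page, and the helper names first_column and page_urls are my own, not part of the original code. It also builds one URL per page number, on the assumption that the Page query parameter expects a single integer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for one page of the real table.
SAMPLE_HTML = """
<table><tbody>
  <tr><td> 2019-01-02 </td><td>Issuer A</td></tr>
  <tr><td> 2019-01-03 </td><td>Issuer B</td></tr>
</tbody></table>
"""

BASE_URL = 'https://marknadssok.fi.se/publiceringsklient?Page={}'


def page_urls(first, last):
    """Build one URL per page number (last is exclusive, as in range)."""
    return [BASE_URL.format(page) for page in range(first, last)]


def first_column(html_text):
    """Return the stripped text of the first <td> in every table row."""
    soup = BeautifulSoup(html_text, "html.parser")
    values = []
    for row in soup.find('tbody').find_all('tr'):
        cells = row.find_all('td')
        if cells:  # skip rows that have no data cells
            values.append(cells[0].text.strip())
    return values


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
first_row = soup.find('tr')
tags = first_row('td')[0:1]        # Tag objects: printing them shows the <td> markup
print(tags)
print(first_column(SAMPLE_HTML))   # plain strings, no markup
print(page_urls(1, 3))
```

Calling a row like row('td') is shorthand for row.find_all('td'), so it returns Tag objects again and the earlier .text.strip() step is bypassed. Note that first_column returns a flat list of strings, whereas the answer's cols[0:1] appends one-element lists; use whichever shape suits the rest of the class.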