How to remove HTML elements from an instance variable

Time: 2019-03-24 10:02:32

Tags: python python-3.x

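I built a web scraper, and when I print the results stored in an instance variable, the "td" elements are not stripped out. How do I remove them? I tried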

cols = [item.replace("'<td>", "") for item in cols]

but that did not work.

The code is as follows:

def __init__(self):
    pages = range(1, 3000, 1)
    self.url = 'https://marknadssok.fi.se/publiceringsklient?Page={}'.format(pages)

def scrape_site(self):
    #All Columns
    self.datum = []

    #Establish connection
    r = requests.get(self.url)
    html = BeautifulSoup(r.content, "html.parser")

    #Append each column to its attribute
    table_body=html.find('tbody')
    rows = table_body.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [x.text.strip() for x in cols]
        self.datum.append(row('td')[0:1])
    print(self.datum)

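I have several more instance variables, but I have not included them here. The way I append values was inspired by a post on this site where someone used a similar approach when scraping from twitter.api.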

1 Answer:

Answer 0 (score: 0)

Something like this?
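As far as I can tell, the `<td>` markup shows up because the question appends `row('td')[0:1]`, which is a list of BeautifulSoup Tag objects, instead of the already cleaned strings in `cols`. The `replace` call probably had no effect for the same reason: by then `cols` holds plain text with no `<td>` in it, and the appended value never comes from `cols`. Here is a minimal sketch of the difference. It runs on a made-up table row (the sample HTML is my assumption, not the real page) so no network access is needed:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for one table row of the real page
html = "<table><tbody><tr><td> 2019-03-22 </td><td>Example AB</td></tr></tbody></table>"
row = BeautifulSoup(html, "html.parser").find("tr")

# row('td') is shorthand for row.find_all('td'): it returns Tag objects,
# so printing them shows the surrounding <td> markup
as_tags = row("td")[0:1]

# Taking .text from each Tag and stripping whitespace gives plain strings
as_text = [cell.text.strip() for cell in row.find_all("td")][0:1]

print(as_tags)
print(as_text)
```

The full class below makes the same change: it appends `cols[0:1]` rather than `row('td')[0:1]`.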

from bs4 import BeautifulSoup
import requests

class Test(object):
    def __init__(self):
        # Pages to fetch; each page number is substituted into the URL template
        self.pages = range(1, 3)
        self.url = 'https://marknadssok.fi.se/publiceringsklient?Page={}'

    def scrape_site(self):
        # All columns
        self.datum = []

        for page in self.pages:
            # Establish connection
            r = requests.get(self.url.format(page))
            html = BeautifulSoup(r.content, "html.parser")

            # Append each column to its attribute
            table_body = html.find('tbody')
            rows = table_body.find_all('tr')

            for row in rows:
                cols = row.find_all('td')
                # Keep only the text of each cell, without the <td> tags
                cols = [x.text.strip() for x in cols]
                # Append the cleaned strings, not the Tag objects from row('td')
                self.datum.append(cols[0:1])
        print(self.datum)

def main():
    t = Test()
    t.scrape_site()


if __name__ == "__main__":
    main()
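
A note on the `Page={}` placeholder: calling `.format()` with a `range` object inserts the text `range(1, 3)` into the URL rather than a page number, so each page needs its own URL. A minimal sketch, assuming the site takes a plain page number in `Page` (the helper name `page_urls` is made up for illustration):

```python
# URL template taken from the post
URL_TEMPLATE = 'https://marknadssok.fi.se/publiceringsklient?Page={}'


def page_urls(first, last):
    """Return one URL per page number from first to last, inclusive."""
    return [URL_TEMPLATE.format(page) for page in range(first, last + 1)]


# Formatting with the range object itself inserts its text form, not a number
print(URL_TEMPLATE.format(range(1, 3)))

# Formatting once per page number gives one usable URL per page
for url in page_urls(1, 2):
    print(url)
```

Each of these URLs can then be passed to `requests.get` in turn.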