Question

I extracted a table from a BeautifulSoup object that starts with:

<html><body><p>{"datasets":{"cf":"</p><table class="fs-table" id="cf-table">\n                    <tbody>\n                        <tr class="thead"><td></td><td>...

When I try to convert the table into a DataFrame, the "\n" sequences mess up my table.

I tried:

soup = BeautifulSoup(res.content,'lxml')
cleanSoup = BeautifulSoup(str(soup).replace("\n                    ", ""))
table = cleanSoup.find_all('table')[0]

But it doesn't work. Any ideas on how to get rid of the \n? Thank you.

Answer 1

Try using the re module:

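import re
from bs4 import BeautifulSoup

# Match a real newline or a literal backslash-n, plus any spaces that follow
rx = re.compile(r"(?:\\n|\n) *")
soup = BeautifulSoup(res.content, 'lxml')  # res: the response from the question
cleanSoup = BeautifulSoup(rx.sub("", str(soup)), 'lxml')
table = cleanSoup.find_all('table')[0]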
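
Whether a pattern like this helps depends on what the "\n" in the response actually is. If it is a two-character escape sequence (a backslash followed by n, as it would appear inside a JSON string) rather than a real newline, then "\n" in a Python string or regex never matches it. That is an assumption about the data; the question does not confirm it. A minimal sketch that strips both forms, using a hypothetical helper strip_newline_runs and made-up sample strings:

```python
import re

# Matches either a real newline or the literal two characters "\" + "n",
# followed by any run of spaces.
NEWLINE_RUN = re.compile(r"(?:\\n|\n) *")


def strip_newline_runs(markup: str) -> str:
    """Remove real and escaped newlines plus the indentation after them."""
    return NEWLINE_RUN.sub("", markup)


# Invented samples: the first holds escaped "\n" (raw string),
# the second holds real newline characters.
escaped = r'<table id="cf-table">\n        <tbody>\n            <tr><td>1</td></tr></tbody></table>'
real = '<table id="cf-table">\n        <tbody>\n            <tr><td>1</td></tr></tbody></table>'

cleaned_escaped = strip_newline_runs(escaped)
cleaned_real = strip_newline_runs(real)
print(cleaned_escaped == cleaned_real)
```

Both inputs collapse to the same single-line markup, which can then be handed to BeautifulSoup as in the snippet above.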

Answer 2

First split the data on '\n', strip the whitespace from each piece, then join the pieces back together.

from bs4 import BeautifulSoup

# Not a raw string, so each \n below is a real newline character
htmldata='''<html><body><p>{"datasets":{"cf":"</p>
<table class="fs-table" id="cf-table">\n                    <tbody>\n                        <tr class="thead"><td></td><td>...'''

# Split on newlines, trim each piece, and glue the pieces back together
htmldata="".join(item.strip() for item in htmldata.split("\n"))

soup = BeautifulSoup(htmldata,'lxml')
table = soup.find_all('table')[0]
print(table)

Output:

<table class="fs-table" id="cf-table"><tbody><tr class="thead"><td></td><td>...</td></tr></tbody></table>

Hope this helps.
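
The snippet in the question starts with {"datasets":{"cf":", so the response looks like JSON whose values are HTML strings. If that is the case (an assumption; the full payload is not shown), decoding the JSON first turns the escaped "\n" into real newlines, and the split/strip/join step then works on the table markup alone. A sketch with an invented payload; the key names datasets and cf come from the question, the table content is made up:

```python
import json

# Invented payload mimicking the shape shown in the question.
# Raw string: the backslashes are kept so that the JSON parser sees \" and \n.
raw = r'{"datasets": {"cf": "<table class=\"fs-table\" id=\"cf-table\">\n    <tbody>\n        <tr><td>Revenue</td><td>100</td></tr>\n    </tbody>\n</table>"}}'

payload = json.loads(raw)              # unescapes \" and \n inside the strings
table_html = payload["datasets"]["cf"]

# The newlines are real now; drop them together with the indentation.
compact = "".join(line.strip() for line in table_html.splitlines())
print(compact)
```

With a requests response, res.json() would play the role of json.loads(raw) here, and compact can be passed to BeautifulSoup(compact, 'lxml') as in the answers above.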

Cleaning a table containing \n in a BeautifulSoup object
