I am parsing HTML markup that carries inline CSS styles.
<table class="cmp-ratings-expanded">
<span class="cmp-Rating-on" style="width: 60.0%;"></span>
<td>Job Work</td>
<span class="cmp-Rating-on" style="width: 80.0%;"></span>
<td>Compensation</td>
</table>
<table class="cmp-ratings-expanded">
<span class="cmp-Rating-on" style="width: 100.0%;"></span>
<td>Job Work</td>
<span class="cmp-Rating-on" style="width: 40.0%;"></span>
<td>Compensation</td>
</table>
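I need to extract the following numbers: 60, 80, 100 and 40, so that I can write them to a CSV file.
I tried
rates = soup.find_all('table', {'class':['cmp-ratings-expanded']}).find_all("span", style=True)
for rate in rates:
    rate = re.match( r'width: (\d+)', rate["style"])
on the source, but found that only the numbers 60 and 80 were parsed. Because of Beautiful Soup's find() method, the remaining numbers (100, 40)
were not parsed.
Ultimately, I need to write to a CSV file. Because of the for loop, this is the result I get from the code above: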
|60|
|100|
Code for writing to the CSV:
with open('some_file.csv', 'w+') as file:
    file.write(rate)
The goal is to parse every width: 80.0%; style value and write it to the CSV in rows:
|Job Work|Compensation|
|60|80|
|100|40|
Answer 0: (score: 0)
Here is one approach.
For example:
import re
import csv
from bs4 import BeautifulSoup
html = """<table class="cmp-ratings-expanded">
<span class="cmp-Rating-on" style="width: 60.0%;"></span>
<td>Job Work</td>
<span class="cmp-Rating-on" style="width: 80.0%;"></span>
<td>Compensation</td>
</table>
<table class="cmp-ratings-expanded">
<span class="cmp-Rating-on" style="width: 100.0%;"></span>
<td>Job Work</td>
<span class="cmp-Rating-on" style="width: 40.0%;"></span>
<td>Compensation</td>
</table>"""
soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all('table', {'class':['cmp-ratings-expanded']})
result = []
for table in tables:
    temp = []
    for span in table.find_all("span", style=True):
        rate = re.match(r'width: (\d+)', span["style"])
        temp.append(rate.group(1))  # Integer part of the width, e.g. "60"
    result.append(temp)
# Write to CSV
filename = "output.csv"  # assumed name; the original answer leaves filename undefined
with open(filename, "w", newline="") as csvfile:
    writer = csv.writer(csvfile, delimiter="|")
    # Write header
    writer.writerow(["Job Work", "Compensation"])
    writer.writerows(result)
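To see why the attempt in the question stopped after the first table, it may help to compare find() and find_all() side by side. The following is a minimal sketch, assuming the same HTML as the question and the html.parser backend; the helper name widths is hypothetical and only used for illustration.

```python
import re
from bs4 import BeautifulSoup

html = """<table class="cmp-ratings-expanded">
<span class="cmp-Rating-on" style="width: 60.0%;"></span>
<td>Job Work</td>
<span class="cmp-Rating-on" style="width: 80.0%;"></span>
<td>Compensation</td>
</table>
<table class="cmp-ratings-expanded">
<span class="cmp-Rating-on" style="width: 100.0%;"></span>
<td>Job Work</td>
<span class="cmp-Rating-on" style="width: 40.0%;"></span>
<td>Compensation</td>
</table>"""

soup = BeautifulSoup(html, "html.parser")


def widths(tag):
    """Hypothetical helper: integer part of each styled span's width, as strings."""
    return [
        re.match(r"width: (\d+)", span["style"]).group(1)
        for span in tag.find_all("span", style=True)
    ]


# find() returns only the first matching <table>, so only its spans are visited.
first_only = widths(soup.find("table", class_="cmp-ratings-expanded"))

# find_all() returns every matching <table>; each one is queried separately.
per_table = [
    widths(table)
    for table in soup.find_all("table", class_="cmp-ratings-expanded")
]

print(first_only)  # ['60', '80']
print(per_table)   # [['60', '80'], ['100', '40']]
```

Note that find_all() returns a list-like ResultSet, so a second find_all() cannot be chained onto it directly; iterating over the tables, as above, gives one row per table.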
Answer 1: (score: 0)
You can use CSS selectors to parse the HTML:
import re
from bs4 import BeautifulSoup
data = '''<table class="cmp-ratings-expanded">
<span class="cmp-Rating-on" style="width: 60.0%;"></span>
<td>Job Work</td>
<span class="cmp-Rating-on" style="width: 80.0%;"></span>
<td>Compensation</td>
</table>
<table class="cmp-ratings-expanded">
<span class="cmp-Rating-on" style="width: 100.0%;"></span>
<td>Job Work</td>
<span class="cmp-Rating-on" style="width: 40.0%;"></span>
<td>Compensation</td>
</table>'''
soup = BeautifulSoup(data, 'lxml')
r_number = re.compile(r'(\d+)\.?\d*%')
with open('out.csv', 'w') as f_out:
    f_out.write('|Job Work|Compensation|\n')
    for job, compensation in zip(soup.select('span[style]:has(+:contains("Job Work"))'),
                                 soup.select('span[style]:has(+:contains("Compensation"))')):
        job_number = r_number.search(job['style'])[1]
        compensation_number = r_number.search(compensation['style'])[1]
        f_out.write('|' + '|'.join([job_number, compensation_number]) + '|\n')
The file out.csv contains:
|Job Work|Compensation|
|60|80|
|100|40|
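The leading and trailing pipes in this output come from building each line by hand. As a rough comparison, the sketch below writes the same rows with csv.writer and with manual joining, using only the standard library; the rows are copied from the example above, and the helper name pipe_line is hypothetical.

```python
import csv
import io

header = ["Job Work", "Compensation"]
rows = [["60", "80"], ["100", "40"]]  # values copied from the example output

# csv.writer places the delimiter only between fields, never at the line ends.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="|", lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
csv_style = buf.getvalue()


def pipe_line(fields):
    """Hypothetical helper: wrap the joined fields in leading and trailing pipes."""
    return "|" + "|".join(fields) + "|\n"


# Manual formatting reproduces the |a|b| layout requested in the question.
piped = "".join(pipe_line(fields) for fields in [header] + rows)

print(csv_style)
print(piped)
```

If the exact |60|80| layout matters, manual joining is the simpler route; if a plain pipe-delimited file is enough, csv.writer also handles quoting of fields that happen to contain the delimiter.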