Parsing several CSS style values from HTML and writing them to rows of a CSV file

Date: 2019-07-12 11:41:13

Tags: python html csv beautifulsoup

I am parsing HTML markup that carries inline CSS styles.

<table class="cmp-ratings-expanded">
<span class="cmp-Rating-on" style="width: 60.0%;"></span>
<td>Job Work</td>
<span class="cmp-Rating-on" style="width: 80.0%;"></span>
<td>Compensation</td>
</table>

<table class="cmp-ratings-expanded">
<span class="cmp-Rating-on" style="width: 100.0%;"></span>
<td>Job Work</td>
<span class="cmp-Rating-on" style="width: 40.0%;"></span>
<td>Compensation</td>
</table>

I need to extract the numbers 60, 80, 100, and 40 so that I can write them to a CSV file.

I tried:

rates = soup.find('table', {'class':['cmp-ratings-expanded']}).find_all("span", style=True)
for rate in rates:
    rate = re.match( r'width: (\d+)', rate["style"])

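which I took from another source, but I found that it only parses the numbers 60 and 80. Because of Beautiful Soup's find() method, the remaining numbers (100, 40) are never parsed.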
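To illustrate the symptom: find() returns only the first matching element, while find_all() returns every match. The following is a minimal sketch, assuming the html.parser backend and a shortened copy of the markup; the variable names are illustrative and not taken from the code above.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (the <td> cells are omitted).
html = """
<table class="cmp-ratings-expanded">
<span class="cmp-Rating-on" style="width: 60.0%;"></span>
<span class="cmp-Rating-on" style="width: 80.0%;"></span>
</table>
<table class="cmp-ratings-expanded">
<span class="cmp-Rating-on" style="width: 100.0%;"></span>
<span class="cmp-Rating-on" style="width: 40.0%;"></span>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find() stops at the first matching table, so only its spans are seen.
first_table = soup.find("table", class_="cmp-ratings-expanded")
first_only = [span["style"] for span in first_table.find_all("span", style=True)]

# find_all() returns every matching table; loop over them to reach all spans.
all_styles = [
    span["style"]
    for table in soup.find_all("table", class_="cmp-ratings-expanded")
    for span in table.find_all("span", style=True)
]

print(first_only)
print(all_styles)
```

Here first_only holds the two style strings of the first table, while all_styles holds all four.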

Ultimately, I need to write the values to a CSV file. Because of the for loop, this is the result I get from the code above:

|60|
|100|

The code that writes to the CSV file:

with open(some_file.csv, 'w+') as file:
     file.write(rate)

What I expect:

Parse every style value of the form width: 80.0%; and write the extracted numbers to the CSV file, one row per table:

|Job Work|Compensation|
|60|80|
|100|40|
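The extraction step behind this expected output can be sketched with the standard library alone. This is only an illustration under assumptions: the helper name width_percent, the pattern WIDTH_RE, and the hard-coded list of style strings are hypothetical, and the grouping assumes exactly two ratings per table.

```python
import re

# Matches the integer part of a declaration such as "width: 80.0%;".
WIDTH_RE = re.compile(r"width:\s*(\d+)(?:\.\d+)?%")


def width_percent(style):
    """Return the integer part of the width percentage, or None if absent."""
    match = WIDTH_RE.search(style)
    return match.group(1) if match else None


# Style strings in document order (assumed to be collected elsewhere).
styles = ["width: 60.0%;", "width: 80.0%;", "width: 100.0%;", "width: 40.0%;"]
numbers = [width_percent(s) for s in styles]

# Group the flat list two at a time: one row per table.
rows = [numbers[i:i + 2] for i in range(0, len(numbers), 2)]
print(rows)
```

Returning None for a style without a width avoids calling .group() on a failed match.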

2 answers:

Answer 0 (score: 0)

Here is one approach.

For example:

import re
import csv
from bs4 import BeautifulSoup

html = """<table class="cmp-ratings-expanded">
    <span class="cmp-Rating-on" style="width: 60.0%;"></span>
        <td>Job Work</td>
            <span class="cmp-Rating-on" style="width: 80.0%;"></span>
        <td>Compensation</td>
    </table>

    <table class="cmp-ratings-expanded">
    <span class="cmp-Rating-on" style="width: 100.0%;"></span>
    <td>Job Work</td>
    <span class="cmp-Rating-on" style="width: 40.0%;"></span>
    <td>Compensation</td>
</table>"""

soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all('table', {'class':['cmp-ratings-expanded']})
result = []
for table in tables:
    temp = []
    for span in table.find_all("span", style=True):
        rate = re.match(r'width: (\d+)', span["style"])
        temp.append(rate.group(1))  # Integer part of the width, e.g. "60"
    result.append(temp)

# Write to CSV
filename = "out.csv"  # placeholder; use any output path
with open(filename, "w", newline="") as csvfile:
    writer = csv.writer(csvfile, delimiter="|")
    # Write header
    writer.writerow(["Job Work", "Compensation"])
    writer.writerows(result)
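A possible variation on this approach reads the header from the <td> cells instead of hard-coding it, and tolerates a style attribute without a width. It is a sketch, not part of the original answer: it assumes that every table lists the same labels in the same order, uses the html.parser backend, and writes to an in-memory buffer (the names header, rows, and csv_text are illustrative).

```python
import csv
import io
import re

from bs4 import BeautifulSoup

html = """<table class="cmp-ratings-expanded">
<span class="cmp-Rating-on" style="width: 60.0%;"></span>
<td>Job Work</td>
<span class="cmp-Rating-on" style="width: 80.0%;"></span>
<td>Compensation</td>
</table>
<table class="cmp-ratings-expanded">
<span class="cmp-Rating-on" style="width: 100.0%;"></span>
<td>Job Work</td>
<span class="cmp-Rating-on" style="width: 40.0%;"></span>
<td>Compensation</td>
</table>"""

soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table", class_="cmp-ratings-expanded")

# Assumption: every table lists the same labels in the same order,
# so the first table's <td> cells can serve as the header.
header = [td.get_text(strip=True) for td in tables[0].find_all("td")]

rows = []
for table in tables:
    row = []
    for span in table.find_all("span", style=True):
        match = re.search(r"width:\s*(\d+)", span["style"])
        # Keep the column count stable even when a style has no width.
        row.append(match.group(1) if match else "")
    rows.append(row)

# In-memory buffer for illustration; a file opened with newline="" works the same way.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Note that csv.writer places the delimiter only between fields, so the lines carry no leading or trailing "|".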

Answer 1 (score: 0)

You can use CSS selectors to parse the HTML:

import re
from bs4 import BeautifulSoup

data = '''<table class="cmp-ratings-expanded">
<span class="cmp-Rating-on" style="width: 60.0%;"></span>
<td>Job Work</td>
<span class="cmp-Rating-on" style="width: 80.0%;"></span>
<td>Compensation</td>
</table>

<table class="cmp-ratings-expanded">
<span class="cmp-Rating-on" style="width: 100.0%;"></span>
<td>Job Work</td>
<span class="cmp-Rating-on" style="width: 40.0%;"></span>
<td>Compensation</td>
</table>'''

soup = BeautifulSoup(data, 'lxml')
r_number = re.compile(r'(\d+)\.?\d*%')

with open('out.csv', 'w') as f_out:
    f_out.write('|Job Work|Compensation|\n')
    # Note: soupsieve 2.1+ deprecates :contains() in favor of :-soup-contains()
    for job, compensation in zip(soup.select('span[style]:has(+:contains("Job Work"))'),
                                 soup.select('span[style]:has(+:contains("Compensation"))')):
        job_number = r_number.search(job['style'])[1]
        compensation_number = r_number.search(compensation['style'])[1]
        f_out.write('|' + '|'.join([job_number, compensation_number]) + '|\n')

The file out.csv contains:

|Job Work|Compensation|
|60|80|
|100|40|
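Because each line in this output starts and ends with "|", reading it back with the csv module yields an empty first and last field per row. A small sketch of how that could be handled, assuming the content shown above (the names content, records, header, and data are illustrative):

```python
import csv
import io

# The file content shown above, as produced by the f_out.write() calls.
content = "|Job Work|Compensation|\n|60|80|\n|100|40|\n"

# The leading and trailing "|" make csv.reader yield an empty
# first and last field on every row, so those are sliced off.
reader = csv.reader(io.StringIO(content), delimiter="|")
records = [row[1:-1] for row in reader]
header, data = records[0], records[1:]

print(header)
print(data)
```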

Further reading:

CSS Selectors reference