Python - Converting text from an HTML file to CSV when the tags have no unique identifiers

Date: 2018-06-08 12:55:35

Tags: python html beautifulsoup

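I have used beautifulsoup4 to scrape some information from a web page that lists details of psychiatrists, and I have managed to narrow the result down to this section, which contains the key information: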

<h5>Practice Locations</h5>
    <p>Springfield, 1234<br/> 08 1234 5678</p>
    <p>Shelbyville, 1234<br/>08 1234 5678</p>
<h5>Gender:</h5>
    <p>Male<br/></p>
<h5>Languages spoken (other than English):</h5>
    <p>Spanish<br/></p>
    <p>Italian<br/></p>
<h5>Problem areas treated:</h5>
    <p>Anxiety disorders<br/>Mood disorders<br/>Sexual disorders<br/></p>
<h5>Populations treated:</h5>
<p>Adult<br/>Young adult<br/></p>
<h5>Subspecialty areas:</h5>
    <p>Cancer patients<br/>Gender issues<br/>Pain management<br/>Specialist psychotherapist<br/></p>
<h5>Treatments and services offered:</h5>
    <p>Does not prescribe psychotropics<br/>Psychotherapy – cognitive behavioural therapy (CBT)<br/>Psychotherapy – hypnotherapy<br/>Psychotherapy – interpersonal<br/>Psychotherapy – marital therapy<br/></p>
<h5>Practice details:</h5>
    <p>Can bulk bill selected patients<br/></p>
<p> </p>

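I want to put the information under each heading into a column of a .csv file, but I can't figure out how to do this because the headings have no unique identifiers of any kind. I know I have to use the headings to separate the columns somehow, but I am completely new to Python and not sure how to go about it.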

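Doing this by hand would be easy, but I want to collect this information from many pages that are formatted the same way. To complicate matters, some pages lack some of these headings (for example, they do not list populations treated or subspecialty areas), so I have to check whether each heading exists before trying to collect its information.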

Any guidance is much appreciated!

1 Answer:

Answer 0 (score: 0)

You can treat the h5 tags as section headers and group the p tags that follow each one:

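import itertools
import re
from bs4 import BeautifulSoup as soup

# `content` is assumed to hold the HTML fragment shown in the question.
parsed = soup(content, 'html.parser')
headers = [i.text for i in parsed.find_all('h5')]
# Collect every <h5> and <p> in document order as [text, tag] pairs (anchored so 'span', 'pre', etc. do not match).
full_data = [[i.text, i] for i in parsed.find_all(re.compile('^(h5|p)$'))]
# Split the sequence into alternating runs: a header, its paragraphs, the next header, ...
new_data = [[a, list(b)] for a, b in itertools.groupby(full_data, key=lambda x: x[0] in headers)]
# Join each header run with the paragraph run that follows it.
grouped = [new_data[i] + new_data[i+1] for i in range(0, len(new_data), 2)]
# Map header -> {paragraph text: items after the first <br/>}.
final_data = {c: {i: str(h)[3:-4].split('<br/>')[1:] for i, h in results} for [_, [[c, _]], _, results] in grouped}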

Output:

{'Practice Locations': {'Springfield, 1234 08 1234 5678': [' 08 1234 5678'], 'Shelbyville, 123408 1234 5678': ['08 1234 5678']}, 'Gender:': {'Male': ['']}, 'Languages spoken (other than English):': {'Spanish': [''], 'Italian': ['']}, 'Problem areas treated:': {'Anxiety disordersMood disordersSexual disorders': ['Mood disorders', 'Sexual disorders', '']}, 'Populations treated:': {'AdultYoung adult': ['Young adult', '']}, 'Subspecialty areas:': {'Cancer patientsGender issuesPain managementSpecialist psychotherapist': ['Gender issues', 'Pain management', 'Specialist psychotherapist', '']}, 'Treatments and services offered:': {'Does not prescribe psychotropicsPsychotherapy – cognitive behavioural therapy (CBT)Psychotherapy – hypnotherapyPsychotherapy – interpersonalPsychotherapy – marital therapy': ['Psychotherapy – cognitive behavioural therapy (CBT)', 'Psychotherapy – hypnotherapy', 'Psychotherapy – interpersonal', 'Psychotherapy – marital therapy', '']}, 'Practice details:': {'Can bulk bill selected patients': [''], ' ': []}}
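
The question asks for CSV columns and mentions that some pages omit headings, so a simpler variant may be easier to work with. The following is only a sketch under stated assumptions, not part of the original answer: the helper names `parse_sections` and `to_csv`, the `"; "` separator, and the second sample page are hypothetical. It walks the `h5` and `p` tags in document order, attaches each paragraph's text items to the most recent heading, and leaves a cell empty when a page lacks a heading.

```python
import csv
import io

from bs4 import BeautifulSoup


def parse_sections(html):
    """Map each <h5> heading to the text items of the <p> tags that follow it."""
    sections = {}
    current = None
    for tag in BeautifulSoup(html, "html.parser").find_all(["h5", "p"]):
        if tag.name == "h5":
            current = tag.get_text(strip=True)
            sections[current] = []
        elif current is not None:
            # stripped_strings yields each text node between <br/> tags
            # and skips whitespace-only ones, such as the trailing <p> </p>.
            sections[current].extend(tag.stripped_strings)
    return sections


def to_csv(pages, sep="; "):
    """Return CSV text with one row per page and one column per heading."""
    # Columns are the union of headings across pages, in first-seen order.
    columns = []
    for page in pages:
        for heading in page:
            if heading not in columns:
                columns.append(heading)
    out = io.StringIO()
    # restval fills the cell for any heading a page does not have.
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    for page in pages:
        writer.writerow({h: sep.join(items) for h, items in page.items()})
    return out.getvalue()


# Shortened sample taken from the question's HTML.
page_a = """<h5>Gender:</h5>
<p>Male<br/></p>
<h5>Populations treated:</h5>
<p>Adult<br/>Young adult<br/></p>
<p> </p>"""
# Hypothetical second page with no "Populations treated:" section.
page_b = "<h5>Gender:</h5><p>Female<br/></p>"

pages = [parse_sections(page_a), parse_sections(page_b)]
print(to_csv(pages))
```

Because the column list is built from whatever headings actually appear, there is no need to check for each heading in advance; the row for the second page simply gets an empty "Populations treated:" cell. Note that this flattens every `p` under a heading into one list, so a location and its phone number become separate items; keep them paired per `p` if that distinction matters. To write a file instead of returning a string, the same `csv.DictWriter` can be given a file opened with `newline=""`.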