Slicing an HTML file into a pandas DataFrame while preserving the parent-child relationship of the div tags

Time: 2017-11-09 16:16:57

Tags: pandas beautifulsoup

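I am trying to slice an HTML file into a DataFrame in a way that preserves the parent-child relationship between the div tags.

For example: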

<div class="ddemrcontentitem ddremovable" dd:entityid="0" id="_5C026969-71BA-456E-A183-BC923BAB9E99" style="clear: both;" xmlns:dd="DynamicDocumentation">Orders:
        <div style="padding-left: 8px;">
        <div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251406974" id="_57B1A3DC-1899-4752-9516-6F137BBE1C8F">CBC w/ Auto Diff</div>

        <div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251389861" id="_0A418835-4384-4ACC-A4FD-3C901539DADB">Hygiene Activity</div>

        <div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251389598" id="_5D06090F-7330-49B1-BB53-28496388E8C1">Regular Diet</div>

        <div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251407213" id="_0D683EC1-4D18-45F4-BD52-0451DDA3BF5A">Sodium Level</div>

        <div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251410812" id="_82ACC1FF-DA2E-472C-BA0F-E881293BDCBA">Sodium Level</div>
        </div>

"Orders" should be the parent of each of (CBC w/ Auto Diff, Regular Diet, Sodium Level, Sodium Level) in a dictionary or a DataFrame.

Here is my failed attempt:

import pandas as pd
import bs4

'''I imported the file and parsed the HTML with the bs4 package,
made a list of the div tags and two dictionaries,
one with the text only and one with the full tags and text,
then made tables (pandas DataFrames) from them.'''

alpha = open('D://python/893714319.00.html','r')
beta = bs4.BeautifulSoup(alpha, 'lxml')

lister = []
fulllister = []
listerer = {}

mydivs = beta.findAll('div')

for div in mydivs:

    lister.append(div.text)
    fulllister.append(div.contents)


listerer = {k:v for v,k in enumerate(lister)}

fulllisterer = {k:v for k,v in enumerate(fulllister)}

listerer = sorted(listerer.items(), key=lambda x: x[1])

fulllisterer = sorted(fulllisterer.items(), key = lambda x:x[1])

listerer = pd.DataFrame(listerer)
fulllisterer = pd.DataFrame(fulllisterer)

listerer.dropna( inplace='True',how='any')
fulllisterer.dropna(axis=1, inplace='True',how='any')

'''Trying to tell which string is a parent and which is a child
by counting <div> in it, but this is not working and I don't know why.
By parent I mean 'Orders'; the children would be 'CBC' and so on.
'''

fulllisterer['divier']= ""
fulllisterer['count']= 0

for string in fulllisterer[1].iteritems():

    fulllisterer['count']=string.count('<div>')
    if string.count('<div>')>1:
        fulllisterer['divier'] = fulllisterer[1]

The desired output would look like:

<html>
<body>
<table>

<th>parent</th>
<th>child</th>
<tr>
<td>orders</td>
<td>CBC w/ Auto Diff</td>
</tr>
<tr>
<td>orders</td>
<td> Hygiene Activity</td>
</tr>
<tr>
<td>orders</td>
<td> Regular Diet</td>
</tr>
<tr>
<td>orders</td>
<td>Sodium Level</td>
</tr>
<tr>
<td>orders</td>
<td>Sodium Level</td>
</tr>
</table>
</body></html>


1 Answer:

Answer 0: (score: 0)

I think you are simply over-engineering this. The following code is adapted from your snippet:

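import pandas as pd
import bs4

# Parse the same file used in the question
with open('D://python/893714319.00.html', 'r') as alpha:
    beta = bs4.BeautifulSoup(alpha, 'lxml')

mydivs = beta.find_all('div')

# Collect the text of every div; a parent's text includes its children's text
lister = [div.text for div in mydivs]

# lister[0] is the outer "Orders:" div: its first non-empty line is the
# parent and the remaining non-empty lines are the children
data_list = lister[0].split('\n')
data_list = [el.strip().replace(':', '') for el in data_list if el.strip() != '']
print(pd.DataFrame({'parent': data_list[0], 'child': data_list[1:]}))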

Now you just need to make sure you call this for each parent div tag in place of lister[0].
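
One way to apply this to every parent div without splitting text on newlines is to let BeautifulSoup report each div's enclosing div directly. The following is only a minimal sketch under a few assumptions: the sample markup is a shortened, hypothetical stand-in for the real file, the helper names `own_text` and `parent_child_frame` are made up for illustration, only divs carrying the `ddemrcontentitem` class are treated as parents or children, and the built-in `html.parser` is used instead of `lxml` so the example has no extra dependency.

```python
import bs4
import pandas as pd

# Hypothetical, shortened sample mirroring the structure shown in the question
SAMPLE_HTML = """
<div class="ddemrcontentitem ddremovable" dd:entityid="0">Orders:
  <div style="padding-left: 8px;">
    <div class="ddemrcontentitem ddremovable" dd:entityid="1">CBC w/ Auto Diff</div>
    <div class="ddemrcontentitem ddremovable" dd:entityid="2">Regular Diet</div>
    <div class="ddemrcontentitem ddremovable" dd:entityid="3">Sodium Level</div>
  </div>
</div>
"""


def own_text(tag):
    """Return the text written directly inside `tag`, ignoring nested tags."""
    parts = [
        child.strip()
        for child in tag.children
        if isinstance(child, bs4.NavigableString)
    ]
    # Drop empty fragments and the trailing colon of labels such as "Orders:"
    return ' '.join(p for p in parts if p).rstrip(':')


def parent_child_frame(html):
    """Build a parent/child DataFrame from nested 'ddemrcontentitem' divs."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    rows = []
    for div in soup.find_all('div', class_='ddemrcontentitem'):
        # Nearest enclosing div with the same class; unclassed wrapper divs
        # (the padding div here) are skipped automatically
        parent = div.find_parent('div', class_='ddemrcontentitem')
        if parent is not None:
            rows.append({'parent': own_text(parent), 'child': own_text(div)})
    return pd.DataFrame(rows, columns=['parent', 'child'])


df = parent_child_frame(SAMPLE_HTML)
print(df)
```

Because each row is built from `find_parent`, the same function should handle several parent divs in one file, and top-level divs that have no enclosing item simply produce no row of their own. If the real file relies on a different class or on the `dd:contenttype` attribute to mark items, the filter would need to be adjusted accordingly.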