I am trying to parse an HTML file into a DataFrame in a way that preserves the parent-child relationships between div tags.
For example:
<div class="ddemrcontentitem ddremovable" dd:entityid="0" id="_5C026969-71BA-456E-A183-BC923BAB9E99" style="clear: both;" xmlns:dd="DynamicDocumentation">Orders:
<div style="padding-left: 8px;">
<div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251406974" id="_57B1A3DC-1899-4752-9516-6F137BBE1C8F">CBC w/ Auto Diff</div>
<div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251389861" id="_0A418835-4384-4ACC-A4FD-3C901539DADB">Hygiene Activity</div>
<div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251389598" id="_5D06090F-7330-49B1-BB53-28496388E8C1">Regular Diet</div>
<div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251407213" id="_0D683EC1-4D18-45F4-BD52-0451DDA3BF5A">Sodium Level</div>
<div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251410812" id="_82ACC1FF-DA2E-472C-BA0F-E881293BDCBA">Sodium Level</div>
</div>
"Orders" should be the parent of each of (CBC w/ Auto Diff, Regular Diet, Sodium Level, Sodium Level) in a dictionary or DataFrame.
Here is my failed attempt:
import pandas as pd
import bs4
'''i imported the file- parsed html using bs4 package
made a list of the div tags and made 2 dictionary too
one with the text and one with the full tags and text
then made tables of them (pandas dataframes)'''
alpha = open('D://python/893714319.00.html','r')
beta = bs4.BeautifulSoup(alpha, 'lxml')
lister = []
fulllister = []
listerer = {}
mydivs = beta.findAll('div')
for div in mydivs:
    lister.append(div.text)
    fulllister.append(div.contents)
listerer = {k:v for v,k in enumerate(lister)}
fulllisterer = {k:v for k,v in enumerate(fulllister)}
listerer = sorted(listerer.items(), key=lambda x: x[1])
fulllisterer = sorted(fulllisterer.items(), key = lambda x:x[1])
listerer = pd.DataFrame(listerer)
fulllisterer = pd.DataFrame(fulllisterer)
listerer.dropna( inplace='True',how='any')
fulllisterer.dropna(axis=1, inplace='True',how='any')
'''trying to characterize the string that is parent and what is child
by counting <div> in it but this is not working , i don't know why
by parent i mean 'orders' and the children would be 'cbc' and so
'''
fulllisterer['divier']= ""
fulllisterer['count']= 0
for string in fulllisterer[1].iteritems():
    fulllisterer['count']=string.count('<div>')
    if string.count('<div>')>1:
        fulllisterer['divier'] = fulllisterer[1]
The output should look like:
<html>
<body>
<table>
<th>parent</th>
<th>child</th>
<tr>
<td>orders</td>
<td>CBC w/ Auto Diff</td>
</tr>
<tr>
<td>orders</td>
<td> Hygiene Activity</td>
</tr>
<tr>
<td>orders</td>
<td> Regular Diet</td>
</tr>
<tr>
<td>orders</td>
<td>Sodium Level</td>
</tr>
<tr>
<td>orders</td>
<td>Sodium Level</td>
</tr>
</table>
</body></html>
Answer 0 (score: 0)
I think you are just over-engineering this. The following code is adapted from your snippet:
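import pandas as pd
import bs4

# Parse the file (same path as in the question)
with open('D://python/893714319.00.html', 'r') as alpha:
    beta = bs4.BeautifulSoup(alpha, 'lxml')

mydivs = beta.find_all('div')
lister = []
for div in mydivs:
    lister.append(div.text)

# The text of the first div holds the parent label followed by its children, one per line
data_list = lister[0].split('\n')
data_list = [el.strip().replace(':', '') for el in data_list if el.strip() != '']
# The scalar parent value is broadcast against the list of children
print(pd.DataFrame({'parent': data_list[0], 'child': data_list[1:]}))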
Now you just need to make sure you call this for each parent div tag, rather than only for lister[0].
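One possible way to handle every parent div at once, instead of splitting the text of lister[0], is to walk the parsed tree and pair each labelled div with its nearest labelled ancestor. The sketch below is only an illustration under some assumptions: the HTML string is a simplified copy of the sample from the question (most attributes dropped, a closing tag added for the outer div), the helper names own_text and parent_child_rows are made up for this example, and 'html.parser' is used in place of 'lxml' so that nothing beyond bs4 and pandas is needed.

```python
import bs4
import pandas as pd

# Simplified copy of the sample from the question (assumption: attributes
# trimmed and the outer div explicitly closed).
HTML = """
<div class="ddemrcontentitem ddremovable">Orders:
<div style="padding-left: 8px;">
<div class="ddemrcontentitem ddremovable">CBC w/ Auto Diff</div>
<div class="ddemrcontentitem ddremovable">Hygiene Activity</div>
<div class="ddemrcontentitem ddremovable">Regular Diet</div>
<div class="ddemrcontentitem ddremovable">Sodium Level</div>
<div class="ddemrcontentitem ddremovable">Sodium Level</div>
</div>
</div>
"""


def own_text(tag):
    """Return the text placed directly inside `tag`, ignoring nested tags."""
    pieces = [c for c in tag.children if isinstance(c, bs4.NavigableString)]
    return ''.join(pieces).strip().rstrip(':')


def parent_child_rows(soup):
    """Pair every labelled div with its nearest labelled ancestor div."""
    rows = []
    for div in soup.find_all('div'):
        child = own_text(div)
        if not child:
            continue  # pure wrapper div, it has no label of its own
        # find_parents yields ancestors from the nearest one outwards
        for ancestor in div.find_parents('div'):
            parent = own_text(ancestor)
            if parent:
                rows.append({'parent': parent, 'child': child})
                break
    return rows


soup = bs4.BeautifulSoup(HTML, 'html.parser')
df = pd.DataFrame(parent_child_rows(soup), columns=['parent', 'child'])
print(df)
```

With this approach the unlabelled wrapper div (the one carrying only the padding style) is skipped, so each order is attached to "Orders" directly. A top-level div with no labelled ancestor produces no row of its own.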