I am trying to build a set of keywords from a CSV column that contains HTML. Some rows in the CSV have incomplete data in the categories div.
categories = []

def find_elms(soup, tag, attribute):
    """Find the block using its tag and attribute values"""
    categories_block = soup.find(tag, attribute)
    if categories_block:
        keywords = [elm.text for elm in categories_block.findAll('a')]
        return keywords
        #return [elm.text for elm in categories_block.findAll('a')]
    return []

def build_cats(categories):
    category = find_elms(soup, 'div', {'id': 'categories'})
    '''returns [x,y]'''
    for cat in category:
        categories.append(category)

build_cats(soup)
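I changed the code hoping to get a result like this:
[category1,...,category1000]
However, what I get is [[category1,..,category25],[category26,...,category50],... []], or a series of errors that lead down a rabbit hole into darkness.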
The source data looks like this:
"<div id="categories">
<h3>Categories</h3>
<ul>
<li><a href="">CategoryA</a></li><li><a href="">CategoryB</a></li>
</ul></div>
","<div id="col1"><h3>File</h3></div>, <div id="col1">
<a href="">A.jpg</a>
<br/></div>
, <div id="col1">
<a href="">B.jpg</a>
<br/></div>
, <div id="col1">
<a href="">C.jpg</a>
<br/></div>
"
"<div id="categories">
<h3>Categories</h3>
</div>
","<div id="col1"><h3>File</h3></div>, <div id="col1">
<a href="">D.jpg</a>
<br/></div>
, <div id="col1">
<a href="">E.jpg</a>
<br/></div>
, <div id="col1">
<a href="">F.jpg</a>
<br/></div>
"
"<div id="categories">
<h3>Categories</h3>
<ul>
<li><a href="">CategoryC</a></li><li><a href="">CategoryD</a></li>
</ul></div>
","<div id="col1"><h3>File</h3></div>, <div id="col1">
<a href="">G.jpg</a>
<br/></div>
, <div id="col1">
<a href="">H.jpg</a>
<br/></div>
, <div id="col1">
<a href="">I.jpg</a>
<br/></div>
"
"<div id="categories">
<h3>Categories</h3>
<ul>
<li><a href="">CategoryA</a></li><li><a href="">CategoryE</a></li>
</ul></div>
","<div id="col1"><h3>File</h3></div>, <div id="col1">
<a href="">J.jpg</a>
<br/></div>
, <div id="col1">
<a href="">K.jpg</a>
<br/></div>
, <div id="col1">
<a href="">L.jpg</a>
<br/></div>
"
Any corrections or suggestions would be helpful. Thanks.
Answer 0 (score: 0)
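I pasted your source data into a text file and saved it as input.csv. I then ran the following lines of code and was able to build a list of all the categories in the sample source data: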
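from bs4 import BeautifulSoup

Categories = []
path = 'input.csv'

# Parse the whole file as HTML; the CSV quoting is treated as plain text.
with open(path) as html:
    bs = BeautifulSoup(html, 'html.parser')

# Each row's category block is a <div id="categories">.
divs = bs.find_all('div', attrs={'id': 'categories'})
for d in divs:
    cats = d.find_all('a')
    for c in cats:
        cat_label = c.text
        # Keep only the first occurrence of each category.
        if cat_label not in Categories:
            Categories.append(cat_label)

print(Categories)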
The code above produces the following list of all categories in the source data:
['CategoryA', 'CategoryB', 'CategoryC', 'CategoryD', 'CategoryE']
Each category appears in the list only once, no matter how many times it occurs in the source data (CategoryA, for example, occurs twice).
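As for the nested result in the question, one likely cause (an assumption, since the modified code was not posted) is that each row's keyword list is added with append, which stores the list itself, rather than extend, which adds its items. Here is a minimal sketch of the difference. The `rows` list is hypothetical data shaped like what find_elms would return for the four sample rows, and `dict.fromkeys` is used as an alternative way to drop duplicates while keeping first-seen order:

```python
# Hypothetical per-row results, shaped like what find_elms() would return
# for the four sample rows (the second row has no <a> tags, so it is empty).
rows = [
    ['CategoryA', 'CategoryB'],
    [],
    ['CategoryC', 'CategoryD'],
    ['CategoryA', 'CategoryE'],
]

nested = []
flat = []
for keywords in rows:
    nested.append(keywords)  # adds the list itself: one entry per row
    flat.extend(keywords)    # adds each keyword: one flat list

# dict.fromkeys keeps the first occurrence of each key in insertion order
# (guaranteed for dict since Python 3.7).
unique = list(dict.fromkeys(flat))

print(nested)
print(flat)
print(unique)
```

With append you get one inner list per row, including the empty one. With extend you get a single flat list, which can then be deduplicated in one step.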