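I'm trying to extract some information from a website's HTML. The site lists companies along with some details about each one. For every company I need the "name", "description", "focus" and "location". Here is a sample block for one of the companies: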
<span class="search-type f-header">Exhibitor</span>
<h2 itemprop="name" class="search-name f-subheadline">A.M.I.</h2>
<h3 itemprop="address" class="search-attribute f-default">F - Saint Marcel
</h3>
<p itemprop="description" class="search-excerpt f-default">The A.M.I. Company manufactures indicator panels and alarm annunciator since 1976. They are used in environments with significant ...
</p>
<p itemprop="makesOffer" class="search-info f-default">Focus: On-site <strong>control</strong> panels for fieldbus systems
</p><span class="search-location f-default">Hall 12, Stand G40</span>
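The site lists almost 5000 companies. I narrowed the results down with some queries on the site, but the results are not on a single page: they are spread over 46 pages that all share the same URL, with 20 companies per page. That is why I opened the pages one by one, copied their source into a text file, and then opened that file in Python. My Python code for processing it: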
from bs4 import BeautifulSoup
import urllib.request
from requests import get
import csv
import pandas as pd
url_oku = open('hannover.txt')
soup = BeautifulSoup(url_oku, 'html.parser')
total = []
mid = []
# Pre-fill each list with 20 placeholders, one per company on a page
companies = ['?'] * 20
descriptions = ['?'] * 20
locations = ['?'] * 20
focus = ['?'] * 20
for count, comp in enumerate(soup.find_all('h2', {'itemprop': 'name'})):
    companies[count] = comp.text
for count, desc in enumerate(soup.find_all('p', {'class': 'search-excerpt f-default'})):
    descriptions[count] = desc.text
for count, foc in enumerate(soup.find_all('p', {'class': 'search-info f-default'})):
    focus[count] = foc.text.strip()
for count, loc in enumerate(soup.find_all('span', {'class': 'search-location f-default'})):
    locations[count] = loc.text
print(len(companies), len(descriptions), len(locations), len(focus))
for i in range(len(companies)):
    mid.append(companies[i])
    mid.append(descriptions[i])
    mid.append(focus[i])
    mid.append(locations[i])
    total.append(mid)
    mid = []
my_df = pd.DataFrame(total)
my_df.columns = ['Company', 'Descr.','Focus','Location']
print(my_df)
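I created lists of 20 '?' entries to make sure every list contains 20 elements, so that no information gets lost. Unfortunately, some companies are missing some of the information. For example: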
<span class="search-type f-header">Exhibitor</span>
<h2 itemprop="name" class="search-name f-subheadline">STOCKO CONTACT</h2>
<h3 itemprop="address" class="search-attribute f-default">D - Wuppertal
</h3>
<p itemprop="description" class="search-excerpt f-default">... our products at a high quality level. Products that can be found equally in heating <strong>controls</strong>, drink dispensing machines ...
</p><span class="search-location f-default">Hall 9, Stand F69</span></a>
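For example, this company has no focus information. When I use the find_all method, it only finds the elements that exist and adds them to the list, without regard to where they sit on the page or which company they belong to. When I then iterate over the company names and append the information to the list 'total', the companies and their information no longer line up in the DataFrame. (Screenshot: the Excel output when information is missing.) As you can see there, the focus element
<p itemprop="makesOffer" class="search-info f-default">
is missing for some companies, so I cannot match the existing focus information to the companies it belongs to.
Is there a way to solve this?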
Answer 0 (score: 0)
The information can be extracted like this:
line = '''<span class="search-type f-header">Exhibitor</span>
<h2 itemprop="name" class="search-name f-subheadline">A.M.I.</h2>
<h3 itemprop="address" class="search-attribute f-default">F - Saint Marcel
</h3>
<p itemprop="description" class="search-excerpt f-default">The A.M.I. Company manufactures indicator panels and alarm annunciator since 1976. They are used in environments with significant ...
</p>
<p itemprop="makesOffer" class="search-info f-default">Focus: On-site <strong>control</strong> panels for fieldbus systems
</p><span class="search-location f-default">Hall 12, Stand G40</span>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(line, 'lxml')
# Collect the text of every tag of each type (Python 3 syntax)
print([values.text for values in soup.find_all("span")])
print([values.text for values in soup.find_all("h2")])
print([values.text for values in soup.find_all("h3")])
print([values.text for values in soup.find_all("p")])
#output:
['Exhibitor', 'Hall 12, Stand G40']
['A.M.I.']
['F - Saint Marcel\n']
['The A.M.I. Company manufactures indicator panels and alarm annunciator since 1976. They are used in environments with significant ...\n', 'Focus: On-site control panels for fieldbus systems\n']
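Note that collecting each tag type across the whole page, as above, still does not keep one company's fields together when a field is missing. A minimal sketch of one way to handle that: start from each company's h2 name tag and walk its following siblings until the next h2, filling in '?' for anything not found. The function name extract_companies and the sample_html string are hypothetical, made up for illustration from the two snippets in the question. The sketch also assumes that every company block starts with its h2 name and that the other fields follow it as siblings, which may not hold for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample built from the two company blocks in the question;
# the second company has no "search-info" (focus) paragraph.
sample_html = '''<span class="search-type f-header">Exhibitor</span>
<h2 itemprop="name" class="search-name f-subheadline">A.M.I.</h2>
<h3 itemprop="address" class="search-attribute f-default">F - Saint Marcel
</h3>
<p itemprop="description" class="search-excerpt f-default">The A.M.I. Company manufactures indicator panels ...
</p>
<p itemprop="makesOffer" class="search-info f-default">Focus: On-site <strong>control</strong> panels for fieldbus systems
</p><span class="search-location f-default">Hall 12, Stand G40</span>
<span class="search-type f-header">Exhibitor</span>
<h2 itemprop="name" class="search-name f-subheadline">STOCKO CONTACT</h2>
<h3 itemprop="address" class="search-attribute f-default">D - Wuppertal
</h3>
<p itemprop="description" class="search-excerpt f-default">... our products at a high quality level ...
</p><span class="search-location f-default">Hall 9, Stand F69</span>'''

# Maps a CSS class on a sibling tag to the column it fills.
FIELD_BY_CLASS = {
    'search-excerpt': 'Descr.',
    'search-info': 'Focus',
    'search-location': 'Location',
}


def extract_companies(html):
    """Return one dict per company, with '?' for any missing field."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for name in soup.find_all('h2', itemprop='name'):
        row = {'Company': name.get_text().strip(),
               'Descr.': '?', 'Focus': '?', 'Location': '?'}
        # Only look at tags between this name and the next company's name.
        for sibling in name.find_next_siblings():
            if sibling.name == 'h2':
                break
            for css_class in sibling.get('class', []):
                if css_class in FIELD_BY_CLASS:
                    row[FIELD_BY_CLASS[css_class]] = sibling.get_text().strip()
        rows.append(row)
    return rows


rows = extract_companies(sample_html)
for row in rows:
    print(row)
```

Because each row is built from a single company's own tags, a missing focus paragraph leaves '?' in that row only and does not shift the values of later companies. The resulting list of dicts can be passed straight to pd.DataFrame(rows) to get the four columns used in the question.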