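Extracting text from specific paragraphs of a website using Python 2

Time: 2017-02-26 15:46:18

Tags: python parsing beautifulsoup python-requests lxml

I want to extract certain paragraphs: the ones listing the industries that reported growth and contraction, and the ones quoting what respondents are saying (these appear in several places on the page). The paragraphs usually sit just above a table. How can I use Requests, lxml, or BeautifulSoup to parse the page and select the paragraphs I need?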

https://github.com/dmlc/xgboost/tree/master/jvm-packages

I tried lxml with XPath, but the site changes slightly every month when a new report is published, and the code stops working.

2 answers:

Answer 0 (score: 3)

A third option is PyQuery. It is fast and uses jQuery-style selectors. You can find the selectors you need easily with the SelectorGadget extension for Chrome.

After that, all that remains is to use it.

from pyquery import PyQuery as pq
import requests

url = "https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1"
content = requests.get(url).content
doc = pq(content)

respondent = doc(".formatted_content ul").text()

print(respondent)

Output:

“Demand very steady to start the year.” (Chemical Products) “January revenue target slightly lower following a big December shipment month.” (Computer & Electronic Products) “Strong start to the new year. Production is increasing and we are adding capacity.” (Plastics & Rubber Products) “Business looks stronger moving into the first quarter of 2017.” (Primary Metals) “Economic outlook remains stable and no current effects of geopolitical changes appear to be penetrating market conditions.” (Food, Beverage & Tobacco Products) “Sales bookings are exceeding expectations. We are starting to see supply shortages in hot rolled steel due to the curtailment of imports.” (Machinery) “Year starting on pace with Q4 2016.” (Transportation Equipment) “Business conditions are good, demand is generally increasing.” (Miscellaneous Manufacturing) “Conditions and outlook remain positive. Raw material prices are stable resulting in stable margins. Asset utilization remains high.” (Petroleum & Coal Products) “Steady demand from automotive.” (Fabricated Metal Products)
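Because `.text()` joins every list item into one string, the individual comments may need to be separated again before further processing. The following is only a sketch, not part of the original answer: it assumes Python 3, assumes each comment keeps the `“comment” (Industry)` shape seen in the output above, and the helper name `split_comments` is made up for illustration.

```python
import re

# Matches a curly-quoted comment followed by its industry in parentheses.
QUOTE_RE = re.compile(r'“([^”]+)”\s*\(([^)]+)\)')


def split_comments(text):
    """Return a list of (comment, industry) tuples found in text."""
    return [(comment.strip(), industry.strip())
            for comment, industry in QUOTE_RE.findall(text)]


# Shortened sample copied from the output above.
sample = ('“Demand very steady to start the year.” (Chemical Products) '
          '“Steady demand from automotive.” (Fabricated Metal Products)')

for comment, industry in split_comments(sample):
    print(industry + ': ' + comment)
```

In the answer's script, `split_comments(respondent)` would be the corresponding call. If a future report changes the quote style, the pattern would need adjusting.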

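Answer 1 (score: 1)

How close is this code to what you need?

It uses one regular expression to identify the paragraphs and another to find the line that precedes the list of what respondents are saying. Then it simply displays the results.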

>>> import requests
>>> URL = 'https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1'
>>> r = requests.get(URL)
>>> page = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(page, 'lxml')
>>> import re
>>> paras = soup.find_all('p', string=re.compile(r'(?:growth|contraction).*? are:'))
>>> saying = soup.find_all('strong', string=re.compile('WHAT RESPONDENTS ARE SAYING'))[0]
>>> for i, para in enumerate(paras):
...     'Paragraph ', i
...     para
...     
('Paragraph ', 0)
<p>Of the 18 manufacturing industries, 12 reported growth in January in the following order: Plastics &amp; Rubber Products; Miscellaneous Manufacturing; Apparel, Leather &amp; Allied Products; Paper Products; Chemical Products; Transportation Equipment; Food, Beverage &amp; Tobacco Products; Machinery; Petroleum &amp; Coal Products; Primary Metals; Fabricated Metal Products; and Computer &amp; Electronic Products. The five industries reporting contraction in January are: Nonmetallic Mineral Products; Wood Products; Furniture &amp; Related Products; Electrical Equipment, Appliances &amp; Components; and Printing &amp; Related Support Activities.</p>
('Paragraph ', 1)
<p>The 12 industries reporting growth in new orders in January — listed in order — are: Plastics &amp; Rubber Products; Apparel, Leather &amp; Allied Products; Miscellaneous Manufacturing; Chemical Products; Paper Products; Transportation Equipment; Electrical Equipment, Appliances &amp; Components; Petroleum &amp; Coal Products; Primary Metals; Machinery; Fabricated Metal Products; and Food, Beverage &amp; Tobacco Products. The five industries reporting a decrease in new orders during January are: Nonmetallic Mineral Products; Wood Products; Textile Mills; Computer &amp; Electronic Products; and Furniture &amp; Related Products.</p>
('Paragraph ', 2)
<p>The 10 industries reporting growth in production during the month of January — listed in order — are: Miscellaneous Manufacturing; Apparel, Leather &amp; Allied Products; Paper Products; Petroleum &amp; Coal Products; Plastics &amp; Rubber Products; Transportation Equipment; Chemical Products; Machinery; Food, Beverage &amp; Tobacco Products; and Computer &amp; Electronic Products. The five industries reporting a decrease in production during January are: Wood Products; Textile Mills; Nonmetallic Mineral Products; Electrical Equipment, Appliances &amp; Components; and Furniture &amp; Related Products.</p>
('Paragraph ', 3)
<p>Of the 18 manufacturing industries, the 10 reporting employment growth in January — listed in order — are: Textile Mills; Paper Products; Food, Beverage &amp; Tobacco Products; Machinery; Electrical Equipment, Appliances &amp; Components; Chemical Products; Miscellaneous Manufacturing; Transportation Equipment; Computer &amp; Electronic Products; and Nonmetallic Mineral Products. The five industries reporting a decrease in employment in January are: Plastics &amp; Rubber Products; Petroleum &amp; Coal Products; Primary Metals; Fabricated Metal Products; and Printing &amp; Related Support Activities. </p>
('Paragraph ', 4)
<p>The seven industries reporting growth in order backlogs in January — listed in order — are: Wood Products; Plastics &amp; Rubber Products; Electrical Equipment, Appliances &amp; Components; Primary Metals; Fabricated Metal Products; Miscellaneous Manufacturing; and Chemical Products. The seven industries reporting a decrease in order backlogs during January — listed in order — are: Nonmetallic Mineral Products; Textile Mills; Paper Products; Computer &amp; Electronic Products; Food, Beverage &amp; Tobacco Products; Transportation Equipment; and Furniture &amp; Related Products.</p>
('Paragraph ', 5)
<p>The eight industries reporting growth in new export orders in January — listed in order — are: Wood Products; Paper Products; Petroleum &amp; Coal Products; Chemical Products; Fabricated Metal Products; Transportation Equipment; Miscellaneous Manufacturing; and Food, Beverage &amp; Tobacco Products. The four industries reporting a decrease in new export orders during January are: Textile Mills; Nonmetallic Mineral Products; Plastics &amp; Rubber Products; and Machinery. Six industries reported no change in new export orders in January compared to December.</p>
('Paragraph ', 6)
<p>The four industries reporting growth in imports during the month of January are: Furniture &amp; Related Products; Apparel, Leather &amp; Allied Products; Fabricated Metal Products; and Food, Beverage &amp; Tobacco Products. The five industries reporting a decrease in imports during January are: Plastics &amp; Rubber Products; Primary Metals; Nonmetallic Mineral Products; Transportation Equipment; and Computer &amp; Electronic Products. Eight industries reported no change in imports in January compared to December.</p>
>>> saying.find_next_sibling()
<ul style="list-style-type: square;">
<li>“Demand very steady to start the year.” (Chemical Products)</li>
<li>“January revenue target slightly lower following a big December shipment month.” (Computer &amp; Electronic Products)</li>
<li>“Strong start to the new year. Production is increasing and we are adding capacity.” (Plastics &amp; Rubber Products)</li>
<li>“Business looks stronger moving into the first quarter of 2017.” (Primary Metals)</li>
<li>“Economic outlook remains stable and no current effects of geopolitical changes appear to be penetrating market conditions.” (Food, Beverage &amp; Tobacco Products)</li>
<li>“Sales bookings are exceeding expectations. We are starting to see supply shortages in hot rolled steel due to the curtailment of imports.” (Machinery)</li>
<li>“Year starting on pace with Q4 2016.” (Transportation Equipment)</li>
<li>“Business conditions are good, demand is generally increasing.” (Miscellaneous Manufacturing)</li>
<li>“Conditions and outlook remain positive. Raw material prices are stable resulting in stable margins. Asset utilization remains high.” (Petroleum &amp; Coal Products)</li>
<li>“Steady demand from automotive.” (Fabricated Metal Products)</li>
</ul>
>>>
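Once the paragraphs have been found, their text can be broken into the industry lists themselves. This is a rough sketch under stated assumptions rather than part of the answer: it assumes each list starts after a colon, ends at the next period, and separates entries with semicolons, as in the paragraphs shown above. The helper name `industry_lists` is hypothetical.

```python
import re

# Each list starts after a colon and runs to the next period.
LIST_RE = re.compile(r':\s*([^.]+)\.')


def industry_lists(paragraph_text):
    """Return one list of industry names per colon-introduced list."""
    lists = []
    for body in LIST_RE.findall(paragraph_text):
        # Entries are separated by semicolons; the last one starts with "and ".
        names = [re.sub(r'^and\s+', '', part.strip())
                 for part in body.split(';')]
        lists.append([name for name in names if name])
    return lists


# Sample text: paragraph 6 from the output above, with entities decoded.
sample = ('The four industries reporting growth in imports during the month of '
          'January are: Furniture & Related Products; Apparel, Leather & Allied '
          'Products; Fabricated Metal Products; and Food, Beverage & Tobacco '
          'Products. The five industries reporting a decrease in imports during '
          'January are: Plastics & Rubber Products; Primary Metals; Nonmetallic '
          'Mineral Products; Transportation Equipment; and Computer & Electronic '
          'Products. Eight industries reported no change in imports in January '
          'compared to December.')

growing, shrinking = industry_lists(sample)
print(growing)
print(shrinking)
```

With the session above, `industry_lists(para.get_text())` would be the call for each `para`; `get_text()` also turns `&amp;` back into `&`. An industry name containing a period would break the pattern, so it is worth checking against each new report.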