Question

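I am trying to parse the document http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/5923en8.htm. I want to extract everything that comes before "Commission:".

(I need Beautiful Soup because the second step is to extract the countries and the people's names.)

If I do this: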

import urllib
import re
from bs4 import BeautifulSoup
url="http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/5923en8.htm"
soup=BeautifulSoup(urllib.urlopen(url))
print soup.find_all(text=re.compile("Commission"))

The only result I get is:

[u'The Governments of the Member States and the European Commission were represented as follows:']

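This is the first occurrence of the word, but not the line I am looking for. I think this is because the document is invalid HTML, but I am not sure. If I look at the page source, I find: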

<B><U><P>Commission</B></U>:</P>

But if I print soup, I can see the text, with the tags reordered:

<u><b>Commission</b></u>

How can I get this "Commission:" element?

I am using Python 2.7 and Beautiful Soup 4.3.2.

Edit: Solved!

As alecxe suggested, I replaced the line:

soup=BeautifulSoup(urllib.urlopen(url))

with

soup=BeautifulSoup(urllib.urlopen(url), 'html.parser')

It works now :). Thanks, everyone.
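The effect of naming the parser explicitly can be checked offline on a small inline snippet. The sketch below is a Python 3 port under stated assumptions: the `html` string is a hypothetical stand-in for the page, built around the malformed line quoted above, and `string=` is the newer spelling of the `text=` argument (Beautiful Soup 4.4 and later; with 4.3.2 use `text=`).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: one ordinary paragraph followed by
# the badly nested line from the page source.
html = (
    "<html><body>"
    "<p>The European Commission was represented.</p>"
    "<B><U><P>Commission</B></U>:</P>"
    "</body></html>"
)

# Name the parser explicitly so the result does not depend on which
# optional parser libraries happen to be installed.
soup = BeautifulSoup(html, "html.parser")

# Collect every text node that contains the word "Commission".
matches = soup.find_all(string=re.compile("Commission"))
for match in matches:
    print(repr(str(match)))
```

Each parser repairs broken nesting in its own way, so the same search may return different text nodes when the second argument is changed or omitted.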

Edit: Similar questions

I found similar problems that have the same solution:

Beautiful Soup 4 find_all don't find links that Beautiful Soup 3 finds

Beautiful Soup findAll doesn't find them all

Answer 1

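If you want everything before the tag that contains the value "Commission:", you can do this without Beautiful Soup: just treat the page as a string, search for the right keyword, and discard the rest of the string.

But when I run your code, I get the following: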

[u'The Governments of the Member States and the European Commission were represented as follows:',
 u'Commission',
 u'The Council held an orientation debate on key economic policy issues with a view to giving guidance to the Commission on the questions Ministers wish to be addressed in the broad economic policy guidelines 1998/99 for which the Commission will present its recommandation later in the Spring. It was noted that the forthcoming guidelines are of particular importance given the start of stage 3 of EMU.',
 u'The debate was based on an assessment of the economic situation and outlook in the Community carried out by the Commission and the Economic Policy and Monetary Committees.',
 u"The Council held an orientation debate on the Commission's Communication setting out a possible Community framework allowing Member States to experiment with reduced VAT rates for labour-intensive services in order to boost employment in small businesses without distorting international competition. ",
 u'This Communication was tabled by the Commission as a follow-up to the Employment European Council of last November in Luxembourg, which concluded that, in order to make the taxation system more employment-friendly, "Member States will examine, without obligation, the advisability of reducing the rate of VAT on labour-intensive services not exposed to cross-border competition".',
 u"In conclusion, the Council invited Coreper to examine the technical questions arising from today's debate and to report back to it with a view to deciding on a possible request to the Commission to submit a proposal in this area. ",
 u"This technical examination should be carried out, taking into account the criteria indicated in the Commission's Communication for a reduced VAT rate, on the following questions :",
 u'An initial trial period running until the year 2002 should identify the best method for allocating FISIM. At the end of this period, the Commission will assess the results of the trial period and decide, by means of a comitology procedure, on the final methodology to be applied. However, a unanimous decision by the Council would be needed in order to use the new methodology in budgetary calculations on other Community policies and notably concerning "own resources".']

Answer 2

Iterate over the p elements and stop at the first one whose text starts with Commission:

import urllib
from bs4 import BeautifulSoup

url="http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/5923en8.htm"
soup=BeautifulSoup(urllib.urlopen(url))

for item in soup.find_all('p'):
    if item.text.startswith('Commission'):
        break
    else:
        print item.text

It prints everything before Commission:

The Governments of the Member States and the European Commission were represented as follows:
Belgium:
...
Ms Helen LIDDELL            Economic Secretary to the Treasury
* * *
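For readers on Python 3, the same loop can be written so that it collects the paragraphs into a list instead of printing them. This is a sketch under assumptions: `sample_html` is a hypothetical offline stand-in for the page, `before_marker` is an illustrative helper name, and `html.parser` is named explicitly as in the question's edit.

```python
from itertools import takewhile

from bs4 import BeautifulSoup

# Hypothetical stand-in for the page, including the badly nested marker line.
sample_html = """
<P>The Governments of the Member States and the European Commission were represented as follows:</P>
<P>Belgium:</P>
<P>* * *</P>
<B><U><P>Commission</B></U>:</P>
<P>text after the marker</P>
"""


def before_marker(soup, marker="Commission"):
    """Return the text of each <p> that precedes the first one starting with marker."""
    paragraphs = soup.find_all("p")
    # takewhile stops at the first paragraph whose text starts with the marker
    kept = takewhile(
        lambda p: not p.get_text().strip().startswith(marker), paragraphs
    )
    return [p.get_text().strip() for p in kept]


soup = BeautifulSoup(sample_html, "html.parser")
lines = before_marker(soup)
for line in lines:
    print(line)
```

Returning a list keeps the result available for the follow-up step of pulling out countries and names.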
