Question

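I am able to scrape entire chapters of the statutes with the code below. However, suppose I only want to scrape the paragraphs that contain the word "agricultural". How do I do that?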

from bs4 import BeautifulSoup
import requests
import re
import sys

# raw string so the backslashes in the Windows path are kept literally
f = open(r'C:\Python27\projects\Florida\FL_finalexact.doc', 'w')

base_url = "http://www.flsenate.gov/Laws/Statutes/2015/Chapter{chapter:03d}/All"

for chapter in range(1, 40):
    url = base_url.format(chapter=chapter)
    try:
        r = requests.get(url)
    except requests.exceptions.RequestException as e:
        print("missing url")
        print(e)
        sys.exit(1)
    soup = BeautifulSoup(r.content, "html.parser")
    tableContents = soup.find('div', {'class': 'Chapters'})

    if tableContents is not None:
        # chapter titles
        for title in tableContents.find_all('div', {'class': 'Title'}):
            f.write('\n\n' + title.text + '\n\n')

        # section bodies
        for data in tableContents.find_all('div', {'class': 'Section'}):
            data = data.text.encode("utf-8", "ignore")
            data = "\n" + str(data) + "\n"
            f.write(data)

f.close()

Do I need to use a regular expression for this task?

Answer 1

You don't need a regular expression. BeautifulSoup is more powerful than that:

soup = BeautifulSoup(r.content)
soup.find_all(lambda tag: "agricultural" in tag.string if tag.string else False)

That is enough to give you a list of all elements that contain the word "agricultural". You can then iterate over the list and extract the relevant strings:

results = soup.find_all(...) # function as before
scraped_paragraphs = map(lambda element: element.string, results)

Then write the elements of scraped_paragraphs wherever you like.
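As a rough sketch of that last step, the strings could be written to a UTF-8 text file along these lines. The sample strings are shortened stand-ins for the scraped text, and the helper name write_paragraphs and the output file name are made up for illustration; neither comes from the question.

```python
import io
import os
import tempfile

# Shortened stand-ins for the strings produced by the map() call above.
scraped_paragraphs = [
    u"Gum resin shall be understood to be agricultural products.",
    u"Whenever the terms \u201cagriculture,\u201d or similar words are used.",
]


def write_paragraphs(paragraphs, path):
    """Write each paragraph to `path` as UTF-8, separated by blank lines."""
    with io.open(path, "w", encoding="utf-8") as handle:
        for paragraph in paragraphs:
            if paragraph:  # element.string can be None, so skip empty values
                handle.write(paragraph.strip() + u"\n\n")


# Hypothetical output location; any writable path would do.
out_path = os.path.join(tempfile.gettempdir(), "agricultural_paragraphs.txt")
write_paragraphs(scraped_paragraphs, out_path)
```

Opening the file with an explicit encoding avoids having to call encode() on every string before writing it.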

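How it works

BeautifulSoup supports a find_all() function that returns all tags matching the criteria passed to it. The criteria can take the form of a regular expression, a function, a list, or even just True. In this case, a suitable boolean function is sufficient.

More importantly, though, every HTML tag in soup is indexed by various attributes. You can query an HTML tag for its attributes, children, and siblings, as well as its inner text through the string attribute.

What this solution does is simply filter the parsed HTML for all elements whose string contains "agricultural". Because string is None for some elements, we have to check that it exists first; that is why the function tests if tag.string and returns False when there is none.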
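To make the role of the tag.string check concrete, here is a small self-contained sketch run against an inline HTML snippet instead of the live site. The markup and the function name mentions_agricultural are invented for illustration.

```python
from bs4 import BeautifulSoup

# Invented markup, loosely modelled on the statute pages.
html = """
<div class="Section">
<span class="Text">Turpentine gum is an agricultural product.</span>
<span class="Text">This sentence is about something else.</span>
<p>Some <b>agricultural</b> text with nested markup.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def mentions_agricultural(tag):
    # tag.string is None when a tag has several children, so guard first.
    return tag.string is not None and "agricultural" in tag.string


matches = soup.find_all(mentions_agricultural)
print([m.name for m in matches])
```

Note that the p element is not matched, because it has several children and its string is None; only the inner b element is. If whole paragraphs with nested markup matter, testing tag.get_text() instead of tag.string may be a better fit.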

Example

Here is what it looks like for Chapter001:

soup.find_all(lambda tag: "agricultural" in tag.string if tag.string else False)
>>>> [<span class="Text Intro Justify" xml:space="preserve">Crude turpentine gum (oleoresin), the product of a living tree or trees of the
     pine species, and gum-spirits-of-turpentine and gum resin as processed therefrom, shall be taken and understood to be agricultural 
     products, farm products, and agricultural commodities.</span>, 
     <span class="Text Intro Justify" xml:space="preserve">Whenever the terms “agriculture,” “agricultural purposes,” “agricultural uses,” or 
     words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; 
     aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; 
     and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture.
     </span>]

Calling the map function on results yields the inner strings without the span elements and the pesky attributes:

map(lambda element: element.string, soup.find_all(...))
>>>> [u'Crude turpentine gum (oleoresin), the product of a living tree or trees of the pine species, and gum-spirits-of-turpentine and gum resin as processed therefrom, shall be taken and understood to be agricultural products, farm products, and agricultural commodities.', 
      u'Whenever the terms \u201cagriculture,\u201d \u201cagricultural purposes,\u201d \u201cagricultural uses,\u201d or words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture.']

Answer 2

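You don't want to search every tag. You can select just the span tags that contain the text and filter those, using a css selector to select the tags. What you want is the text inside span class="Text Intro Justify":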

base_url = "http://www.flsenate.gov/Laws/Statutes/2015/Chapter001/All"

from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get(base_url).content)

text = [t.text for t in soup.select('div span.Text.Intro.Justify') if "agricultural" in t.text]

Which will give you:

['Crude turpentine gum (oleoresin), the product of a living tree or trees of the pine species, and gum-spirits-of-turpentine and gum resin as processed therefrom, shall be taken and understood to be agricultural products, farm products, and agricultural commodities.', u'Whenever the terms \u201cagriculture,\u201d \u201cagricultural purposes,\u201d \u201cagricultural uses,\u201d or words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture.']

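If you want the match to be case-insensitive, you need if "agricultural" in t.text.lower()

Also, if you need exact word matches, you need to split the text or use a regex with word boundaries; otherwise you will end up with false positives for certain words.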

import re

soup = BeautifulSoup(requests.get(base_url).content)

# look for exact word
r = re.compile(r"\bagricultural\b", re.I)
text = [t.text for t in soup.find_all('span', {"class": 'Text Intro Justify'}, text=r)]

Using re.I matches both agricultural and Agricultural.

Using word boundaries means that a search for "foo" will not match a string that only contains "foobar".
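The difference between a plain substring test and a word-boundary regex can be seen without any scraping at all. The sample sentences below are invented for illustration.

```python
import re

# Invented sample paragraphs.
paragraphs = [
    "Such terms include agricultural purposes.",
    "Agricultural uses are defined elsewhere.",
    "Nonagricultural land is excluded.",
]

# Plain substring test: also hits "Nonagricultural".
substring_hits = [p for p in paragraphs if "agricultural" in p.lower()]

# Word-boundary regex: only whole-word occurrences, in any letter case.
word = re.compile(r"\bagricultural\b", re.I)
word_hits = [p for p in paragraphs if word.search(p)]

print(len(substring_hits), len(word_hits))
```

The substring test accepts all three paragraphs, while the regex rejects the one where the word appears only as part of "Nonagricultural".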

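Whichever approach you take, once you know the specific tags you want to search, you should search only those tags; searching every tag can give you matches that are completely unrelated to what you actually want.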
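A minimal sketch of how searching every tag can pick up unrelated matches, using invented markup in which a navigation link happens to contain the same word:

```python
from bs4 import BeautifulSoup

# Invented markup: a navigation link plus two statute-style paragraphs.
html = """
<div class="nav"><a href="#">agricultural topics</a></div>
<div class="Section">
<span class="Text Intro Justify">Gum resin is an agricultural commodity.</span>
<span class="Text Intro Justify">This paragraph is unrelated.</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Searching every tag: the nav div and its link match as well.
every_tag = soup.find_all(
    lambda tag: tag.string is not None and "agricultural" in tag.string
)

# Searching only the spans that hold the statute text.
targeted = [
    t.text
    for t in soup.select("div span.Text.Intro.Justify")
    if "agricultural" in t.text
]

print([t.name for t in every_tag])
print(targeted)
```

The broad search returns the nav div, its link, and the span, whereas the targeted selector returns only the statute paragraph.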

If you have a lot of parsing to do that involves filtering by text, you may find lxml very powerful; using an xpath expression we can filter very easily:

base_url = "http://www.flsenate.gov/Laws/Statutes/2015/Chapter001/All"

from lxml.etree import fromstring, HTMLParser
import requests
r = requests.get(base_url).content
xml = fromstring(r, HTMLParser())

print(xml.xpath("//span[@class='Text Intro Justify' and contains(text(),'agricultural')]//text()"))

Which gives you:

['Crude turpentine gum (oleoresin), the product of a living tree or trees of the pine species, and gum-spirits-of-turpentine and gum resin as processed therefrom, shall be taken and understood to be agricultural products, farm products, and agricultural commodities.', u'Whenever the terms \u201cagriculture,\u201d \u201cagricultural purposes,\u201d \u201cagricultural uses,\u201d or words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture.']

To match either upper or lower case with xpath, we need to translate A to a:

print(xml.xpath("//span[@class='Text Intro Justify' and contains(translate(text(), 'A','a'), 'agricultural')]//text()"))
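Translating only 'A' to 'a' covers a capitalised first letter. If fully case-insensitive matching is wanted, one possible variation is to pass the whole alphabet to translate(). This is a sketch against invented inline markup, not something the answer itself does, and the UPPER and LOWER names are illustrative.

```python
from lxml.etree import fromstring, HTMLParser

# Invented markup with differently cased occurrences of the word.
html = b"""<html><body><div>
<span class="Text Intro Justify">Agricultural uses are covered here.</span>
<span class="Text Intro Justify">such terms include agricultural purposes.</span>
<span class="Text Intro Justify">Nothing relevant.</span>
</div></body></html>"""

tree = fromstring(html, HTMLParser())

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"

# translate() lowercases the span text before contains() is applied.
query = (
    "//span[@class='Text Intro Justify' and "
    "contains(translate(text(), '{0}', '{1}'), 'agricultural')]//text()"
).format(UPPER, LOWER)

matches = [str(m) for m in tree.xpath(query)]
print(matches)
```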

The \u201c you see is the repr output for “; when you actually print the string, you will see the str output.

In [3]: s = u"Whenever the terms \u201cagriculture,\u201d \u201cagricultural purposes,\u201d \u201cagricultural uses,\u201d or words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture."

In [4]: print(s)
Whenever the terms “agriculture,” “agricultural purposes,” “agricultural uses,” or words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture.
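The session above shows Python 2 behaviour. On Python 3, repr() already displays the curly quotes, so as a rough equivalent (assuming Python 3) the built-in ascii() can be used to see the escaped form:

```python
# Assumes Python 3, where ascii() gives the escaped form that
# Python 2's repr() used to show for non-ASCII characters.
s = u"\u201cagricultural purposes,\u201d"

escaped = ascii(s)  # escaped representation, with \u201c and \u201d
printed = str(s)    # the text itself, with real curly quotes

print(escaped)
print(printed)
```

Either way the underlying string is the same; only its displayed representation differs.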

Python scraping with BeautifulSoup, only scrape paragraphs with certain words
