Question

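I'm trying to get some information from an HTML page using Python, Requests and BeautifulSoup.

My problem is that I can't get key parameters (such as "Beginning Stocks" and "Domestic Crush") through BeautifulSoup, because their names contain a stray line break ("Enter") in the middle.

This is strange, because on the website itself the names are not broken across lines. I have never seen this before.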

    <m1_region_group2 region4="World  2/">
        <m1_attribute_group2_collection>
            <m1_attribute_group2 attribute4="Beginning
    Stocks">
                <cell cell_value4="77.73"></cell>
            </m1_attribute_group2>
            <m1_attribute_group2 attribute4="Production">
                <cell cell_value4="313.77"></cell>
            </m1_attribute_group2>
            <m1_attribute_group2 attribute4="Imports">
                <cell cell_value4="133.33"></cell>
            </m1_attribute_group2>
            <m1_attribute_group2 attribute4="Domestic
    Crush">
                <cell cell_value4="275.36"></cell>
            </m1_attribute_group2>
            <m1_attribute_group2 attribute4="Domestic
    Total">
                <cell cell_value4="314.35"></cell>
            </m1_attribute_group2>
            <m1_attribute_group2 attribute4="Exports">
                <cell cell_value4="132.55"></cell>
            </m1_attribute_group2>
            <m1_attribute_group2 attribute4="Ending
    Stocks">
                <cell cell_value4="77.92"></cell>
            </m1_attribute_group2>
        </m1_attribute_group2_collection>
    </m1_region_group2>

The "Imports" and "Production" parameters work fine. For example:

    x.find("m1_attribute_group2", {"attribute4":"Imports"}).find("cell")["cell_value4"]

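returns '133.33'.

But when I try to get "Domestic Total", the result is None, as if BS cannot find the parameter.

    z = x.find("m1_attribute_group2", {"attribute4":"Domestic Total"})

Does anyone know what is going on? How can I fix it?

macOS High Sierra / Python 3.6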

Answer 1

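This is just poorly formed HTML that BeautifulSoup is still able to parse. The catch is that attribute4="Domestic Total" will never match, because what sits between Domestic and Total is not a single space but a newline (followed by indentation).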
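To see why the lookup fails, it can help to look at the raw attribute value. The snippet below is a minimal sketch using only the standard library; the string literal is an assumption that mirrors the markup from the question (a newline followed by four spaces of indentation).

```python
# Assumed raw value of attribute4, mirroring the markup in the question.
raw = "Domestic\n    Total"

# repr() makes the hidden newline and indentation visible.
print(repr(raw))

# An exact comparison against the "clean" name fails.
is_exact_match = raw == "Domestic Total"

# Splitting on whitespace shows the two words are otherwise intact.
words = raw.split()
print(is_exact_match, words)
```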

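One way to solve it is to call find() with a function as the attribute4 value. The function splits the value on whitespace and rejoins it with single spaces, which effectively replaces every newline (and any run of spaces) with one space: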

In [19]: soup.find("m1_attribute_group2", attribute4=lambda x: x and " ".join(x.split()) == "Domestic Total")
Out[19]: 
<m1_attribute_group2 attribute4="Domestic
    Total">
<cell cell_value4="314.35"></cell>
</m1_attribute_group2>

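You can then generalize this into:

def filter_attribute(attr_value):
    """Build a filter that matches attr_value after collapsing whitespace."""
    def f(attr):
        # attr is None when the tag lacks the attribute
        return attr and " ".join(attr.split()) == attr_value
    return f

and use it like this: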

In [23]: soup.find("m1_attribute_group2", attribute4=filter_attribute("Domestic Total"))
Out[23]: 
<m1_attribute_group2 attribute4="Domestic
    Total">
<cell cell_value4="314.35"></cell>
</m1_attribute_group2>
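Putting the pieces together, here is a self-contained sketch that goes from markup to the cell value. The helper name get_value and the trimmed-down HTML are assumptions for illustration, and html.parser is used only because it ships with Python; the answer does not say which parser was used.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the markup from the question.
html = """
<m1_attribute_group2_collection>
    <m1_attribute_group2 attribute4="Imports">
        <cell cell_value4="133.33"></cell>
    </m1_attribute_group2>
    <m1_attribute_group2 attribute4="Domestic
    Total">
        <cell cell_value4="314.35"></cell>
    </m1_attribute_group2>
</m1_attribute_group2_collection>
"""

def filter_attribute(attr_value):
    def f(attr):
        # attr is None for tags that lack the attribute.
        return attr is not None and " ".join(attr.split()) == attr_value
    return f

def get_value(soup, name):
    """Return cell_value4 for the named attribute group, or None."""
    group = soup.find("m1_attribute_group2", attribute4=filter_attribute(name))
    if group is None:
        return None
    cell = group.find("cell")
    return cell["cell_value4"] if cell is not None else None

soup = BeautifulSoup(html, "html.parser")
domestic_total = get_value(soup, "Domestic Total")
missing = get_value(soup, "No Such Row")
print(domestic_total, missing)
```

Returning None for a missing row avoids the AttributeError you would otherwise get from chaining .find("cell") onto a failed lookup.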

Another approach is to use a regular expression with \s+ as the separator between the words, where \s+ means "one or more whitespace characters, including newlines":

In [24]: soup.find("m1_attribute_group2", attribute4=re.compile(r"Domestic\s+Total"))
Out[24]: 
<m1_attribute_group2 attribute4="Domestic
    Total">
<cell cell_value4="314.35"></cell>
</m1_attribute_group2>
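One caveat with the regular-expression variant: as far as I know, Beautiful Soup applies a compiled pattern through its search() method, so an unanchored pattern also matches longer values that merely contain the phrase. The sketch below uses only the re module to show the difference; the value "Domestic Total Use" is a made-up example, not something from the page in the question.

```python
import re

# Unanchored: matches anywhere inside the attribute value.
loose = re.compile(r"Domestic\s+Total")
# Anchored: the whole value must be the two words, ignoring outer whitespace.
strict = re.compile(r"^\s*Domestic\s+Total\s*$")

values = ["Domestic\n    Total", "Domestic Total", "Domestic Total Use"]
loose_hits = [v for v in values if loose.search(v)]
strict_hits = [v for v in values if strict.search(v)]
print(loose_hits)
print(strict_hits)
```

If an exact, whitespace-insensitive match is what you want, passing the anchored pattern as the attribute4 argument to find() should be the safer choice.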

Answer 2

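A more conservative approach is to not even try to work around the problem at query time, but to pre-process the HTML and eliminate the line breaks in element attributes. Coincidentally, this is exactly the problem we've recently solved here: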

In [25]: for tag in soup():
    ...:     tag.attrs = {
    ...:         attr: [" ".join(attr_value.replace("\n", " ").split()) for attr_value in value] 
    ...:               if isinstance(value, list)
    ...:               else " ".join(value.replace("\n", " ").split())
    ...:         for attr, value in tag.attrs.items()
    ...:     }
    ...:     

In [26]: soup.find("m1_attribute_group2", {"attribute4":"Domestic Total"})
Out[26]: 
<m1_attribute_group2 attribute4="Domestic Total">
<cell cell_value4="314.35"></cell>
</m1_attribute_group2>
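The comprehension in that session packs a lot into one expression. Here is a sketch of the same normalization written as two small functions that operate on a plain dict, so the logic can be checked without parsing anything. The names normalize_value and normalize_attrs are hypothetical. Note that str.split() with no arguments already splits on newlines, so a separate replace("\n", " ") step is not needed in this version.

```python
def normalize_value(value):
    """Collapse every run of whitespace (newlines included) into one space."""
    return " ".join(value.split())

def normalize_attrs(attrs):
    """Return a copy of an attribute dict with whitespace-normalized values."""
    return {
        name: [normalize_value(v) for v in value] if isinstance(value, list)
        else normalize_value(value)
        for name, value in attrs.items()
    }

# List values are handled too, since some attributes (such as class) are lists.
example = {"attribute4": "Domestic\n    Total", "class": ["a", "b"]}
cleaned = normalize_attrs(example)
print(cleaned)
```

With a parsed document, this could be applied in a loop such as `for tag in soup(): tag.attrs = normalize_attrs(tag.attrs)`.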

Answer 3

In addition to what alecxe has already shown, you can also do the following if your requirement is to find any attribute related to the previous one:

from bs4 import BeautifulSoup

soup = BeautifulSoup(content, "lxml")  # content holds the page's HTML
for item in soup.select("m1_region_group2 m1_attribute_group2"):
    post = ' '.join(item['attribute4'].split())
    if "Beginning Stocks" in post:  # try to see if it misses any attribute
        val = item.find_next("cell")['cell_value4']
        print(val)

Result:

77.73
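The substring test with `in` would match more than one row for a shorter key such as "Stocks" (both "Beginning Stocks" and "Ending Stocks"). If you need several values, one alternative is to build a dictionary keyed by the normalized name in a single pass. This is only a sketch: the HTML is a shortened copy of the question's markup, html.parser is used so that lxml is not required, and find_all() stands in for the select() call above.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
html = """
<m1_region_group2 region4="World  2/">
    <m1_attribute_group2 attribute4="Beginning
    Stocks"><cell cell_value4="77.73"></cell></m1_attribute_group2>
    <m1_attribute_group2 attribute4="Production"><cell cell_value4="313.77"></cell></m1_attribute_group2>
    <m1_attribute_group2 attribute4="Ending
    Stocks"><cell cell_value4="77.92"></cell></m1_attribute_group2>
</m1_region_group2>
"""

soup = BeautifulSoup(html, "html.parser")

# Map each whitespace-normalized attribute name to its cell value.
values = {}
for item in soup.find_all("m1_attribute_group2"):
    key = " ".join(item["attribute4"].split())
    cell = item.find("cell")
    if cell is not None:
        values[key] = cell["cell_value4"]

print(values["Beginning Stocks"], values["Ending Stocks"])
```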

How to get information from a broken key parameter - BS4 and Requests (Python)
