Question

I am using this code:

from bs4 import BeautifulSoup
import glob
import os
import re

def trade_spider():
    os.chdir(r"C:\Users\6930p\FLO'S DATEIEN\Master FAU\Sommersemester 2016\02_Masterarbeit\04_Testumgebung\01_Probedateien für Analyseaspekt\Independent Auditors Report")
    for file in glob.glob('*.html'):
        with open(file, encoding="utf8") as f:
            contents = f.read()
            soup = BeautifulSoup(contents, "html.parser")
            results = [item for item in soup.findAll("ix:nonfraction") if re.match("^[^:]:AuditFeesExpenses", item['name'])]
            print(results)
                #print(file, end="| ")
                #print(item['name'], end="| ")
                #print(item.get_text())
trade_spider()

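I am trying to parse multiple HTML documents in a directory on my computer using BS4. My goal is to find tags that begin with "ix:NonFraction ..." and carry a name attribute in which various prefixes can precede "AuditFeesExpenses", e.g. name="aurep:AuditFeesExpenses", name="bus:AuditFeesExpenses", and so on (that is why I am using a regex). If BS4 finds such a tag, I want to extract its text with get_text().

Does anyone know what I am missing?

Update: the example markup is: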

    <td style=" width:12.50%; text-align:right; " class="ta_60">
<ix:nonFraction contextRef="ThirdPartyAgentsHypercube_FY_31_12_2012_Set1"
 name="ns19:AuditFeesExpenses" unitRef="GBP" decimals="0"
 format="ixt2:numdotdecimal" scale="0" xmlns:ix="http://www.xbrl.org
/2008/inlineXBRL">3,600</ix:nonFraction></td>

Normally this markup appears on a single line; I inserted some line breaks for clarity!

My final code looks like this:

from bs4 import BeautifulSoup
import glob
import os
import re

def trade_spider():
    os.chdir(r"C:\Users\6930p\FLO'S DATEIEN\Master FAU\Sommersemester 2016\02_Masterarbeit\04_Testumgebung\01_Probedateien für Analyseaspekt\Independent Auditors Report")
    for file in glob.glob('*.html'):
        with open(file, encoding="utf8") as f:
            contents = f.read()
            soup = BeautifulSoup(contents, "html.parser")
            for item in soup.findAll("ix:nonfraction"):
                if re.match(".*AuditFeesExpenses", item['name']):
                    print(file, end="| ")
                    print(item['name'], end="| ")
                    print(item.get_text())
trade_spider()

And it gives me this output:

Prod224_0010_00079350_20140331.html| uk-aurep:AuditFeesExpenses| 2000

Answer 1

The first parameter of findAll() is name. When you call

`soup.findAll('ix:NonFraction', name=re.compile("^[^:]:AuditFeesExpenses"))`,

you are actually calling findAll with name='ix:NonFraction' AND name=re.compile("^[^:]:AuditFeesExpenses"). Of course, name can only be set to one of these two inputs, which produces the error.
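One way around the clash is to pass the HTML attribute through the attrs dictionary, which the Beautiful Soup documentation describes for attributes called name. The following is a minimal sketch, assuming markup shaped like the sample in the question; the html string and the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

# Illustrative markup modelled on the sample in the question (an assumption).
html = (
    '<td><ix:nonFraction name="ns19:AuditFeesExpenses" unitRef="GBP">'
    '3,600</ix:nonFraction></td>'
    '<td><ix:nonFraction name="ns19:OtherExpenses" unitRef="GBP">'
    '100</ix:nonFraction></td>'
)
soup = BeautifulSoup(html, "html.parser")

# The first positional argument is the tag name; html.parser lowercases it.
# The HTML attribute called "name" goes into attrs, so the two do not collide.
pattern = re.compile(r"^[^:]+:AuditFeesExpenses$")
matches = soup.find_all("ix:nonfraction", attrs={"name": pattern})

texts = [tag.get_text() for tag in matches]
print(texts)
```

Here the filtering happens inside find_all, so no separate re.match step is needed afterwards.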

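The error message mentions find_all() rather than findAll(). From the docs, we find that findAll is the old method name for find_all. The find_all method should be used.

The confusion probably comes from the attribute name. It is important to distinguish the BeautifulSoup attribute name from the HTML attribute name. To demonstrate, I will assume the tag has the following format: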

<body>
    <ix:NonFraction name="AuditFeesExpenses">stuff</ix:NonFraction>
</body>

We can find all <ix:NonFraction> tags with soup.find_all("ix:nonfraction"). This gives the following list containing the result:

[<ix:nonfraction name="AuditFeesExpenses">stuff</ix:nonfraction>]

Iterate over this one-item list to see the two different name attributes. First, we access the BeautifulSoup name attribute as a property of the object:

for item in soup.find_all("ix:nonfraction"):
    print(item.name)

Out: 'ix:nonfraction'

To see the HTML name attribute, access name as a dictionary key:

for item in soup.find_all("ix:nonfraction"):
    print(item['name'])

Out: 'AuditFeesExpenses'

Combine the two lookups to narrow the search:

results = [item for item in soup.find_all("ix:nonfraction") if re.match("^[^:]+:AuditFeesExpenses", item['name'])]

Out: [<ix:nonfraction name="ns19:AuditFeesExpenses">3,600</ix:nonfraction>]

Or, if we want the text of each match:

results = [item.get_text() for item in soup.find_all("ix:nonfraction") if re.match("^[^:]+:AuditFeesExpenses", item['name'])]

Out: ['3,600']
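The prefix in front of the colon varies in length (aurep, bus and ns19 all appear in the question), so the quantifier after [^:] matters. A minimal sketch using only the standard re module compares a pattern without a quantifier against one with +; the names are taken from the question, with one non-matching name added for illustration.

```python
import re

# Attribute values from the question, plus one made-up non-matching name.
names = [
    "aurep:AuditFeesExpenses",
    "bus:AuditFeesExpenses",
    "ns19:AuditFeesExpenses",
    "ns19:OtherExpenses",
]

# [^:] without a quantifier accepts exactly one character before the colon.
one_char = re.compile(r"^[^:]:AuditFeesExpenses")
# [^:]+ accepts one or more characters before the colon.
one_plus = re.compile(r"^[^:]+:AuditFeesExpenses")

short_hits = [n for n in names if one_char.match(n)]
full_hits = [n for n in names if one_plus.match(n)]

print(short_hits)
print(full_hits)
```

With these inputs the first pattern matches nothing, because none of the prefixes is a single character, while the second picks out the three AuditFeesExpenses names.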

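Suggested code for the full output:

from bs4 import BeautifulSoup
import glob
import os
import re

def trade_spider():
    os.chdir(r"C:\Independent Auditors Report")
    for file in glob.glob('*.html'):
        with open(file, encoding="utf8") as f:
            contents = f.read()
            soup = BeautifulSoup(contents, "html.parser")
            # html.parser lowercases tag names, hence "ix:nonfraction"
            for item in soup.find_all("ix:nonfraction"):
                # accept any namespace prefix followed by ":AuditFeesExpenses";
                # .get() avoids a KeyError on tags without a name attribute
                if re.match("^[^:]+:AuditFeesExpenses", item.get('name', '')):
                    print(file, end="| ")
                    print(item['name'], end="| ")
                    print(item.get_text())
trade_spider()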
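If the values are needed for further processing rather than just printed, a variant that returns them may be handier. This is only a sketch: the function name collect_audit_fees, the NAME_PATTERN constant and the directory parameter are hypothetical and not part of the original answer, and the HTML name attribute is filtered through the attrs dictionary of find_all instead of a separate re.match call.

```python
import re
from pathlib import Path
from bs4 import BeautifulSoup

# Any namespace prefix followed by ":AuditFeesExpenses" (hypothetical constant).
NAME_PATTERN = re.compile(r"^[^:]+:AuditFeesExpenses$")

def collect_audit_fees(directory):
    """Return (file name, name attribute, text) for every matching tag."""
    rows = []
    for path in sorted(Path(directory).glob("*.html")):
        soup = BeautifulSoup(path.read_text(encoding="utf8"), "html.parser")
        # Tag name as first argument, HTML "name" attribute via attrs.
        for tag in soup.find_all("ix:nonfraction", attrs={"name": NAME_PATTERN}):
            rows.append((path.name, tag["name"], tag.get_text()))
    return rows
```

Calling collect_audit_fees(r"C:\Independent Auditors Report") would then give a list of tuples that can be printed, written to CSV, or converted to numbers, without changing the working directory.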
