Question

I have the following HTML snippet, and I am searching it for specific words.

<tbody>
            <tr>
                <th>Berufsbezeichnung:</th>
                <td class="gray">ExampleName</td>
            </tr>
                        <tr>
                <th>Anrede:</th>
                <td class="gray">Herrn</td>
            </tr>
                        <tr>
                <th>Name:</th>
                <td class="gray">ExampleLastName</td>
            </tr>
                        <tr>
                <th>Vorname:</th>
                <td class="gray">ExampleSurname</td>
            </tr>
            …
</tbody>

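I want separate variables ("Berufsbezeichnung", "Anrede", ...) that are filled with the matching content. In some records, for example, "Berufsbezeichnung" is missing, so that variable has to stay empty.

I tried a Scrapy script to search for the content, but it does not work: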

soup = BeautifulSoup(response.css('table').extract()[0],'lxml')

for elem in soup.findAll('tr'):
    for eleme in elem.findAll('th'):
        if eleme.get_text()=='Berufsbezeichnung:':
            Berufsbezeichnung = elem.css('td.gray::text')
        if eleme.get_text()=='Anrede:':
            Anrede = elem.css('td.gray::text')
        ...

Does anyone have an idea, or perhaps an easier approach?

Thanks a lot!

Answer 1

As @eLRuLL pointed out in a comment, I don't see why you are using BeautifulSoup, since Scrapy already ships with powerful selectors of its own.

For your case, I would suggest using XPath only:

extracted_values = {}  # Store the extracted values in a dictionary

# Iterate over the tr nodes contained in the table node
for tr_selector in response.selector.xpath('//table//tr'):
    th_text = tr_selector.xpath('./th/text()').extract_first()

    if th_text:  # The th node contains text, so read the text from the td node
        extracted_values[th_text] = tr_selector.xpath('./td/text()').extract_first()
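
The question also asks that a missing field such as "Berufsbezeichnung" stay empty. Here is a minimal sketch of one way to do that with the dictionary built by the loop above. It assumes the labels keep their trailing colon, as in the sample HTML. `EXPECTED_FIELDS` and `normalize_fields` are hypothetical names introduced here for illustration; they are not part of Scrapy.

```python
# Hypothetical helper (not a Scrapy API): maps scraped labels to a fixed set of fields.
EXPECTED_FIELDS = ("Berufsbezeichnung", "Anrede", "Name", "Vorname")


def normalize_fields(extracted_values, expected=EXPECTED_FIELDS):
    """Return one entry per expected field; fields missing from the record become ''."""
    cleaned = {}
    for label, value in extracted_values.items():
        # "Anrede:" -> "Anrede"; also tolerate stray whitespace around the label
        key = label.strip().rstrip(":")
        # A row whose td has no text yields None; treat that as empty too
        cleaned[key] = (value or "").strip()
    # dict.get supplies the empty default for fields absent from this record
    return {field: cleaned.get(field, "") for field in expected}


# Example record in which the "Berufsbezeichnung" and "Vorname" rows are absent
record = normalize_fields({"Anrede:": "Herrn", "Name:": "ExampleLastName"})
```

With this, `record["Berufsbezeichnung"]` is an empty string rather than raising a `KeyError`.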

Answer 2

Try this:

# Build an XPath for a given header label; extract_first() returns None if the row is missing
search_by_header = '//th[contains(., "{}")]/following-sibling::td/text()'.format
Berufsbezeichnung = response.xpath(search_by_header("Berufsbezeichnung")).extract_first()
Anrede = response.xpath(search_by_header("Anrede")).extract_first()
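
Because `contains()` matches substrings, a short label can also match a longer header that happens to include it. The sketch below is a stricter variant that compares the whole header text. It assumes every header ends with a colon, as in the sample, and that labels contain no double quotes. `exact_header_query` and `FIELDS` are hypothetical names, and the commented lines show how the queries might be used inside a spider callback.

```python
FIELDS = ("Berufsbezeichnung", "Anrede", "Name", "Vorname")


def exact_header_query(label):
    """Build an XPath selecting the td text next to a th whose text is exactly 'label:'."""
    return '//th[normalize-space(.)="{}:"]/following-sibling::td/text()'.format(label)


# One query per field, keyed by field name
queries = {field: exact_header_query(field) for field in FIELDS}

# Inside a Scrapy callback (assumed context), a missing row yields "" via the default:
# values = {f: response.xpath(q).extract_first(default="") for f, q in queries.items()}
```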

Scrapy: selecting specific words by searching for characters in HTML text
