NoneType error when trying to create a custom BeautifulSoup Dagster type

Asked: 2020-01-25 00:24:48

Tags: python-3.x beautifulsoup dagster

I've been playing around with @dagster_type and trying to build a custom HtmlSoup type: essentially a thin @dagster_type wrapper around a BeautifulSoup object.

import requests
from bs4 import BeautifulSoup
from dagster import (
    dagster_type,
    input_hydration_config,
    Selector,
    Field,
    String,
    TypeCheck,
    EventMetadataEntry
)

def max_depth(soup):
    if hasattr(soup, "contents") and soup.contents:
        return max([max_depth(child) for child in soup.contents]) + 1
    else:
        return 0

def html_soup_type_check(value):
    if not isinstance(value, BeautifulSoup):
        return TypeCheck(
            success=False,
            description=(
                'HtmlSoup should be a BeautifulSoup Object, got '
                '{type_}'
            ).format(type_=type(value))
        )

    # Check the value being validated; `soup` is not defined in this function.
    if not hasattr(value, "contents"):
        return TypeCheck(
            success=False,
            description=(
                'HtmlSoup has no contents, check that the URL has content'
            )
        )

    return TypeCheck(
        success=True,
        description='HtmlSoup Summary Stats',
        metadata_entries=[
            EventMetadataEntry.text(
                str(max_depth(value)),
                'max_depth',
                'Max Nested Depth of the Page Soup'
            ),
            EventMetadataEntry.text(
                str(set(tag.name for tag in value.find_all())),
                'tag_names',
                'All available tags in the Page Soup'
            )
        ]
    )


@input_hydration_config(
    Selector(
        {
            'url': Field(
                String,
                is_optional=False,
                description=(
                    'URL to be ingested and converted to a Soup Object'
                )
            )
        }
    )
)
def html_soup_input_hydration_config(context, selector):
    url = selector['url']
    res = requests.get(url, params={
        'Content-type': 'text/html'
    })

    if (not res.status_code == 200):
        return TypeCheck(
            success=False,
            description=(
                '{status_code} ERROR, Check that URL: {url} is correct'
            ).format(status_code=res.status_code, url=url)
        )
    soup = BeautifulSoup(res.content, 'html.parser')
    return HtmlSoup(soup)

@dagster_type(
    name='HtmlSoup',
    description=(
        'The HTML extracted from a URL stored in '
        'a BeautifulSoup object.'
    ),
    type_check=html_soup_type_check,
    input_hydration_config=html_soup_input_hydration_config
)
class HtmlSoup(BeautifulSoup):
    pass

This is what I've been trying, but whenever I invoke a solid that takes the HtmlSoup type as an input, for example

@solid
def get_url(context, soup: HtmlSoup):
    return soup.contents

I get this error

TypeError: 'NoneType' object is not callable

  File "/Users/John/Documents/.../venv/lib/python3.7/site-packages/dagster/core/engine/engine_inprocess.py", line 241, in dagster_event_sequence_for_step
    for step_event in check.generator(_core_dagster_event_sequence_for_step(step_context)):
  File "/Users/John/Documents/.../venv/lib/python3.7/site-packages/dagster/core/engine/engine_inprocess.py", line 492, in _core_dagster_event_sequence_for_step
    for input_name, input_value in _input_values_from_intermediates_manager(step_context).items():
  File "/Users/John/Documents/.../venv/lib/python3.7/site-packages/dagster/core/engine/engine_inprocess.py", line 188, in _input_values_from_intermediates_manager
    step_context, step_input.config_data
  File "/Users/John/Documents/.../venv/lib/python3.7/site-packages/dagster/core/types/config_schema.py", line 73, in construct_from_config_value
    return func(context, config_value)
  File "/Users/John/Documents/.../custom_types/html_soup.py", line 82, in html_soup_input_hydration_config
    return HtmlSoup(soup)
  File "/Users/John/Documents/.../venv/lib/python3.7/site-packages/bs4/__init__.py", line 286, in __init__
    markup = markup.read()

I also get this additional information

An exception was thrown during execution that is likely a framework error, rather than an error in user code.
Original error message: TypeError: 'NoneType' object is not callable

I've spent a while digging into the internals of the @dagster_type decorator and how the @input_hydration_config decorator works, but so far I'm a bit lost.

Any help is appreciated!

1 Answer:

Answer 0 (score: 1)

Actually, I was able to solve this by using the as_dagster_type method described in the documentation:

from dagster import as_dagster_type

HtmlSoup = as_dagster_type(
    BeautifulSoup,
    name='BeautifulSoupHTML',
    description='''
        Beautiful Soup HTML Object
    ''',
    input_hydration_config=html_soup_input_hydration_config,
    type_check=html_soup_type_check
)
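
As for why the original attempt failed: the traceback in the question ends at `markup = markup.read()` inside the BeautifulSoup constructor, reached from `HtmlSoup(soup)`. A likely explanation is that a soup object resolves unknown attribute names as child-tag lookups, so `soup.read` evaluates to None instead of raising AttributeError, `hasattr(soup, 'read')` reports True, and the constructor then tries to call None. The sketch below reproduces that mechanism with the standard library only. `FakeTag` and `load_markup` are hypothetical stand-ins written for illustration; they are not bs4 code.

```python
class FakeTag:
    """Hypothetical stand-in for an object whose unknown attributes
    fall back to a child lookup that may return None."""

    def __init__(self, children=None):
        self.children = children or {}

    def find(self, name):
        # Return the matching child, or None when there is no such child.
        return self.children.get(name)

    def __getattr__(self, name):
        # Only invoked when normal attribute lookup fails.
        if not name.startswith("__"):
            return self.find(name)
        raise AttributeError(name)


def load_markup(markup):
    # Same shape as the file-like check at the bottom of the traceback.
    if hasattr(markup, "read"):
        markup = markup.read()
    return markup


soup = FakeTag({"title": "Example"})

has_read = hasattr(soup, "read")  # the lookup succeeds, so this is True
read_attr = soup.read             # but the value it yields is None

try:
    load_markup(soup)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(has_read, read_attr, error_message)
```

With as_dagster_type wrapping BeautifulSoup itself, the hydration function would presumably return the parsed soup directly rather than passing it through a subclass constructor, which sidesteps that call. This is an inference from the traceback, not something the answer states.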