Question

The following code:

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import re

def getDates(URL):
    dates = []
    # if page not found, HTTPError is thrown
    try:
        html = urlopen(URL)
    except HTTPError:
        print("Page not found.")
        return None

    bsObj = BeautifulSoup(html, "lxml")
    data = bsObj.find("table", {"class":"sortable wikitable"}).children
    for child in data:
        print(child)

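This produces sample output like the following:

<tr> <td><a href="/wiki/89th_Academy_Awards" title="89th Academy Awards">89th</a></td> <td>February 26, 2017</td> <td>2016</td> <td><a href="/wiki/Moonlight_(2016_film)" title="Moonlight (2016 film)">Moonlight</a></td> <td>217 !3 hours, 49 minutes</td> <td>32.9 million</td> <td>22.4</td> <td rowspan="2"><a href="/wiki/Jimmy_Kimmel" title="Jimmy Kimmel">Jimmy Kimmel</a></td> </tr>

The only line I want to grab is the one with the date, here February 26, 2017. There are 80-odd entries just like this one. I have tried requesting the siblings of the top <tr> row, and I get back something I can neither catch with except nor loop around (as other posts suggest), because Spyder says NavigableString is not defined, cannot be imported, and is not a recognized error (writing except NavigableString just produces a blank screen). I know there is whitespace in there. I have also tried finding every child tag whose string can be parsed by a regular expression matching the date. That does not work either: the error says I cannot pass this argument to my .find() call, although the documentation, which I have in front of me, says otherwise.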

Any ideas about what is going wrong, and how I can get at this one line?

Answer 1

If you want to work with all the <td> tags as a list, you can index into that list to get the second item:

html_doc = """
    <tr>
    <td><a href="/wiki/89th_Academy_Awards" title="89th Academy Awards">89th</a></td>
    <td>February 26, 2017</td>
    <td>2016</td>
    <td><i><a href="/wiki/Moonlight_(2016_film)" title="Moonlight (2016 film)">Moonlight</a></i></td>
    <td><span class="sortkey" style="display:none;">217 !</span><span class="sorttext">3 hours, 49 minutes</span></td>
    <td>32.9 million</td>
    <td>22.4</td>
    <td rowspan="2"><a href="/wiki/Jimmy_Kimmel" title="Jimmy Kimmel">Jimmy Kimmel</a></td>
    </tr>
    """

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

all_tds = soup.find_all('td')

print(all_tds[1].text)  # index the 2nd item

Output:

February 26, 2017
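
For a whole table rather than a single row, the same indexing idea can be applied row by row. The following is only a minimal sketch, assuming every data row keeps the date in its second <td>; the two-row table and the helper name second_cells are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table standing in for the real page.
html_doc = """
<table class="sortable wikitable">
<tr><th>Ceremony</th><th>Date</th></tr>
<tr><td>88th</td><td>February 28, 2016</td></tr>
<tr><td>89th</td><td>February 26, 2017</td></tr>
</table>
"""

def second_cells(markup):
    """Return the text of the second <td> of every row that has one."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) > 1:  # the header row has <th> cells, so it is skipped
            results.append(cells[1].get_text(strip=True))
    return results

dates = second_cells(html_doc)
print(dates)
```

Because find_all returns only tags, the whitespace text nodes that .children yields between rows never reach the loop, which may be what tripped up the NavigableString handling in the question.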

Answer 2

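A regular expression is probably the right approach; indexing is probably the wrong one

The date cell could be in any column, so do not assume it is the second one. (Do you also generate the HTML? Does your generation expose variables that control both generation and processing? Is there a fetching layer in between?) Simple changes in the future, such as sorting or configurable table columns, could break your code. Consider the following code.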

time_y
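
What column-independent matching could look like is sketched below. This is not the answerer's original code, just a minimal illustration: the two-row table (with the date deliberately in a different column in each row) and the name DATE_PATTERN are assumptions made up for the example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical table: the date sits in a different column in each row.
html_doc = """
<table>
<tr><td>89th</td><td>February 26, 2017</td><td>2016</td></tr>
<tr><td>March 2, 2014</td><td>86th</td><td>2013</td></tr>
</table>
"""

# Month name, day, comma, four-digit year, e.g. "February 26, 2017".
DATE_PATTERN = re.compile(r"^[A-Z][a-z]+ \d{1,2}, \d{4}$")

soup = BeautifulSoup(html_doc, "html.parser")
# Keep only the <td> tags whose text matches the pattern, whatever their column.
date_cells = soup.find_all("td", string=DATE_PATTERN)
found = [cell.get_text() for cell in date_cells]
print(found)
```

The string argument was introduced in Beautiful Soup 4.4.0; older releases spell it text, which is one possible reason for .find() rejecting an argument that the current documentation describes.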

Answer 3

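Many thanks for the notes on the need for looping, the use of tags, and the usefulness of regular expressions. The following code produced the desired result.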

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import re

def getDates(URL):
    # if page not found, HTTPError is thrown
    try:
        html = urlopen(URL)
    except HTTPError:
        print("Page not found.")
        return None

    bsObj = BeautifulSoup(html, "lxml")
    data = bsObj.find("table", {"class":"sortable wikitable"})
    table_data = data.find_all("td", string=re.compile(r"^[A-Za-z]+ [0-9]+, [0-9]+"))
    print(table_data)

getDates("https://en.wikipedia.org/wiki/List_of_Academy_Awards_ceremonies")

The result set looks like this:

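[<td>May 16, 1929</td>, <td>April 3, 1930</td>, <td>November 5, 1930</td>, <td>November 10, 1931</td>, <td>November 18, 1932</td>, <td>March 16, 1934</td>, <td>February 27, 1935</td>, <td>March 5, 1936</td>, <td>March 4, 1937</td>, <td>March 10, 1938</td>, <td>February 23, 1939</td>, <td>February 29, 1940</td>, <td>February 27, 1941</td>, <td>February 26, 1942</td>, <td>March 4, 1943</td>, <td>March 2, 1944</td>, <td>March 15, 1945</td>, <td>March 7, 1946</td>, <td>March 13, 1947</td>, <td>March 20, 1948</td>, <td>March 24, 1949</td>, <td>March 23, 1950</td>, <td>March 29, 1951</td>, <td>March 20, 1952</td>, <td>March 19, 1953</td>, <td>March 25, 1954</td>, <td>March 30, 1955</td>, <td>March 21, 1956</td>, <td>March 27, 1957</td>, <td>March 26, 1958</td>, <td>April 6, 1959</td>, <td>April 4, 1960</td>, <td>April 17, 1961</td>, <td>April 9, 1962</td>, <td>April 8, 1963</td>, <td>April 13, 1964</td>, <td>April 5, 1965</td>, <td>April 18, 1966</td>, <td>April 10, 1967</td>, <td>April 10, 1968</td>, <td>April 14, 1969</td>, <td>April 7, 1970</td>, <td>April 15, 1971</td>, <td>April 10, 1972</td>, <td>March 27, 1973</td>, <td>April 2, 1974</td>, <td>April 8, 1975</td>, <td>March 29, 1976</td>, <td>March 28, 1977</td>, <td>April 3, 1978</td>, <td>April 9, 1979</td>, <td>April 14, 1980</td>, <td>March 31, 1981</td>, <td>March 29, 1982</td>, <td>April 11, 1983</td>, <td>April 9, 1984</td>, <td>March 25, 1985</td>, <td>March 24, 1986</td>, <td>March 30, 1987</td>, <td>April 11, 1988</td>, <td>March 29, 1989</td>, <td>March 26, 1990</td>, <td>March 25, 1991</td>, <td>March 30, 1992</td>, <td>March 29, 1993</td>, <td>March 21, 1994</td>, <td>March 27, 1995</td>, <td>March 25, 1996</td>, <td>March 24, 1997</td>, <td>March 23, 1998</td>, <td>March 21, 1999</td>, <td>March 26, 2000</td>, <td>March 25, 2001</td>, <td>March 24, 2002</td>, <td>March 23, 2003</td>, <td>February 29, 2004</td>, <td>February 27, 2005</td>, <td>March 5, 2006</td>, <td>February 25, 2007</td>, <td>February 24, 2008</td>, <td>February 22, 2009</td>, <td>March 7, 2010</td>, <td>February 27, 2011</td>, <td>February 26, 2012</td>, <td>February 24, 2013</td>, <td>March 2, 2014</td>, <td>February 22, 2015</td>, <td>February 28, 2016</td>, <td>February 26, 2017</td>]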
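
If the dates are needed as values rather than as printed tags, the matched strings could be converted afterwards. A minimal sketch using only the standard library follows; the helper name parse_ceremony_dates is hypothetical, and it assumes English month names (the %B directive follows the current locale).

```python
from datetime import datetime

def parse_ceremony_dates(strings):
    """Convert strings such as 'February 26, 2017' into datetime.date objects."""
    # %B = full month name, %d = day of month, %Y = four-digit year
    return [datetime.strptime(s.strip(), "%B %d, %Y").date() for s in strings]

# In getDates above, the input could be [td.get_text() for td in table_data].
sample = ["May 16, 1929", "February 26, 2017"]
parsed = parse_ceremony_dates(sample)
print(parsed[1].isoformat())
```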

Accessing table data with BeautifulSoup
