Building a table from multiple links found on one page

Time: 2019-03-25 16:07:49

Tags: python web-scraping beautifulsoup python-requests

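I need to extract data from a website. I have already collected the list of URLs that host the data and I can pull the data from them, but I cannot get it into tabular form.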

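I have tried several versions of the code: I extract the href links and append them to a list. I am using the requests and BeautifulSoup libraries to extract the data.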

url = 'https://www.flinders.edu.au/directory/index.cfm/search/results?page=1&lastnamesearch=A&firstnamesearch=&ousearch='

for rows in df_link['Name']:
    url = rows
    browser.get(url)
    html = browser.page_source
    soup = BeautifulSoup(html, 'lxml')
for table in soup.find_all('table', {'summary' : 'Staff list that match search criteria'}):
    n_columns = 0
    n_rows = 0
    column_names = []

    column_names = [th.get_text() for th in table.select('th')]
    n_columns = len(column_names)

    rows = table.select('tr')[1:]
    n_rows = len(rows)

    df = pd.DataFrame(columns=column_names, index=range(n_rows))

    r_index = 0
    for row in rows:
        c_index = 0
        for cell in row.select('td'):
            anchor = cell.select_one('a')
            df.iat[r_index, c_index] = anchor.get('href') if anchor else cell.get_text()

            c_index += 1
        r_index += 1

    #c_index = 1
    #for nam in row.find_all('a', {'class' : 'directory directory-person'}):

     #   df.iat[r_index, c_index] = nam.get_text()

      #  c_index += 1
    #r_index += 1

    print(df)

urls = []
for row in df['Name\xa0⬆']:
   urls.append(link+row)

for url in urls:
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    for name in soup.find_all('span', {'class': 'directory directory-entity'}):
        results['Name'] = name.text
    p = []
    for row in soup.find_all('tr'):
        position = row.find_all('td')
        p.append(position[0].text)
        results['Position'] = p[1]
        results['Phone'] = p[4]
        results['Email'] = p[9].replace('\n', '')
    print(results)

I expect the results in tabular form. Any help would be appreciated.

1 answer:

Answer 0: (score: 0)

You can do the following using pandas and BeautifulSoup 4.7.1.
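A minimal sketch of this approach (not the answerer's original code). It assumes each results page contains the table matched by the summary text from the question, and that linked cells should keep both their text and an absolute URL. The function names and the extra "<header> link" column are hypothetical, chosen only for illustration.

```python
from urllib.parse import urljoin

import pandas as pd
from bs4 import BeautifulSoup

# Assumption: the summary text is copied from the question's code.
TABLE_SELECTOR = 'table[summary="Staff list that match search criteria"]'


def staff_table_to_frame(html, base_url):
    """Return one DataFrame row per table row of a single results page."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.select_one(TABLE_SELECTOR)
    if table is None:
        return pd.DataFrame()

    headers = [th.get_text(strip=True) for th in table.select('th')]
    records = []
    for tr in table.select('tr'):
        cells = tr.select('td')
        if not cells:  # the header row holds only <th> cells
            continue
        record = {}
        for header, cell in zip(headers, cells):
            record[header] = cell.get_text(strip=True)
            anchor = cell.select_one('a[href]')
            if anchor is not None:
                # Keep the link next to the text, resolved against base_url.
                record[header + ' link'] = urljoin(base_url, anchor['href'])
        records.append(record)
    return pd.DataFrame(records)


def combine_pages(pages, base_url):
    """Concatenate the tables of several pages (HTML strings) into one frame."""
    frames = [staff_table_to_frame(html, base_url) for html in pages]
    frames = [frame for frame in frames if not frame.empty]
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

To use it on the live site, pass the HTML of each page, for example `requests.get(url, headers=headers).text` or `browser.page_source`, to `combine_pages`. Building a list of dicts and creating the DataFrame once avoids pre-allocating a frame and filling it cell by cell with `df.iat`.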


Example output: [screenshot not preserved]


If you run into problems with the name and phone, you can also do the following:
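A minimal sketch of one such approach (not the answerer's original code). Instead of relying on fixed positions such as `p[4]` or `p[9]`, it looks up each value by its label. The span classes come from the question. The assumption that every detail row pairs a `<th>` label with a `<td>` value is a guess about the page layout and may need adjusting. The function names are hypothetical.

```python
import pandas as pd
from bs4 import BeautifulSoup


def parse_profile(html):
    """Collect the name plus every labelled row of one profile page."""
    soup = BeautifulSoup(html, 'html.parser')
    record = {}

    # Class names taken from the question's code.
    name = soup.select_one('span.directory.directory-entity')
    if name is not None:
        record['Name'] = name.get_text(strip=True)

    # Assumption: each detail row is <tr><th>Label</th><td>Value</td></tr>.
    for tr in soup.select('tr'):
        label, value = tr.find('th'), tr.find('td')
        if label is None or value is None:
            continue
        key = label.get_text(strip=True).rstrip(':')
        # Collapse newlines and repeated spaces inside the value.
        record[key] = ' '.join(value.get_text().split())
    return record


def profiles_to_frame(pages):
    """Build one table from many profile pages; missing fields become NaN."""
    return pd.DataFrame([parse_profile(html) for html in pages])
```

Keying on the label means a profile that lacks a phone number or has an extra row does not shift the other fields, which is the usual cause of an `IndexError` or wrong values with positional indexing.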
