Crawling a web page with Python

Date: 2020-02-15 01:58:46

Tags: python web-crawler

I'd like to scrape the topic names from a web page, and the word search should find every occurrence of a word I decide to look for on the page. So far my code doesn't work:

import requests
import csv
from bs4 import BeautifulSoup

start_urls = 'https://en.wikipedia.org/wiki/Data_science'
r = requests.get(start_urls)
soup = BeautifulSoup(r.content, 'html.parser')

crawled_page = []
for page in soup.findAll('data'):  # searches for <data> tags, which this page does not contain
  crawled_page.append(page.get('href'))
print(crawled_page)


Error message:
C:\Users\tette\PycharmProjects\WebcrawlerProject\venv\Scripts\python.exe C:/Users/tette/PycharmProjects/WebcrawlerProject/webScrapy/webScrapy/spiders/webcrawler.py
[]

Process finished with exit code 0

1 answer:

Answer 0 (score: 1):

If you want to search for a word in the text, you should use string=re.compile('data'). But what it finds is then a NavigableString, not a Tag, so you may have to go up to its parent to look up attributes such as href:

import requests
from bs4 import BeautifulSoup, NavigableString
import re

start_urls = 'https://en.wikipedia.org/wiki/Data_science'
r = requests.get(start_urls)
soup = BeautifulSoup(r.content, 'html.parser')

crawled_page = []
for page in soup.findAll(string=re.compile('data')):
    #print(isinstance(page, NavigableString))
    #print(page.parent)
    href = page.parent.get('href')  # the matching text node's parent tag may carry the href
    if href:  # skip None
        crawled_page.append(href)

print(crawled_page)
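
Not part of the original answer, just a sketch of a shortcut: if you only care about links, you can match the <a> tags directly and skip the .parent step. Note that string= only matches tags whose entire text is a single string, so this may find fewer links than the version above:

import re
import requests
from bs4 import BeautifulSoup

start_urls = 'https://en.wikipedia.org/wiki/Data_science'
r = requests.get(start_urls)
soup = BeautifulSoup(r.content, 'html.parser')

# matching on both the tag name and string= finds <a> tags whose own
# text contains "data", so .get('href') can be called directly
crawled_page = [a.get('href')
                for a in soup.find_all('a', string=re.compile('data'))
                if a.get('href')]
print(crawled_page)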

EDIT: something similar using lxml with xpath:

import requests
import lxml.html

start_urls = 'https://en.wikipedia.org/wiki/Data_science'

r = requests.get(start_urls)
soup = lxml.html.fromstring(r.content)

crawled_page = []
for page in soup.xpath('//*[contains(text(), "data")]'):
    href = page.attrib.get('href')
    if href:  # skip None
        crawled_page.append(href)

print(crawled_page)
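
Again my addition, not from the answer: the same idea can be tightened by restricting the XPath to <a> elements and selecting the attribute directly, so the None check disappears:

import requests
import lxml.html

start_urls = 'https://en.wikipedia.org/wiki/Data_science'
r = requests.get(start_urls)
doc = lxml.html.fromstring(r.content)

# //a[...]/@href yields the href attribute values as plain strings,
# and only for links whose text contains "data"
crawled_page = doc.xpath('//a[contains(text(), "data")]/@href')
print(crawled_page)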