I am scraping all the "a" tags on a website using Python. Within each "a" tag, I want to pick out a few words and store them

Posted: 2019-05-27 05:37:34

Tags: python, web-scraping

The link tag "a" contains text like: "Mino Games (YC W11) Is Hiring Senior Engineers in Montreal, QC (workable.com)"

I want to store "Mino Games", "Senior Engineers", "Montreal", and "workable.com" in sqlite3.

Please suggest how I can do this.

2 answers:

Answer 0: (score: 0)

Assuming you want to scrape https://news.ycombinator.com/jobs, this should work:

import re, sqlite3
import requests
from bs4 import BeautifulSoup

conn = sqlite3.connect('jobs.db')
c = conn.cursor()
c.execute('''CREATE TABLE jobs
         (company text, position text, location text, source text)''')

# company: everything before "hiring", "looking", "wants", or "is"
company_pattern = re.compile(r'(.+)(hiring|looking|wants|is )', re.IGNORECASE)
# source: the "(workable.com)"-style suffix at the end of the title
source_pattern = re.compile(r'\(([^)]+)\)$')
# location: "in <city>" or "remote"
location_pattern = re.compile(r'in (.*)|(remote)', re.IGNORECASE)
# position: whatever follows "hiring" / "looking" / "wants"
position_pattern = re.compile(r'(?:hiring|looking|wants) (.*)', re.IGNORECASE)
# strip leftover parentheses and filler words
clean_up_pattern = re.compile(r'\(([^)]+)\)| is | for | in |a ', re.IGNORECASE)

# Load the <a> nodes into `elements` here (the job-title links on the HN jobs page)
soup = BeautifulSoup(requests.get('https://news.ycombinator.com/jobs').text, 'html.parser')
elements = soup.select('.storylink')

for element in elements:
    element = element.text
    source = source_pattern.findall(element)[0].strip()
    element = element.replace('(' + source + ')', '')
    company = clean_up_pattern.sub('', company_pattern.findall(element)[0][0])
    try:
        location = location_pattern.findall(element)[0][0].strip()
    except IndexError:
        location = 'Not stated'
    element = element.replace(location, '')
    position = clean_up_pattern.sub('', position_pattern.findall(element)[0])

    c.execute("INSERT INTO jobs VALUES (company, position, location, source)")

conn.commit()
conn.close()

This parses roughly 80% of the job postings there. Adjust the regular expressions if you need to capture more.
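
To sanity-check what was stored, you can read the rows back afterwards (a minimal sketch, assuming the jobs.db file created above):

import sqlite3

conn = sqlite3.connect('jobs.db')
for company, position, location, source in conn.execute('SELECT * FROM jobs'):
    print(company, '|', position, '|', location, '|', source)
conn.close()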

Answer 1: (score: 0)

This is the code I used, but it doesn't work. I'm a beginner :(

import requests
import re
from bs4 import BeautifulSoup
import sqlite3

requests = requests.get('https://news.ycombinator.com/jobs?next=19916090')

soup = BeautifulSoup(requests.text, 'html.parser')

conn = sqlite3.connect('job6.db')
c = conn.cursor()
c.execute('''CREATE TABLE job6
         (company text, position text, location text, source real)''')


company_pattern = re.compile(r'(.+)(hiring|looking|wants|is )', re.IGNORECASE)
source_pattern = re.compile(r'\(([^)]+)\)$')
location_pattern = re.compile(r'in (.*)|(remote)', re.IGNORECASE)
position_pattern = re.compile(r'(?:hiring|looking|wants) (.*)', re.IGNORECASE)
clean_up_pattern = re.compile(r'\(([^)]+)\)| is | for | in |a ', re.IGNORECASE)

# # Load up <a> nodes into elements here

 for item in soup.select('.storylink'):
   elements = item

for element in elements:
    element = element
    source = source_pattern.findall(str(element))[0].strip()
    element = element.replace('(' + source + ')', '')
    company = clean_up_pattern.sub('', company_pattern.findall(element)[0][0])
    try:
        location = location_pattern.findall(element)[0][0].strip()
    except IndexError:
        location = 'Not stated'
    element = element.replace(location, '')
    position = clean_up_pattern.sub('', position_pattern.findall(element)[0])

c.execute("INSERT INTO job6 VALUES (company, position, location, source)")

conn.commit()
conn.close()
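
For reference, a few things keep the code above from running as intended: the stray indentation before the first for loop is a syntax error, elements = item only keeps the last link, the tag's text is never extracted (element = element does nothing), and the INSERT sits outside the loop and uses column names instead of ? placeholders. A corrected version of everything from the first for loop onward might look like this (a sketch, reusing the soup, the cursor c, and the regex patterns defined earlier in the script):

for item in soup.select('.storylink'):
    title = item.get_text()
    # pull the "(workable.com)"-style source off the end, if present
    try:
        source = source_pattern.findall(title)[0].strip()
        title = title.replace('(' + source + ')', '')
    except IndexError:
        source = 'Not stated'
    try:
        company = clean_up_pattern.sub('', company_pattern.findall(title)[0][0])
    except IndexError:
        company = 'Not stated'
    try:
        location = location_pattern.findall(title)[0][0].strip()
        title = title.replace(location, '')
    except IndexError:
        location = 'Not stated'
    try:
        position = clean_up_pattern.sub('', position_pattern.findall(title)[0])
    except IndexError:
        position = 'Not stated'
    # parameterized insert, executed once per link
    c.execute("INSERT INTO job6 VALUES (?, ?, ?, ?)",
              (company, position, location, source))

conn.commit()
conn.close()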