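Problem using a regex on href attributes with BeautifulSoup

Asked: 2016-11-22 23:54:05

Tags: regex python-3.x web-scraping beautifulsoup

I'm trying to extract the text from tags whose href contains a particular string. Here is part of my sample code: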

Experience = soup.find_all(id='background-experience-container')

Exp = {}

for element in Experience:
    Exp['Experience'] = {}


for element in Experience:
    role = element.find(href=re.compile("title").get_text()
    Exp['Experience']["Role"] = role


for element in Experience:
    company = element.find(href=re.compile("exp-company-name").get_text()
    Exp['Experience']['Company'] = company

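It doesn't like how I define Exp['outer_key']['inner_key'] = value; that syntax returns a SyntaxError.

I'm trying to build a dict of dicts that holds the role and company information. I also plan to pick up the dates for each entry, but I haven't got that far yet.

Can anyone spot any obvious mistakes in my code?

Any help is much appreciated!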

1 answer:

Answer 0: (score: 1)

find_all can return multiple values (even when you search by id), so it is better to keep all of them in a list: Exp = []

Experience = soup.find_all(id='background-experience-container')

# create empty list
Exp = []

for element in Experience:
    # create empty dictionary
    dic = {}

    # add elements to dictionary
    dic['Role'] = element.find(href=re.compile("title")).get_text()
    dic['Company'] = element.find(href=re.compile("exp-company-name")).get_text()

    # add dictionary to list
    Exp.append(dic)

# display

print(Exp[0]['Role'])
print(Exp[0]['Company'])

print(Exp[1]['Role'])
print(Exp[1]['Company'])

# or

for x in Exp:
    print(x['Role'])
    print(x['Company'])
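For reference, the SyntaxError in the question does not come from the nested-key assignment itself. In `element.find(href=re.compile("title").get_text()` the final parenthesis closes `get_text(` but nothing closes `element.find(`, and Python typically reports an unclosed parenthesis on the line that follows, which here is the assignment. The sketch below is a self-contained version of the list approach with the parentheses balanced. The HTML string and the helper name `text_of_link` are made up for illustration (the real page markup is not shown in the question), and the `None` check is an added precaution, not part of the original answer.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example; the real page may differ.
# The id is deliberately repeated to show that find_all can return several tags.
html = """
<div id="background-experience-container">
  <a href="/search?title=Engineer">Engineer</a>
  <a href="/company/exp-company-name-1">Acme</a>
</div>
<div id="background-experience-container">
  <a href="/search?title=Analyst">Analyst</a>
  <a href="/company/exp-company-name-2">Globex</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def text_of_link(element, pattern):
    """Return the text of the first tag whose href matches pattern, or None."""
    tag = element.find(href=re.compile(pattern))
    # find() returns None when nothing matches, so guard before get_text()
    return tag.get_text(strip=True) if tag is not None else None


Exp = []
for element in soup.find_all(id="background-experience-container"):
    Exp.append({
        "Role": text_of_link(element, "title"),
        "Company": text_of_link(element, "exp-company-name"),
    })

for x in Exp:
    print(x["Role"], x["Company"])
```

Without the guard, a container that lacks a matching link would raise AttributeError on `None.get_text()`.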

If you are sure that find_all gives you only one element (and you need the key 'Experience'), then you can do this:

Experience = soup.find_all(id='background-experience-container')

# create main dictionary
Exp = {}

for element in Experience:
    # create empty dictionary
    dic = {}

    # add elements to dictionary
    dic['Role'] = element.find(href=re.compile("title")).get_text()
    dic['Company'] = element.find(href=re.compile("exp-company-name")).get_text()

    # add dictionary to main dictionary
    Exp['Experience'] = dic

# display

print(Exp['Experience']['Role'])
print(Exp['Experience']['Company'])

Or, more compactly, build the inner dictionary as a single literal:

Experience = soup.find_all(id='background-experience-container')

# create main dictionary
Exp = {}

for element in Experience:
    Exp['Experience'] = {
        'Role': element.find(href=re.compile("title")).get_text(),
        'Company': element.find(href=re.compile("exp-company-name")).get_text()
    }

# display

print(Exp['Experience']['Role'])
print(Exp['Experience']['Company'])
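When exactly one container is expected, a possible alternative (not part of the original answer) is `soup.find`, which returns the first matching tag or `None` and so avoids the loop. A minimal sketch, again using invented HTML as a stand-in for the real page:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example; the real page may differ.
html = """
<div id="background-experience-container">
  <a href="/search?title=Engineer">Engineer</a>
  <a href="/company/exp-company-name-1">Acme</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns a single tag (the first match) or None
container = soup.find(id="background-experience-container")

Exp = {}
if container is not None:
    role = container.find(href=re.compile("title"))
    company = container.find(href=re.compile("exp-company-name"))
    Exp["Experience"] = {
        "Role": role.get_text(strip=True) if role is not None else None,
        "Company": company.get_text(strip=True) if company is not None else None,
    }

print(Exp)
```

If the container is missing, `Exp` simply stays empty, so later code should check for the 'Experience' key before reading it.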