Using BeautifulSoup

Date: 2019-07-13 19:03:30

Tags: python html beautifulsoup

I am trying to extract several pieces of information from each repeated tag in an HTML file.

...

<div class="title">
    <a target="_blank" id="jl_fe575975c912af9e" href="https://www.indeed.com/company/Nestvestor/jobs/Data-Science-Intern-fe575975c912af9e?fccid=8eed076a625928e7&amp;vjs=3" onmousedown="return rclk(this,jobmap[0],0);" onclick=" setRefineByCookie(['radius']); return rclk(this,jobmap[0],true,0);" rel="noopener nofollow" title="Data Science Intern" class="jobtitle turnstileLink " data-tn-element="jobTitle">
        Data Science Intern</a>

    </div>

<div class="sjcl">
    <div>
<span class="company">
    Nestvestor</span>

</div>
<div class="jobsearch-SerpJobCard unifiedRow row result clickcard" id="p_9cfaca3374641aa0" data-jk="9cfaca3374641aa0" data-tn-component="organicJob">

<div class="title">
    <a target="_blank" id="jl_9cfaca3374641aa0" href="https://www.indeed.com/rc/clk?jk=9cfaca3374641aa0&amp;fccid=1779658d5b4ae2b0&amp;vjs=3" onmousedown="return rclk(this,jobmap[1],0);" onclick=" setRefineByCookie(['radius']); return rclk(this,jobmap[1],true,0);" rel="noopener nofollow" title="Product Manager" class="jobtitle turnstileLink " data-tn-element="jobTitle">
        Product Manager</a>

    </div>

<div class="sjcl">
    <div>
<span class="company">
    <a data-tn-element="companyName" class="turnstileLink" target="_blank" href="https://www.indeed.com/cmp/Sojern" onmousedown="this.href = appendParamsOnce(this.href, 'from=SERP&amp;campaignid=serp-linkcompanyname&amp;fromjk=9cfaca3374641aa0&amp;jcid=1779658d5b4ae2b0')" rel="noopener">
    Sojern</a></span>

...

from bs4 import BeautifulSoup

soup = BeautifulSoup(open(input("Enter a file to read: ")), "html.parser")


title = soup.find_all('div', class_='title')
for span in title:
    print(span.text)

company = soup.find_all('span', class_='company')
for span in company:
    print(span.text)

So far I have figured out how to get the following result:

Job_Title1

Job_Title2

Job_Title3

And, from the separate loop, this result:

Company_name1

Company_Name2

Company_Name3

How can I get the result to look like this from a single pass of code:
Job_Title1,Company_Name1,
Job_Title2,Company_Name2,
Job_Title3,Company_Name3,

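2 answers:

Answer 0: (score: 0)

From what you have, it looks like you need a nested loop. It is hard to say without the site, but I would try something like this.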

    company = soup.find_all('span', class_='company')
    title = soup.find_all('div', class_='title')
    # Note: nesting the loops prints every company next to every title.
    for span in title:
        for x in company:
            print(x.text, span.text)
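
If the goal is one title,company line per listing rather than every combination, pairing the two result lists by position may be closer to what the question asks for. The following is only a minimal sketch: SAMPLE_HTML is a hypothetical, trimmed-down version of the markup in the question, pair_titles_with_companies is an illustrative helper name, and it assumes every listing contains exactly one title div and one company span in the same order.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup modeled on the snippet in the question.
SAMPLE_HTML = """
<div class="title"><a class="jobtitle">
    Data Science Intern</a></div>
<div class="sjcl"><div><span class="company">
    Nestvestor</span></div></div>
<div class="title"><a class="jobtitle">
    Product Manager</a></div>
<div class="sjcl"><div><span class="company">
    <a class="turnstileLink">Sojern</a></span></div></div>
"""


def pair_titles_with_companies(html):
    """Return (title, company) tuples, pairing elements by document order."""
    soup = BeautifulSoup(html, "html.parser")
    titles = soup.find_all("div", class_="title")
    companies = soup.find_all("span", class_="company")
    # zip() pairs by position and stops at the shorter list; this assumes
    # every listing has exactly one title and one company.
    return [
        (t.get_text(strip=True), c.get_text(strip=True))
        for t, c in zip(titles, companies)
    ]


for title, company in pair_titles_with_companies(SAMPLE_HTML):
    print(f"{title},{company},")
```

If some listings can lack a company, positional pairing would misalign the later rows; in that case it may be safer to loop over each listing's container element and look up the title and company inside it.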

Answer 1: (score: 0)

Welcome to Stack Overflow. Just use the following approach:

a = [{1: 1, 2: 2, 3: 3, 4: 4, 5: 5}]
toChange = [[1, 10], [4, 76]]  # 1 and 4 are the existing keys, and 10 and 76 are the new keys that replace them
for i, n in enumerate(a):
    if i == 0:
        for change in toChange:
            try:
                oldValue = a[0][change[0]]
                del a[0][change[0]]
                a[0][change[1]] = oldValue
            except KeyError:
                # This likely means you tried to replace a key that isn't in there
                pass  # handle it here
print(a)