How do I get an email address from li and span tags using Scrapy in Python?

Time: 2017-08-28 11:46:12

Tags: python-2.7 scrapy

Here is the HTML I am working with:

<div class="rendering rendering_person rendering_short rendering_person_short">
  <h3 class="title"><a rel="Person" href="https://moh-it.pure.elsevier.com/en/persons/paola-alberti" class="link person"><span>Paola Alberti</span></a></h3>
  <ul class="relations email">
    <li class="email"><a href="mailto:paola.alberti@istitutotumori.mi.it" class="link"><span>paola.alberti@istitutotumori.mi.it</span></a></li>
  </ul>
  <ul class="relations organisations">
    <li><a rel="Organisation" href="https://moh-it.pure.elsevier.com/en/organisations/fondazione-irccs-istituto-nazionale-dei-tumori" class="link organisation"><span>Fondazione IRCCS Istituto Nazionale dei Tumori</span></a></li>
  </ul>
  <p class="type"><span class="family">Person: </span>Academic</p>
</div>

How can I get the email address from the span tag above?

<span>paola.alberti@istitutotumori.mi.it</span>

1 Answer:

Answer 0 (score: 1)

You can use XPath for this.
