Web scraping Strava with Python and lxml

Time: 2020-04-23 11:02:36

Tags: python web-scraping lxml

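I want to get club activities from Strava. I originally planned to use the API and C# (because that's what I know), but the API doesn't provide enough information, so I turned to the technique described here (https://twitter.com/OleksMaistrenko/status/1252251408495190018). It's a great resource and got me 90% of the way there. I'm now trying to pull more information out of the HTML, and as a complete Python / lxml novice I don't know how to do it.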

So, to get the activity pace, given the following HTML:

   <li title="Pace">
      "7:46"
      <abbr class="unit" title="minutes per mile"> /mi</abbr>
   </li>

the following code scrapes it:

activity_pace = activity.xpath(".//li[@title='Pace']")[0].text.strip()

Q1. So how do I scrape this HTML to get the activity duration?

<li title="Time">
   "56"
   <abbr class="unit" title="minute">m</abbr>
    " 26"
   <abbr class="unit" title="second">s</abbr>
</li>

I tried this, but it only gets the minutes:

activity_time = activity.xpath(".//li[@title='Time']")[0].text
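
Some background on why only the minutes come back: lxml follows the ElementTree model, where an element's `.text` holds just the text before its first child element, and any text that follows a child is stored on that child's `.tail`. The sketch below illustrates this on the `<li>` from above. The `<ul>` wrapper and the variable names are assumptions added for illustration only.

```python
import lxml.html

# The <li title="Time"> markup from the question, wrapped in a <ul>
# (the wrapper is an assumption so the fragment parses as a list).
snippet = """<ul>
<li title="Time">
   "56"
   <abbr class="unit" title="minute">m</abbr>
    " 26"
   <abbr class="unit" title="second">s</abbr>
</li>
</ul>"""

doc = lxml.html.fromstring(snippet)
li = doc.xpath(".//li[@title='Time']")[0]

# .text is only the text that precedes the first child element.
text_before_first_child = li.text.strip()

# Text that follows each <abbr> is stored on that child's .tail,
# so it never shows up in li.text. Whitespace-only tails are skipped.
tails = [child.tail.strip() for child in li if child.tail and child.tail.strip()]

print(text_before_first_child)
print(tails)
```

So the seconds are not missing from the tree; they sit on the `.tail` of the first `<abbr>`.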

Q2. I'd like to get the activity title (in this case "Morning Run"). Here is the HTML:

<h3 class="entry-title activity-title" str-on="click" str-trackable-id="ChQIBTIQCIGRyLgMGAEwLDgAQABIARIECgIIBA==" str-type="self">
  <div class="entry-type-icon"><span class="app-icon-wrapper  "><span class="app-icon icon-run icon-dark icon-lg"></span></span></div>
  <strong>
  <a href="/activities/3339847809">Morning Run</a>
  </strong>
</h3>

I've worked out that I can get to the element with:

activity.xpath(".//h3[@class='entry-title activity-title']")[0]

But I'm stuck after that :-(

1 Answer:

Answer 0: (score: 2)

It's not very elegant, but it can be done as follows. Assuming your HTML looks like this:

activity = """
<doc>
  <h3 class="entry-title activity-title" str-on="click" str-trackable-id="ChQIBTIQCIGRyLgMGAEwLDgAQABIARIECgIIBA==" str-type="self">
  <div class="entry-type-icon"><span class="app-icon-wrapper  "><span class="app-icon icon-run icon-dark icon-lg"></span></span></div>
  <strong>
  <a href="/activities/3339847809">Morning Run</a>
  </strong>
</h3>
<li title="Time">
   "56"
   <abbr class="unit" title="minute">m</abbr>
    " 26"
   <abbr class="unit" title="second">s</abbr>
</li>
</doc>"""

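import lxml.html

# Parse the sample markup into an element tree
doc = lxml.html.fromstring(activity)

# Title: text of the link inside the <h3>; duration: the <li title="Time"> element
sports = doc.xpath("//h3[@class='entry-title activity-title']//a/text()")
duration = doc.xpath('//li[@title="Time"]')
abbrs = doc.xpath('//abbr[@class="unit"]')

# Blank out the unit labels ("m", "s") so only the numbers remain
for abbr in abbrs:
    abbr.text = ''
for sport in sports:
    print(sport)
# Collapse whitespace and turn the adjacent quote pair into a colon
for d in duration:
    print(d.text_content().strip().replace('\n', '').replace(' ', '').replace('""', ':'))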

Output:

Morning Run
"56:26"