Question

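This is a two-part question.

First, here is the code I am currently trying to use. I would prefer to stick with lxml rather than BeautifulSoup.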

import requests
import lxml.etree
from requests.auth import HTTPBasicAuth

r= requests.get("https://somelinkhere/folder/?parameter=abc", auth=HTTPBasicAuth('username', 'password'))

root = lxml.etree.fromstring(r.content)
results = root.findall('entry')
textnumbers = [r.find('updated').text for r in results]
print (textnumbers)

The output is just []

And here is the XML data I am working with:

<feed xmlns="http://www.w3.org/2005/Atom" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:apple-wallpapers="http://www.apple.com/ilife/wallpapers" xmlns:g-custom="http://base.google.com/cns/1.0" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:georss="http://www.georss.org/georss/" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:cc="http://web.resource.org/cc/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:g-core="http://base.google.com/ns/1.0">
  <title>Feed from some link here</title>
  <link rel="self" href="https://somelinkhere/folder/?parameter=abc" />
  <link rel="first" href="https://somelinkhere/folder/?parameter=abc" />
  <id>https://somelinkhere/folder/?parameter=abc</id>
  <updated>2018-03-06T17:48:09Z</updated>
  <dc:creator>company.com</dc:creator>
  <dc:date>2018-03-06T17:48:09Z</dc:date>
  <opensearch:totalResults>4</opensearch:totalResults>
  <opensearch:startIndex>1</opensearch:startIndex>
  <entry>
    <title>123456789</title>
    <link rel="alternate" href="https://somelink/ticket/123456789" />
    <author>
      <name>usernameHere</name>
    </author>
    <id>https://somelink/ticket/123456789</id>
    <updated>2018-02-28T13:27:33Z</updated>
    <content>short_description$$$someTextHere</content>
    <summary>some summary here</summary>
    <dc:creator>usernameHere</dc:creator>
  </entry>
  <entry>
    <title>123456799</title>
    <link rel="alternate" href="https://somelink/ticket/123456799" />
    <author>
      <name>usernameHere</name>
    </author>
    <id>https://somelink/ticket/123456799</id>
    <updated>2018-03-20113:27:33Z</updated>
    <content>short_description$$$someTextHere</content>
    <summary>some summary here</summary>
    <dc:creator>usernameHere</dc:creator>
  </entry>
</feed>

The first thing I want to do is get the dates from the <entry> -> <updated> fields.

The second part is counting the unique dates. So if I get the following dates:
2018-02-27
2018-02-27
2018-02-28
2018-03-01

my count would be 3.

The second part is only a bonus, though. I am more concerned with how to get these values out of the XML, which I can't figure out.

Answer 1

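from bs4 import BeautifulSoup

# Pass the response body (r.content), not the Response object itself
soup = BeautifulSoup(r.content, 'xml')
# Note: this also matches the feed-level <updated>, not only those inside <entry>
updates = soup.find_all('updated')
for update in updates:
    print(update.text)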

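That should work. Give it a try. If it doesn't work as-is, it should only need minor tweaking.

Edit: find() only returns the first occurrence; findAll() or find_all() returns all of them.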

Answer 2

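You need to take the declared namespaces into account when parsing the XML structure. The feed sets a default namespace (xmlns="http://www.w3.org/2005/Atom"), so every unprefixed element such as entry and updated lives in that namespace: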

root = lxml.etree.fromstring(r.content)
ns = {'ns': 'http://www.w3.org/2005/Atom'}
dates = [d.text for d in root.findall('ns:entry/ns:updated', ns)]

print(dates)

Output:

['2018-02-28T13:27:33Z', '2018-03-20113:27:33Z']
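
For the second part of the question (counting unique dates), one possible approach is to keep only the date portion of each timestamp and collect the results in a set. The following is a minimal sketch, not a definitive implementation. It assumes the timestamps start with a YYYY-MM-DD date, and it uses a trimmed, made-up sample feed (SAMPLE is hypothetical) in place of the real response. It imports the standard library's xml.etree.ElementTree so that it runs without extra packages; lxml.etree offers the same fromstring() and findall() interface, so swapping the import should work as well.

```python
import xml.etree.ElementTree as ET  # lxml.etree exposes the same interface

# Hypothetical sample feed, trimmed to the fields that matter here.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <updated>2018-03-06T17:48:09Z</updated>
  <entry><updated>2018-02-27T09:00:00Z</updated></entry>
  <entry><updated>2018-02-27T15:30:00Z</updated></entry>
  <entry><updated>2018-02-28T13:27:33Z</updated></entry>
  <entry><updated>2018-03-01T08:15:00Z</updated></entry>
</feed>"""

ns = {'ns': 'http://www.w3.org/2005/Atom'}
root = ET.fromstring(SAMPLE)

# Without the namespace nothing matches, which is why the
# code in the question printed an empty list.
plain = root.findall('entry')
print(plain)

# Namespace-aware lookup: only <updated> elements directly inside <entry>.
timestamps = [d.text for d in root.findall('ns:entry/ns:updated', ns)]

# Assumption: the first 10 characters are the YYYY-MM-DD date.
# A set drops duplicates; sorted() just makes the output stable.
unique_days = sorted({ts[:10] for ts in timestamps})

print(unique_days)
print(len(unique_days))
```

With the real feed, SAMPLE would be replaced by r.content. Slicing the first 10 characters rather than parsing with datetime is a deliberate choice here: it still works for a malformed timestamp such as the second one in the output above.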

Getting unique values from XML data
