Question

I am scraping a website, and I have managed to narrow a variable called "gender" down to:

[<span style="text-decoration: none;">
                        Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
                    </span>, <span style="text-decoration: none;">associé gérant </span>]

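Now I would like the variable to contain only "associé", but I can't find a way to split this HTML.

The reason is that I want to know whether it is "associé" (masculine) or "associée" (feminine).

Does anyone have an idea?

Cheers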

-----EDIT---- Here is my code that produces the HTML output

import requests
from bs4 import BeautifulSoup

url = "http://www.rc2.vd.ch/registres/hrcintapp-pub/companyReport.action?rcentId=5947621600000055031025&lang=FR&showHeader=false"

r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
# select_one returns only the first tag that matches a selector
table = soup.select_one("#adm").find_next("table")
table2 = soup.select_one("#adm").find_all_next("table")

# The attribute value is quoted because it contains a colon
output = table.select('td span[style^="text-decoration:"]', limit=2)  # .text.split(",", 1)[0].strip()

print(output)

Answer 1

Whatever the parent of the two elements is, you can use span:nth-of-type(2) to get the second span, and then just check its text:

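from bs4 import BeautifulSoup

html = """<span style="text-decoration: none;">
                        Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
                    </span>
           <span style="text-decoration: none;">associé gérant </span>"""

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")

# Second <span> among its siblings; strip the surrounding whitespace
text = soup.select_one("span:nth-of-type(2)").text.strip()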

Or, if it is not always the second span, you can search for the span by the partial text associé:

import re
# string= replaces the older text= argument; ur"" literals are Python 2 only
text = soup.find("span", string=re.compile(r"associé")).text
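
The same pattern idea can also answer the original masculine/feminine question directly. Below is a minimal sketch using only the standard library, assuming the role text contains either "associé" or "associée" as a whole word. The helper name `associe_gender` and the "male"/"female" labels are made up for illustration and do not come from the question:

```python
import re
import unicodedata

# Matches "associé" or "associée" as a whole word; the optional trailing "e"
# is captured so the two forms can be told apart.
ASSOCIE_RE = re.compile(r"\bassocié(e?)\b", re.IGNORECASE)


def associe_gender(text):
    """Return "female", "male", or None if neither form occurs (hypothetical helper)."""
    # Normalise to NFC so an "e" followed by a combining accent also matches.
    normalized = unicodedata.normalize("NFC", text)
    match = ASSOCIE_RE.search(normalized)
    if match is None:
        return None
    return "female" if match.group(1) else "male"


print(associe_gender("associé gérant "))   # male
print(associe_gender("associée gérante"))  # female
```

You would pass it the text of the matched span. A plural such as "associés" deliberately returns None, because the pattern requires a word boundary right after the captured ending.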

For your edit, you just need to take the text of the last element and use .split(None, 1)[1] to get the gender:

# The attribute value is quoted because it contains a colon
text = table.select('td span[style^="text-decoration:"]', limit=2)[-1].text
gender = text.split(None, 1)[1].strip()  # e.g. "gérant"
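
Note that indexing with [1] raises an IndexError when the text holds only a single word. Here is a hedged sketch of a more defensive split, where `split_role` is a hypothetical helper name and the sample strings are modelled on the output shown in the question:

```python
def split_role(text):
    """Split role text into (first_word, rest); rest is "" when there is only one word."""
    parts = text.split(None, 1)  # split on any run of whitespace, at most once
    first = parts[0] if parts else ""
    rest = parts[1].strip() if len(parts) > 1 else ""
    return first, rest


print(split_role("associé gérant "))  # ('associé', 'gérant')
print(split_role("associée"))         # ('associée', '')
```

The first element is the word the question asks for ("associé" or "associée"), and the second is whatever follows it, with surrounding whitespace removed.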

Web scraping: splitting HTML content
