Question

I am trying to scrape a website, and I managed to extract all the text I want using this template:

nameList = bsObj.findAll("strong")
for text in nameList:
    string = text.get_text()
    if "Title" in string:
        print(text.get_text())

I get the text in this form:

Title 1: textthatineed

Title 2: textthatineed

Title 3: textthatineed

Title 4: textthatineed

Title 5: textthatineed

Title 6: textthatineed

Title 7: textthatineed ....

Is there a way, using BeautifulSoup or anything else in Python, to cut the string so that I get only "textthatineed" without the "Title (number): " prefix?

Answer 1

Python strings support a very handy operation called slicing.

Here is an example taken from the docs; in your case you would do something along the same lines:

>>> word = 'Python'
>>> word[0:2]  # characters from position 0 (included) to 2 (excluded)
'Py'
>>> word[2:5]  # characters from position 2 (included) to 5 (excluded)
'tho'
>>> word[:2] + word[2:]
'Python'
>>> word[:4] + word[4:]
'Python'
>>> word[:2]   # character from the beginning to position 2 (excluded)
'Py'
>>> word[4:]   # characters from position 4 (included) to the end
'on'
>>> word[-2:]  # characters from the second-last (included) to the end
'on'
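
Applied to the strings from the question, slicing might look like the sketch below. It assumes every line starts with a label of exactly the same length as "Title 1: " (the names `line` and `prefix_length` are only illustrative), so it would cut in the wrong place for something like "Title 10: ...".

```python
line = "Title 1: textthatineed"

# "Title 1: " is 9 characters long, so the wanted text starts at index 9.
prefix_length = len("Title 1: ")

# Slice from the end of the label to the end of the string.
text = line[prefix_length:]
print(text)
```

Because the offset is fixed, this only suits input whose label width never changes; searching for the colon is safer when the number can grow.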

Answer 2

Say we have:

s = 'Title 1: textthatineed'

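The title starts two characters after the colon, so we find the index of the colon, move forward two characters, and take the substring from that index to the end: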

index = s.find(':') + 2
title = s[index:]

Note that find() returns only the index of the first occurrence, so titles that themselves contain a colon are not affected.
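
One caveat with the find() approach: find() returns -1 when there is no colon, so the index becomes 1 and the slice silently drops the first character. A possible variant, sketched here with str.partition(), leaves such lines alone. The helper name `strip_label` and the sample list `scraped` are hypothetical and stand in for the strings that get_text() returns in the question's loop.

```python
def strip_label(line):
    """Return the text after the first colon, or the whole line if no colon is found."""
    _label, sep, rest = line.partition(":")
    # partition() gives an empty separator when ":" is missing,
    # so we can tell "no label" apart from "label with empty text".
    return rest.strip() if sep else line.strip()


# Stand-ins for the strings scraped from the <strong> tags
scraped = ["Title 1: textthatineed", "Title 12: text: with colon", "no label here"]
cleaned = [strip_label(s) for s in scraped]
print(cleaned)
```

Like find(), partition() splits only at the first colon, so any later colons stay in the text.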

Cutting part of a string variable in Python (web scraping)
