I'm trying to scrape a website, and I managed to extract all the text I want using this template:
nameList = bsObj.findAll("strong")
for text in nameList:
    string = text.get_text()
    if "Title" in string:
        print(string)
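The text comes out like this:
Title 1: textthatineed
Title 2: textthatineed
Title 3: textthatineed
Title 4: textthatineed
Title 5: textthatineed
Title 6: textthatineed
Title 7: textthatineed ....
Is there a way, using BeautifulSoup or anything else in Python, to cut the string and get only "textthatineed" without the "Title (number):" part?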
Answer 0 (score: 1)
In Python there is a very handy operation you can perform on strings, called slicing.
The following example is taken from the docs; in your case you would do something similar:
>>> word = 'Python'
>>> word[0:2] # characters from position 0 (included) to 2 (excluded)
'Py'
>>> word[2:5] # characters from position 2 (included) to 5 (excluded)
'tho'
>>> word[:2] + word[2:]
'Python'
>>> word[:4] + word[4:]
'Python'
>>> word[:2] # character from the beginning to position 2 (excluded)
'Py'
>>> word[4:] # characters from position 4 (included) to the end
'on'
>>> word[-2:] # characters from the second-last (included) to the end
'on'
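Applied to the strings from the question, a minimal sketch might look like the following. It assumes every string has exactly the form 'Title N: text' with a one-digit N, so the prefix is always the same length; the names `lines`, `prefix_length` and `texts` are made up for illustration.

```python
# Assumption: each string looks exactly like 'Title N: text' with a one-digit N.
lines = ['Title 1: textthatineed', 'Title 2: textthatineed']

# 'Title 1: ' is 9 characters long, so the wanted text starts at index 9.
prefix_length = len('Title 1: ')

# Slice off the fixed-length prefix from every string.
texts = [line[prefix_length:] for line in lines]
print(texts)
```

Keep in mind that a fixed offset like this stops working once the number has two digits (for example 'Title 10: ...'), because the prefix is then one character longer.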
Answer 1 (score: 1)
Say we have
s = 'Title 1: textthatineed'
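The title starts two characters after the colon, so we find the index of the colon, move two characters further along, and take the substring from that index to the end: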
index = s.find(':') + 2
title = s[index:]
Note that find() only returns the index of the first occurrence, so titles that themselves contain a colon are not affected.
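One thing the snippet above does not cover is a string with no colon at all: find() then returns -1, and s[-1 + 2:] quietly drops the first character. A rough sketch of a helper that guards against this is below. The name `strip_label` is hypothetical, and it uses lstrip() instead of the fixed "+ 2" offset, so it also copes with a missing space after the colon; that is a variation on the approach above, not the same code.

```python
def strip_label(s):
    """Return the text after the first colon in s, or s unchanged if there is none."""
    index = s.find(':')
    if index == -1:
        # No colon found: return the string as-is instead of slicing from index 1.
        return s
    # Skip the colon itself, then drop any whitespace that follows it.
    return s[index + 1:].lstrip()


print(strip_label('Title 1: textthatineed'))
print(strip_label('Title 2: text: with a colon'))
print(strip_label('no label here'))
```

Because only the first colon is used as the split point, any later colons stay part of the returned text.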