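Cutting out part of a string variable in Python (web scraping)

Time: 2016-12-31 22:03:41

Tags: python

I am trying to scrape a website, and I managed to extract all the text I want using this template: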

nameList = bsObj.findAll("strong")
for text in nameList:
    string = text.get_text()
    if "Title" in string:
        print(text.get_text())

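I get the text in this form:

Title 1: textthatineed

Title 2: textthatineed

Title 3: textthatineed

Title 4: textthatineed

Title 5: textthatineed

Title 6: textthatineed

Title 7: textthatineed ....

Is there a way, using BeautifulSoup or anything else in Python, to cut the string so that I get only "textthatineed" without the "Title (number):" prefix?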

2 answers:

Answer 0: (score: 1)

In Python you can perform a very handy operation on strings called slicing.

Here is an example taken from the docs; in your case you would slice your strings in the same way:

>>> word = 'Python'
>>> word[0:2]  # characters from position 0 (included) to 2 (excluded)
'Py'
>>> word[2:5]  # characters from position 2 (included) to 5 (excluded)
'tho'
>>> word[:2] + word[2:]
'Python'
>>> word[:4] + word[4:]
'Python'
>>> word[:2]   # character from the beginning to position 2 (excluded)
'Py'
>>> word[4:]   # characters from position 4 (included) to the end
'on'
>>> word[-2:]  # characters from the second-last (included) to the end
'on'
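
Applied to the strings in the question, a minimal sketch might look like the following. It assumes every line starts with a prefix of the same length as "Title 1: " (a single-digit number followed by a colon and a space); the variable names are illustrative and not from the original post.

```python
# Hypothetical sample line, shaped like the output shown in the question.
line = "Title 1: textthatineed"

# Assumption: the prefix always has the same length as "Title 1: ".
prefix_length = len("Title 1: ")

# Slice from the end of the prefix to the end of the string.
text = line[prefix_length:]
print(text)
```

A fixed offset like this stops working once the number reaches two digits ("Title 10: "), so it only suits input whose prefix length is known in advance.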

Answer 1: (score: 1)

Say we have

s = 'Title 1: textthatineed'

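The title starts two characters after the colon, so we find the index of the colon, move two characters forward, and take the substring from that index to the end: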

index = s.find(':') + 2
title = s[index:]

Note that find() returns only the index of the first occurrence, so titles that themselves contain a colon are not affected.
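
One caveat worth keeping in mind: find() returns -1 when the string contains no colon, so index becomes 1 and the slice silently drops just the first character. The sketch below guards against that case by using str.partition in place of the index arithmetic. The function name strip_label and the sample lines are made up for illustration, and unlike the fixed two-character offset it strips any surrounding whitespace.

```python
def strip_label(line):
    """Return the text after the first colon, or the line unchanged if there is none."""
    # partition() splits on the first occurrence only, so later colons are kept.
    _label, sep, rest = line.partition(":")
    if not sep:
        # No colon found: leave the line as it is.
        return line
    # Remove the space after the colon (and any other surrounding whitespace).
    return rest.strip()


# Hypothetical sample lines, shaped like the output shown in the question.
lines = ["Title 1: textthatineed", "Title 10: text: with colon", "no label here"]
for line in lines:
    print(strip_label(line))
```

In the scraping loop from the question, the same function could be applied to each text.get_text() result before printing.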