Question

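I'm using Beautiful Soup and requests to scrape information from a web page. I'm trying to get a list of book titles that contains only the titles themselves, without the text title= in front of each one.

Example: text = 'a bunch of junk title=book1 more junk text title=book2'

What I get is titleList = ['title=book1', 'title=book2']

What I want is titleList = ['book1', 'book2']

I have tried matching groups, which does split title= from book1, but I'm not sure how to append group(2) to the list.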

titleList = []

def getTitle(productUrl):

  res = requests.get(productUrl, headers=headers)
  res.raise_for_status()

  soup = bs4.BeautifulSoup(res.text, 'lxml')
  title = re.compile(r'title=[A-Za-z0-9]+')
  findTitle = title.findall(res.text.strip())
  titleList.append(findTitle)

Answer 1

Your regex has no capture group. Also note that findall returns a list, so you should use extend rather than append (unless you want titleList to be a list of lists).

The relevant lines, corrected:

title = re.compile(r'title=([A-Za-z0-9]+)')   # note the parentheses (capture group)
findTitle = title.findall(res.text.strip())
titleList.extend(findTitle)   # extend, not append, to avoid a nested list
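
To try this without making a network request, here is a minimal sketch that uses the sample string from the question in place of res.text. The names text, found, appended and extended are stand-ins chosen for illustration, not part of the original code. It shows how append and extend differ:

```python
import re

# Stand-in for res.text, taken from the question's sample string
text = 'a bunch of junk title=book1 more junk text title=book2'

title = re.compile(r'title=([A-Za-z0-9]+)')
# With exactly one capture group, findall returns the group contents only
found = title.findall(text)

appended = []
appended.append(found)   # adds the whole list as a single nested element

extended = []
extended.extend(found)   # adds each matched title individually

print(appended)  # [['book1', 'book2']]
print(extended)  # ['book1', 'book2']
```

If getTitle is called once per product page, extend keeps titleList flat across calls, whereas append would produce one nested list per page.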

Answer 2

Just use re.findall with a capturing group:

>>> import re
>>> text = 'a bunch of junk title=book1 more junk text title=book2'
>>> re.findall(r'title=(\S+)', text)
['book1', 'book2']
>>>
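
If you would rather work with match objects, as in the question's attempt with groups, a rough equivalent is re.finditer together with match.group(1). This is only a sketch on the same sample text. Note that the captured title is group 1, not group 2, because the pattern has a single capture group:

```python
import re

text = 'a bunch of junk title=book1 more junk text title=book2'

titleList = []
for match in re.finditer(r'title=(\S+)', text):
    # group(0) is the whole match ('title=book1'); group(1) is the captured title
    titleList.append(match.group(1))

print(titleList)  # ['book1', 'book2']
```

Here append is the right call, since each iteration adds one string rather than a list.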

Python regex match, but don't include characters (Beautiful Soup)
