Question

I have a long list generated with Beautiful Soup in Python 3.

Right now, the list is generated like this:

mylist = [a['href'] for a in soup.find_all('a', href=True) if a.text]

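It's a web-scraping thing, but all you need to know is that it returns a list.

It returns the following results as a list:

'catalogue/category/books/travel_2/index.html',

'catalogue/category/books/mystery_3/index.html',

'catalogue/category/books/historical-fiction_4/index.html'

Before printing the list, I want to strip out the useless parts (such as "catalogue/", "category/", and "books/") so that only the important information (travel, mystery, or historical fiction) is shown.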

I was able to replace one thing successfully with:

mylist = [item.replace("catalogue/category/", "") for item in mylist]

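Which works great. But I don't believe .replace will accept more than two arguments, which stops me from removing other things from the results, such as "index.html". I would rather not write that line for every single thing I want to replace. That's why I tried to use a dictionary's keys and values as the .replace() arguments: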

replacedict = {"catalogue/category/": "" , "index.html": ""}

mylist = [a['href'] for a in soup.find_all('a', href=True) if a.text]

def replace_all(mylist, replacedict):
     for k, v in replacedict.items():
         mylist = [item.replace(k, v) for item in mylist]
     return mylist

replace_all(mylist, replacedict)

print(mylist)

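Right now the program runs without raising any errors. But it also doesn't do what I want at all. It just returns the same big list of results shown above, with nothing removed or replaced.

Very confused, though I'm sure the answer is right in front of me.

Thanks for any help; I couldn't find a question like this anywhere.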

Answer 1

Why not get the part of each URL you're interested in by splitting the string into a list of strings? For example:

$ python
Python 3.7.2 (default, Dec 27 2018, 07:35:06) 
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> string_list = ['catalogue/category/books/travel_2/index.html', 'catalogue/category/books/mystery_3/index.html', 'catalogue/category/books/historical-fiction_4/index.html']
>>> array_list = [s.split('/') for s in string_list]
>>> array_list
[['catalogue', 'category', 'books', 'travel_2', 'index.html'], ['catalogue', 'category', 'books', 'mystery_3', 'index.html'], ['catalogue', 'category', 'books', 'historical-fiction_4', 'index.html']]
>>> [a[3] for a in array_list]
['travel_2', 'mystery_3', 'historical-fiction_4']

If the URLs are always structured the way you've shown, that should work.
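If you would rather keep the dictionary-driven replace_all from the question, it most likely only needs its return value assigned back. The function rebinds its local name to a new list on each pass, so the caller's mylist is never modified in place. A minimal sketch, assuming the three sample hrefs from the question stand in for the scraped list (the parameter names items and replacements are mine):

```python
def replace_all(items, replacements):
    """Return a new list with every key in `replacements` replaced by its value."""
    for old, new in replacements.items():
        # This rebinds the local name only; the caller's list is left untouched.
        items = [item.replace(old, new) for item in items]
    return items


# Sample hrefs copied from the question; the real list would come from
# soup.find_all('a', href=True).
mylist = [
    "catalogue/category/books/travel_2/index.html",
    "catalogue/category/books/mystery_3/index.html",
    "catalogue/category/books/historical-fiction_4/index.html",
]
replacedict = {"catalogue/category/": "", "index.html": ""}

# Keep the returned list; calling replace_all(...) without assigning discards it.
mylist = replace_all(mylist, replacedict)
print(mylist)
# ['books/travel_2/', 'books/mystery_3/', 'books/historical-fiction_4/']
```

With the dictionary from the question, "books/" and the trailing slash are still left over, so you would need to add keys for those as well if you only want the category name.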

Answer 2

How about using a regular expression?
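One way this could look is sketched below. The pattern is an assumption based on the three sample hrefs in the question: it captures the path segment that follows "books/" and drops a trailing "_<digits>" suffix if there is one. The names hrefs and CATEGORY_RE are illustrative only.

```python
import re

# Sample hrefs copied from the question.
hrefs = [
    "catalogue/category/books/travel_2/index.html",
    "catalogue/category/books/mystery_3/index.html",
    "catalogue/category/books/historical-fiction_4/index.html",
]

# Assumed pattern: the segment after "books/", minus an optional "_<digits>" suffix.
CATEGORY_RE = re.compile(r"books/(?P<name>[^/]+?)(?:_\d+)?/")

categories = []
for href in hrefs:
    match = CATEGORY_RE.search(href)
    if match:  # skip hrefs that do not follow the assumed layout
        categories.append(match.group("name"))

print(categories)
# ['travel', 'mystery', 'historical-fiction']
```

Hrefs that do not contain a "books/<segment>/" part are simply skipped here; whether that is the right behaviour depends on what else the page links to.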


Using a dictionary to cut out parts of strings in a list
