I have scraped a website that contains a table, and I want to clean up the headers to get the final version I need.
headers = []
for row in table.findAll('tr'):
    for item in row.findAll('th'):
        for link in item.findAll('a', text=True):
            headers.append(link.contents[0])
print(headers)
This returns:
[u'Rank ', u'University Name ', u'Entry Standards', u'Click here to read more', u'Student Satisfaction', u'Click here to read more', u'Research Quality', u'Click here to read more', u'Graduate Prospects', u'Click here to read more', u'Overall Score', u'Click here to read more', u'\r\n 2016\r\n ']
I don't want the 'Click here to read more' or '2016' headers, so I did the following:
for idx, i in enumerate(headers):
    if 'Click' in i:
        del headers[idx]
for idx, i in enumerate(headers):
    if '2016' in i:
        del headers[idx]
This returns:
[u'Rank ', u'University Name ', u'Entry Standards', u'Student Satisfaction', u'Research Quality', u'Graduate Prospects', u'Overall Score']
Perfect. But is there a better or more concise way to remove the unwanted items? Thanks!
Answer 0 (score: 3)
headers = filter(lambda h: 'Click' not in h and '2016' not in h, headers)
If you want something more generic:
banned = ['Click', '2016']
headers = filter(lambda h: not any(b in h for b in banned), headers)
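One caveat, assuming the code might also run on Python 3: there `filter` returns a lazy iterator rather than a list, so wrapping the call in `list()` gives the same result on both versions. A minimal sketch using a shortened copy of the headers from the question; `remove_banned` is a hypothetical helper name, not something from the original code.

```python
def remove_banned(headers, banned):
    """Return a new list without any header that contains a banned substring."""
    # list() forces evaluation, since filter() is lazy on Python 3.
    return list(filter(lambda h: not any(b in h for b in banned), headers))


# Shortened sample of the scraped headers from the question.
headers = [u'Rank ', u'University Name ', u'Entry Standards',
           u'Click here to read more', u'\r\n 2016\r\n ']
cleaned = remove_banned(headers, ['Click', '2016'])
print(cleaned)
```

Keeping the unwanted substrings in a list means a new one can be added without touching the filtering logic.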
Answer 1 (score: 2)
You could use a list comprehension to build a new, filtered list, for example:
new_headers = [header for header in headers if '2016' not in header]
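The line above drops only the '2016' entry. Here is a sketch that handles both unwanted strings in one comprehension. Building a new list also avoids a weakness of deleting inside an `enumerate` loop: after each `del` the next element shifts into the deleted slot and is skipped, so one of two adjacent 'Click' entries would survive. The sample list is made up for illustration, and `unwanted` is a name introduced here.

```python
# Made-up sample with two adjacent unwanted entries.
headers = [u'Rank ', u'Click here to read more', u'Click here to read more',
           u'Overall Score', u'\r\n 2016\r\n ']

# Deleting while iterating: the item right after each deletion is skipped.
in_place = list(headers)
for idx, item in enumerate(in_place):
    if 'Click' in item:
        del in_place[idx]

# A comprehension checks every element and leaves the original list untouched.
unwanted = ('Click', '2016')
new_headers = [h for h in headers if not any(bad in h for bad in unwanted)]

print(in_place)     # still contains one 'Click here to read more'
print(new_headers)
```

The loop in the question happens to work only because the 'Click' entries in that table are never next to each other.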
Answer 2 (score: 1)
import re
pattern = '^Click|^2016'
# strip() first, because the '2016' header is surrounded by whitespace
new = [x for x in headers if not re.match(pattern, str(x).strip())]
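A self-contained variant of the same idea, assuming every header is a text string. It compiles the pattern once and relies on `re.match` matching only at the start of the string, so the `^` anchors are not strictly needed. The `strip()` call matters because the '2016' header begins with `\r\n` and spaces. The sample list is shortened from the question, and `unwanted` is a name introduced here.

```python
import re

# Shortened sample of the scraped headers from the question.
headers = [u'Rank ', u'Entry Standards', u'Click here to read more',
           u'\r\n 2016\r\n ']

# re.match anchors at the start of the string, so no '^' is required.
unwanted = re.compile(r'Click|2016')

# Strip surrounding whitespace before matching; keep the original value.
new = [x for x in headers if not unwanted.match(x.strip())]
print(new)
```

Unlike the substring checks in the other answers, this only removes headers that start with one of the unwanted words.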
Answer 3 (score: 1)
If you are sure that '2016' will always be the last item:
>>> [x for x in headers[:-1] if 'Click here' not in x]
['Rank ', 'University Name ', 'Entry Standards', 'Student Satisfaction', 'Research Quality', 'Graduate Prospects', 'Overall Score']