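I'm putting something together that uses BeautifulSoup's get_text to pull clean text from websites. In the past I've found it often brings back things I don't need, so I've been trying to make it as clean as possible. My problem is that I get some blank values in what is returned. My code is below: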
def GetPageText():
    for page in GetTeamLinks():
        headers = {'User-Agent': 'Mozilla/5.0'}  # some websites look for these sorts of headers to make sure you're not a bot
        response = requests.get(page, verify=False, headers=headers)  # go to each of the websites in the domain list
        soup = BeautifulSoup(response.text, "html.parser")  # sets "soup" as their variable name
        for script in soup(["script", "style", "a", "nav", "footer"]):  # find everything in the listed tags
            script.extract()  # rip it out
        full_text = str(soup.get_text().splitlines()).strip()  # set the variable 'full_text' as the text we get back
        return(full_text)
What comes back looks like this (this is an example from scraping https://www.nutmeg.com/about/executive-team):
['', '', '', '', '', '', '', '', '', 'Executive team | Nutmeg - Nutmeg',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '
Executive team', '', ' ', '', '', '', 'The Nutmeg executive team', '', '',
'', '', '', '', '', '', '', '', 'Martin Stead ', 'Chief executive officer',
'', 'Martin joined Nutmeg in 2015. He has a range of experience running and
jointly-running...........]
I want to get rid of the '', '', '', '', values.
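I tried treating full_text as a list, then going through that list and removing every value shorter than 2 characters. However, that doesn't seem to work with my for statement, because it doesn't recognise full_text.
Any help is much appreciated. I've searched but haven't been able to find an answer. If there is something similar on here already, please point me in the right direction.
Many thanks
Rob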
Answer 0 (score: 0)
I hope I've understood your question. You can use a list comprehension to remove the empty values:
my_list = ['', '', '', 'Executive team | Nutmeg - Nutmeg']
new_list = [i for i in my_list if i != '']
print(new_list)
I don't know what you want to do with the data afterwards, but it seems easier to scrape specifically for the data you know you need.
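One thing worth noting about the code in the question: wrapping splitlines() in str() turns the list into one long string, which is probably why full_text can't be looped over as a list afterwards. Below is a minimal sketch of how the filtering could look if the list is kept as a list. The helper name clean_lines and the sample_text string are made up for illustration; sample_text stands in for whatever soup.get_text() returns. It strips each line first, so whitespace-only entries such as ' ' are dropped too, which a plain `i != ''` check would miss.

```python
def clean_lines(raw_text):
    """Split text into lines, trim each one, and drop the blank ones."""
    stripped = (line.strip() for line in raw_text.splitlines())
    return [line for line in stripped if line]


# Hypothetical stand-in for soup.get_text(); the real page text would go here.
sample_text = (
    "\n\n\nExecutive team | Nutmeg - Nutmeg\n\n \n\n"
    "Martin Stead \nChief executive officer\n"
)

lines = clean_lines(sample_text)
print(lines)
```

In the original function this would mean returning something like clean_lines(soup.get_text()) instead of the str(...) expression, so the caller gets a real list back.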