Question

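I'm cobbling something together and trying to use beautifulsoup's get_text to pull clean text from websites. In the past I've found that it often brings back things I don't need, so I've been trying to make the output as clean as possible. My problem is that the returned content contains a lot of blank values. My code is as follows: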

def GetPageText():
    for page in GetTeamLinks():
        headers = {'User-Agent': 'Mozilla/5.0'} # some websites look for these sorts of headers to make sure you're not a bot
        response = requests.get(page, verify=False, headers=headers) ##go to each of the websites in the domain list
        soup = BeautifulSoup(response.text, "html.parser") # sets "soup" as their variable name
        for script in soup(["script", "style","a","nav", "footer"]): #find everything in the script or style tags
            script.extract()    # rip it out
        full_text = str(soup.get_text().splitlines()).strip() #set the variable 'full_text' as the text we get back
    return(full_text)

What comes back looks like this (this example is from scraping https://www.nutmeg.com/about/executive-team):

['', '', '', '', '', '', '', '', '', 'Executive team | Nutmeg - Nutmeg', 
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '      
Executive team', '', '  ', '', '', '', 'The Nutmeg executive team', '', '', 
'', '', '', '', '', '', '', '', 'Martin Stead ', 'Chief executive officer', 
'', 'Martin joined Nutmeg in 2015. He has a range of experience running and 
jointly-running...........]

I want to get rid of the

 '', '', '', '',

values.

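I tried treating full_text as a list, then looping over it and removing every value shorter than 2 characters. However, that doesn't seem to work with my for statement, because it doesn't recognise full_text.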

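Any help is greatly appreciated. I have searched but haven't been able to find an answer. If there is something similar on here, please point me in its direction.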

Many thanks

Rob

Answer 1

I hope I have understood your question. You can use a list comprehension to eliminate the empty values:

my_list = ['', '', '', 'Executive team | Nutmeg - Nutmeg']

new_list = [i for i in my_list if i != '']

print(new_list)
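
Note that the sample output in the question also contains entries that are only whitespace (such as '  ') and entries with leading spaces, which the `i != ''` test would keep. Here is a minimal sketch that drops those too and trims the rest. It assumes the lines are held in a real list: the `str(...)` call in the question turns the list into one long string, which may be why looping over full_text did not behave as expected. The names `raw_lines` and `clean_lines` are made up for this illustration.

```python
# Hypothetical sample modelled on the output shown in the question.
raw_lines = ['', '', '      Executive team', '', '  ',
             'Martin Stead ', 'Chief executive officer']

# str.strip() returns '' for empty and whitespace-only strings, and '' is
# falsy, so the condition filters out both kinds; kept entries are trimmed.
clean_lines = [line.strip() for line in raw_lines if line.strip()]

print(clean_lines)
```

In the question's function, this would mean keeping `soup.get_text().splitlines()` as a list rather than wrapping it in `str(...)`.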

I don't know what you want to do with the data afterwards, but it seems easier to scrape specifically the data you are after.
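
As a rough sketch of what targeted scraping could look like: BeautifulSoup's `stripped_strings` generator yields each text fragment with surrounding whitespace removed and skips fragments that are whitespace only, and `get_text()` accepts `separator` and `strip` arguments with the same effect. The HTML snippet and the `team` class name below are assumptions made up for illustration; the real page's markup will differ.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the downloaded page (not the real markup).
html = """
<html><head><title>Executive team</title></head>
<body>
  <nav>Site navigation</nav>
  <div class="team">
    <h2>Martin Stead </h2>
    <p>Chief executive officer</p>
  </div>
  <footer>Footer text</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Select only the container that holds the wanted data
# ("team" is an assumed class name).
team = soup.find("div", class_="team")

# One cleaned string per text fragment, with blank fragments skipped.
lines = list(team.stripped_strings)

# The same content as a single newline-separated string.
text = team.get_text(separator="\n", strip=True)

print(lines)
print(text)
```

Selecting the container first means the nav, footer, and other unwanted tags never need to be stripped out at all.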
