Question

I'm currently trying to scrape a website for some information, but I've run into a problem.

I have a bs4.element.Tag element containing some HTML and text, and when I run "variable.text" I get the following text:

\n\nUlmstead Club\n\t\t\t\t\t911 Lynch Dr\n\n\t\t\t\t\t\tArnold, Maryland\t\t\t\t\t 21012\n\t\t\t\t\tUnited States\n(410) 757-9836 \n\n Get directions\n\n Favorite court \n\n\n\nTennis Court Details\n\n\n\n\n\n\n\t\t\t\t\t\t\t\t\t\tLocation type:\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\tClub\t\t\t\t\t\t\t\t\t\n\n\n\n\t\t\t\t\t\t\t\t\t\tMatches played here:\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\t0\t\t\t\t\t\t\t\t\t\n\n\n\n\t\t\t\t\t\t\t\t\t\t

What I want is to get rid of all the whitespace characters (\n and \t) and get the relevant information as a list or any other iterable.

I have tried a lot of regex commands, but the closest I have come to my goal is re.split('[\t\n]',variable.text), which gives me the following:

['',
 '',
 'Ulmstead Club',
 '',
 '',
 '',
 '',
 '',
 '911 Lynch Dr',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'Arnold, Maryland',
 '',
 '',
 '',
 '',

I cut off most of the output to save space.

I'm quite lost; any help would be greatly appreciated.

Answer 1

Try splitting on the following pattern:

[\t\n]+

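This seems to work because a run of consecutive tabs and newlines is treated as a single separator, which eliminates the empty-string entries between fields in the output array. An empty string can still appear at the very start or end of the list when the text begins or ends with a tab or newline.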
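As a minimal sketch of how this pattern could be applied in Python: the sample string below is a shortened copy of the text from the question, and the helper name split_fields, together with the extra strip-and-filter step, is an assumption added for illustration rather than part of the answer.

```python
import re

# Shortened sample of the text from the question (for illustration only).
sample = (
    "\n\nUlmstead Club\n\t\t\t\t\t911 Lynch Dr\n\n\t\t\t\t\t\t"
    "Arnold, Maryland\t\t\t\t\t 21012\n\t\t\t\t\tUnited States\n"
    "(410) 757-9836 \n\n Get directions\n\n"
)

def split_fields(text):
    """Split on runs of tabs/newlines, then trim each piece and drop empties."""
    pieces = re.split(r"[\t\n]+", text)
    # The split leaves '' at both ends and stray spaces such as ' 21012',
    # so strip every piece and keep only the non-empty ones.
    return [piece.strip() for piece in pieces if piece.strip()]

fields = split_fields(sample)
print(fields)
```

The strip step is there because the pattern only covers tabs and newlines, so single spaces next to a separator would otherwise stay attached to the fields.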

Answer 2

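My guess is that this simple expression might also help:

(?:\\n|\\t)

Read as a regex (for example in a Python raw string), \\n matches a literal backslash followed by n, so this pattern applies when the escape sequences appear in the text as plain characters. In a regular Python string literal the same text reduces to \n and \t and matches real newlines and tabs.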
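A small sketch of what this expression matches when it is written as a Python raw string. The input strings are hypothetical, and the trailing + quantifier is an addition for this example so that a run of escapes counts as one separator.

```python
import re

# Hypothetical input where the escapes are literal characters: a backslash
# followed by "n" or "t", as in text copied from a repr() or a log file.
escaped = r"Ulmstead Club\n\t\t911 Lynch Dr\n\nArnold, Maryland"

# The expression from the answer, with "+" added (an assumption) so that
# consecutive escapes are treated as a single separator.
pattern = re.compile(r"(?:\\n|\\t)+")

literal_fields = [part.strip() for part in pattern.split(escaped) if part.strip()]
print(literal_fields)

# With real newline and tab characters the same pattern finds nothing,
# so the string comes back as a single unsplit element.
real = "Ulmstead Club\n\t\t911 Lynch Dr"
print(pattern.split(real))
```

For text that contains real newline and tab characters, as bs4 returns it, a character class such as [\t\n]+ is the form that matches.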

Answer 3

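You can use str.replace() to get rid of \n and \t; you don't really need a regex for that (in the next step I replace each \n and \t with two spaces):

# Tag.text is a read-only property, so keep the cleaned string in a new variable
text = variable.text.replace("\n", "  ")
text = text.replace("\t", "  ")

If you want to split the data into a list, you can split on runs of whitespace and then drop the leftover empty strings from the list (note that I'm not 100% sure how you want the data separated; I just came up with a solution that matches my own idea of how to separate it):

result = re.split(r"\s{2,}", text)
result = [field for field in result if field]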

Here is the full code example:

import re

teststring ="\n\nUlmstead Club\n\t\t\t\t\t911 Lynch Dr\n\n\t\t\t\t\t\tArnold, Maryland\t\t\t\t\t 21012\n\t\t\t\t\tUnited States\n(410) 757-9836 \n\n Get directions\n\n Favorite court \n\n\n\nTennis Court Details\n\n\n\n\n\n\n\t\t\t\t\t\t\t\t\t\tLocation type:\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\tClub\t\t\t\t\t\t\t\t\t\n\n\n\n\t\t\t\t\t\t\t\t\t\tMatches played here:\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\t0\t\t\t\t\t\t\t\t\t\n\n\n\n\t\t\t\t\t\t\t\t\t\t"

teststring = teststring.replace("\n","  ")
teststring = teststring.replace("\t","  ")

# split wherever two or more whitespace characters separate the fields
result = re.split(r"\s{2,}", teststring)

# drop the empty strings left at the start and end of the list
result = [field for field in result if field]

print(result)

The result is:

['Ulmstead Club', '911 Lynch Dr', 'Arnold, Maryland', '21012', 'United States', '(410) 757-9836', 'Get directions', 'Favorite court', 'Tennis Court Details', 'Location type:', 'Club', 'Matches played here:', '0']

Answer 4

I would run two regex replacements on the string, the first one below and then the second.

Find: \s*(?:\r?\n)\s*
Replace: \n

https://regex101.com/r/EGTyKB/1

Find: [ ]*\t+[ ]*
Replace: \t

https://regex101.com/r/XIyi44/1

This clears out all the whitespace debris and turns the string into a
readable block of text.

Ulmstead Club
911 Lynch Dr
Arnold, Maryland 21012
United States
(410) 757-9836
Get directions
Favorite court
Tennis Court Details
Location type:
Club
Matches played here:
0
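The two replacements above can be sketched in Python with re.sub. The function name tidy_whitespace and the shortened sample string are assumptions made for this example; the two patterns are the ones given in the answer.

```python
import re

def tidy_whitespace(text):
    """Apply the two replacements in order: line breaks first, then tabs."""
    # Step 1: each line break plus any surrounding whitespace becomes one "\n".
    text = re.sub(r"\s*(?:\r?\n)\s*", "\n", text)
    # Step 2: each run of tabs plus adjacent spaces becomes one "\t".
    text = re.sub(r"[ ]*\t+[ ]*", "\t", text)
    return text

# Shortened sample of the text from the question (for illustration only).
sample = (
    "\n\nUlmstead Club\n\t\t\t\t\t911 Lynch Dr\n\n\t\t\t\t\t\t"
    "Arnold, Maryland\t\t\t\t\t 21012\n\t\t\t\t\tUnited States\n"
)

cleaned = tidy_whitespace(sample)
# A single "\n" remains at each end, so strip before splitting into lines.
lines = cleaned.strip().split("\n")
print(lines)
```

With this approach the city and ZIP code stay on one line, separated by a single tab, because only tabs sit between them in the input.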

Using regex to format a string and remove whitespace characters other than spaces
