Question

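I have extracted the text from an HTML file and put the whole thing into a single string.

I am looking for a way to loop through the string, extract only the values inside square brackets, and put those strings into a list.

I have looked at several questions, one of them being Extract character before and after "/"

But I am having a hard time adapting it. Can someone help?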

Solved!

Thanks for all your input, I will definitely look into regex more. I managed to do what I wanted in a very manual way (it may not be pretty):

import html2text

# remove all HTML markup and append the plain text to one string
# (html_file is assumed to be an iterable of HTML strings, e.g. an open file)
html_string = ''
for i in html_file:
    html_string += str(html2text.html2text(i))

# True while the current character is inside [ ... ]
add = False
clean_string = ''

# keep only the characters from [ to ], based on add = True/False
for i in html_string:
    if i == '[':
        add = True
    if i == ']':
        add = False
        clean_string += i
    if add:
        clean_string += i

# clean_string now looks like '[a][b][c]': drop the outermost brackets,
# then split into a list without square brackets
clean_string_list = clean_string[1:-1].split('][')
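For comparison, the same character-by-character scan can collect each value straight into a list, which avoids the intermediate string and the final split. This is only a sketch: the function name extract_bracketed and the sample text are made up for illustration, and it assumes the brackets are not nested.

```python
def extract_bracketed(text):
    """Collect the text between each '[' and the next ']' in a single pass."""
    values = []      # finished bracketed values
    current = None   # characters of the value being read, or None when outside brackets
    for ch in text:
        if ch == '[':
            current = []            # start a new value (restarts on a stray '[')
        elif ch == ']':
            if current is not None:
                values.append(''.join(current))
            current = None
        elif current is not None:
            current.append(ch)
    return values

print(extract_bracketed('spam[eggs][hello] and [two words]'))
# ['eggs', 'hello', 'two words']
```

Unlike the split-based version, this also works when there is other text between one closing bracket and the next opening bracket.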

The HTML file that I want as plain text (later in a dataframe) rather than HTML is my personal Facebook data, which I downloaded.

Answer 1

Try this regular expression. Given a string, it puts all the text inside [] into a list.

import re
print(re.findall(r'\[(\w+)\]','spam[eggs][hello]'))
>>> ['eggs', 'hello']

This is also a good reference for building your own regular expressions: https://regex101.com
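One thing to watch for: \w+ only matches letters, digits and underscores, so a bracketed value containing spaces or punctuation is skipped. If your data has such values, a negated character class may be a better fit. The sample string below is invented for illustration.

```python
import re

text = 'spam[eggs][hello world][2019-05-01]'

# \w+ only matches letters, digits and underscores,
# so values with spaces or dashes are skipped
word_only = re.findall(r'\[(\w+)\]', text)
print(word_only)   # ['eggs']

# [^\[\]]+ matches any run of characters that are not square brackets
any_text = re.findall(r'\[([^\[\]]+)\]', text)
print(any_text)    # ['eggs', 'hello world', '2019-05-01']
```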

Edit: If you have nested square brackets, this function can handle that case.

import re
test ='spam[eg[nested]gs][hello]'

def square_bracket_text(test_text,found):
    """Find text enclosed in square brackets within a string"""
    matches = re.findall(r'\[(\w+)\]',test_text)
    if matches:
        found.extend(matches)
        for word in found:
            test_text = test_text.replace('[' + word + ']','')
        square_bracket_text(test_text,found)
    return found

match = []
print(square_bracket_text(test,match))
>>>['nested', 'hello', 'eggs']
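Note that the function above removes each inner pair and then joins what is left around it, which is why the outer value comes out as 'eggs'. If you would rather keep the outer contents verbatim, a stack of open-bracket positions is one alternative. This is a sketch of a different technique, not a drop-in replacement, and bracket_contents is a made-up name.

```python
def bracket_contents(text):
    """Return the contents of every [...] pair, in the order the pairs are closed."""
    stack = []    # indices of '[' characters that have not been closed yet
    found = []
    for i, ch in enumerate(text):
        if ch == '[':
            stack.append(i)
        elif ch == ']' and stack:
            start = stack.pop()               # matching '[' for this ']'
            found.append(text[start + 1:i])   # text strictly between the pair
    return found

print(bracket_contents('spam[eg[nested]gs][hello]'))
# ['nested', 'eg[nested]gs', 'hello']
```

Unmatched closing brackets are ignored, and an opening bracket that is never closed produces no entry.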

Hope this helps!

Answer 2

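You can also use re.finditer(); see the example below.

Assume the square brackets contain only word characters, so the regular expression is \[\w+\].

If you like, you can check it at https://rextester.com/XEMOU85362.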

import re

s = "<h1>Hello [Programmer], you are [Excellent]</h1>"
g = re.finditer(r"\[\w+\]", s)  # raw string, so \[ and \w are not treated as string escapes
l = list() # or, l = []

for m in g: 
    text = m.group(0)
    l.append(text[1: -1]) 

print(l) # ['Programmer', 'Excellent']
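A small variation, in case it is useful: putting parentheses around \w+ makes it a capture group, so m.group(1) gives the text without the brackets and there is no need to slice. Each match object also knows where the text was found, via m.span(1). The variable name results is just for illustration.

```python
import re

s = "<h1>Hello [Programmer], you are [Excellent]</h1>"

# Capture group 1 holds the text inside the brackets;
# span(1) is its (start, end) position in s
results = [(m.group(1), m.span(1)) for m in re.finditer(r"\[(\w+)\]", s)]

for word, (start, end) in results:
    print(word, start, end)
# Programmer 11 21
# Excellent 33 42
```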
