将正则表达式与html标签结合

时间:2019-03-25 11:38:02

标签: python html regex

我在html页面中有以下文字:

page = 
"""
<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1. Business/</font> Unless otherwise indicated by the context, we use the terms “GE” and “GECC” on the basis of consolidation described in Note 1 to the consolidated financial statements in Part II, Item 8. “Financial Statements and Supplementary Data” of this Form 10-K Report. Also, unless otherwise indicated by the context, “General Electric” means the parent company, General Electric Company (the Company).

General Electric’s address is 1 River Road, Schenectady, NY 12345-6999; we also maintain executive offices at 3135 Easton Turnpike, Fairfield, CT 06828-0001.

<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1A. Risk Factors</font>"""

我想找到获得第1项业务与第1A项风险因素之间的文本。我不能使用beautifulsoup,因为每个页面都有不同的html标记结构。我使用以下代码获取文本,但是它不起作用:

regexs = ('bold;\">\s*Item 1\.(.+?)bold;\">\s*Item 1A\.',   #<===pattern 1: with an attribute bold before the item subtitle
              'b>\s*Item 1\.(.+?)b>\s*Item 1A\.',               #<===pattern 2: with a tag <b> before the item subtitle
              'Item 1\.\s*<\/b>(.+?)Item 1A\.\s*<\/b>',         #<===pattern 3: with a tag <\b> after the item subtitle          
              'Item 1\.\s*Business\.\s*<\/b(.+?)Item 1A\.\s*Risk Factors\.\s*<\/b') #<===pattern 4: with a tag <\b> after the item+description subtitle 

for regex in regexs:
    match = re.search(regex, page, flags=re.IGNORECASE|re.DOTALL)  #<===search for the pattern in HTML using re.search from the re package. Ignore cases.
    if match:
        soup = BeautifulSoup(match.group(1), "html.parser") #<=== match.group(1) returns the texts inside the parentheses (.*?) 

            #soup.text removes the html tags and only keep the texts
            #rawText = soup.text.encode('utf8') #<=== you have to change the encoding the unicodes
        rawText = soup.text
        print(rawText)
        break

预期输出为:

Unless otherwise indicated by the context, we use the terms “GE” and “GECC” on the basis of consolidation described in Note 1 to the consolidated financial statements in Part II, Item 8. “Financial Statements and Supplementary Data” of this Form 10-K Report. Also, unless otherwise indicated by the context, “General Electric” means the parent company, General Electric Company (the Company).

General Electric’s address is 1 River Road, Schenectady, NY 12345-6999; we also maintain executive offices at 3135 Easton Turnpike, Fairfield, CT 06828-0001.

我认为,第一个正则表达式应该与模式匹配,但不匹配

编辑:这是实际的htm页面和检索文本的方式:

# Import the libraries
import requests
from bs4 import BeautifulSoup
import re
url = "https://www.sec.gov/Archives/edgar/data/40545/000004054513000036/geform10k2012.htm"
HEADERS = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"}
response = requests.get(url, headers=HEADERS)
print(response.status_code)

page = response.text
#Pre-processing the html content by removing extra white space and combining then into one line.
page = page.strip()  #<=== remove white space at the beginning and end
page = page.replace('\n', ' ') #<===replace the \n (new line) character with space
page = page.replace('\r', '') #<===replace the \r (carriage returns -if you're on windows) with space
page = page.replace('&nbsp;', ' ') #<===replace "&nbsp;" (a special character for space in HTML) with space. 
page = page.replace('&#160;', ' ') #<===replace "&#160;" (a special character for space in HTML) with space.
page = page.replace(u'\xa0', ' ') #<===replace "&#160;" (a special character for space in HTML) with space.
page = page.replace(u'/s/', ' ') #<===replace "&#160;" (a special character for space in HTML) with space.
while '  ' in page:
    page = page.replace('  ', ' ') #<===remove extra space

5 个答案:

答案 0 :(得分:0)

如下所示?

import re
page =  """
<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1. Business/</font> Unless otherwise indicated by the context, we use the terms “GE” and “GECC” on the basis of consolidation described in Note 1 to the consolidated financial statements in Part II, Item 8. “Financial Statements and Supplementary Data” of this Form 10-K Report. Also, unless otherwise indicated by the context, “General Electric” means the parent company, General Electric Company (the Company).

General Electric’s address is 1 River Road, Schenectady, NY 12345-6999; we also maintain executive offices at 3135 Easton Turnpike, Fairfield, CT 06828-0001.

<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1A. Risk Factors</font>"""

data = re.search('Item 1\. Business\/<\/font> (.*)(<font(.*)">Item 1A. Risk Factors)', page, flags=re.DOTALL).group(1)
print(data)

答案 1 :(得分:0)

如果更改正则表达式该怎么办?

regexs = ('Item 1\.\s*Business\/(.*)',
          'Item 1\.\s*Business\.\s*<\/b(.+?)Item 1A\.\s*Risk Factors\.\s*<\/b')

行得通吗?

答案 2 :(得分:0)

我首先通过贪婪地隔离所有类型的序列来“解析” HTML

<font[^>]*>([^<>]*)</font>([^<>]+)

这会给我类似的东西,

( 'Item 1. Business/', 'Unless otherwise indicated ... CT 06828-0001.' ),
( 'Item 1A. Risk Factors', '...')

并处理您的注释“有时在文本中使用“ <项目>业务”和“项目1A风险因素” ”指出的问题。在这里,文本只能是每个元组的第二个元素,而您基本上完全忽略了它。

然后,我将检查每个比赛的第一个元素中的内容,以识别“项目1”。与“项目1A”。捕获周期将在找到第一个关键字后立即开始,跳过关键字本身,然后停止查找第二个关键字。

答案 3 :(得分:0)

所以,我在正则表达式中尝试了NOT TO USE "<font>",因为您说它可能会有所不同,所以我希望它能起作用。不过,在您的情况下,有很多方法可以破坏正则表达式,因为在很多情况下(肯定是您的情况),实际上不应使用正则表达式来解析XML

>>> import re



>>> string  = '''
<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1. Business/</font> Unless otherwise indicated by the context, we use the terms “GE” and “GECC” on the basis of consolidation described in Note 1 to the consolidated financial statements in Part II, Item 8. “Financial Statements and Supplementary Data” of this Form 10-K Report. Also, unless otherwise indicated by the context, “General Electric” means the parent company, General Electric Company (the Company).

General Electric’s address is 1 River Road, Schenectady, NY 12345-6999; we also maintain executive offices at 3135 Easton Turnpike, Fairfield, CT 06828-0001.

<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1A. Risk Factors</font>'''




>>> result = re.findall('Item[\s]*1.[\s]*Business[/<]*[\S]*?[>]*[\s]+([\S\s]+?)[/<]+[\S\s]*?[>]*?Item 1A. Risk Factors', string)





#Output
>>> print(result[0])
Unless otherwise indicated by the context, we use the terms “GE” and “GECC” on the basis of consolidation described in Note 1 to the consolidated financial statements in Part II, Item 8. “Financial Statements and Supplementary Data” of this Form 10-K Report. Also, unless otherwise indicated by the context, “General Electric” means the parent company, General Electric Company (the Company).

General Electric’s address is 1 River Road, Schenectady, NY 12345-6999; we also maintain executive offices at 3135 Easton Turnpike, Fairfield, CT 06828-0001.

答案 4 :(得分:0)

因此,鉴于该页面的全文,您可能会遭受重创。老实说,您对问题的描述非常容易引起误解,但无论您是谁,这可能都是您想要的,但它是大事

>>> import re
>>> import requests


>>> page = requests.get("https://www.sec.gov/Archives/edgar/data/40545/000004054513000036/geform10k2012.htm").text



>>> segment_of_page = re.findall('(?i)align=[\"]*center[\"]*[\S\ ]+?Part[\s]*I(?!I)[\S\s]+?Item[\S\s]*?1(?![\d]+)[\S\s]{1,50}Business[\S\s]{40,}?>Item[\S\s]{1,50}1A\.[\S\s]{1,50}(?=Risk)', page)



>>> parsed_data_sets = []



>>> for i in range(len(segment_of_page)):
        if len(segment_of_page[i]) > 35:
            parsed_data = re.findall('(?:<[\S\s]+?>)+([\S\s]+?)(?=<[\S\s]+?>)+', segment_of_page[i])
            for ii in range(len(parsed_data)):
                parsed_data_sets.append(parsed_data[ii])


>>> for i in range(len(parsed_data_sets)):
        if len(parsed_data_sets[i]) > 35:
            print('\n\n\n===============\n\n')
            print(parsed_data_sets[i])





#Output
===============


Unless otherwise indicated by the context, we use the terms &#8220;GE&#8221; and &#8220;GECC&#8221; on the basis of consolidation described in Note 1 to the consolidated financial statements in Part II, Item 8. &#8220;Financial Statements and Supplementary Data&#8221; of this Form 10-K Report. Also, unless otherwise indicated by the context, &#8220;General Electric&#8221; means the parent company, General Electric Company (the Company).


===============


General Electric&#8217;s address is 1 River Road, Schenectady, NY 12345-6999; we also maintain executive offices at 3135 Easton Turnpike, Fairfield, CT 06828-0001.


===============


We are one of the largest and most diversified infrastructure and financial services corporations in the world. With products and services ranging from aircraft engines, power generation, oil and gas production equipment, and household appliances to medical imaging, business and consumer financing and industrial products, we serve customers in more than 100 countries and employ approximately 305,000 people worldwide. Since our incorporation in 1892, we have developed or acquired new technologies and services that have broadened and changed considerably the scope of our activities.


===============

自您上次提取字符串以来,某些文档已更改,但请告诉我是否可行。