Question

I want to extract contact numbers from part of an HTML page. I used:

import codecs
import re

from bs4 import BeautifulSoup

page = codecs.open('D:/Edureka/Sample Addresses!.html', 'r+')
page1 = page.read()
soup = BeautifulSoup(page1, 'html.parser')
soup.prettify()
# Drop script and style elements so only visible text remains
for script in soup(["script", "style"]):
    script.extract()
text = soup.get_text()
# Strip each line, split on double spaces, and drop empty chunks
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)
contacts = re.findall(r"\d{3} \d{3}-\d{4}", text)
for j in contacts:
    print(j)

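This does not give me the result I want. If I use

contacts = re.findall(r"\d{3} \d{3}", text)

it gives me the first six digits. Whenever I add -\d{4}, it stops working. Please help. I need the 10-digit contact numbers.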

Sample text

Ivor DelgadoAp #310-1678 Ut Av.Santa Barbara MT 88317(689) 721-5145 Pascale PattonP.O. Box 399 4275 Amet StreetWest Allis NC 36734(676) 334-2174 Nasim StrongAp #630-3889 Nulla. StreetWatervliet Oklahoma 70863(437) 994-5270 Keaton UnderwoodAp #636-8082 Arcu AvenueThiensville Maryland 19587(564) 908-6970 Keegan BlairAp #761-2515 Egestas. Rd.Manitowoc TN 07528(577) 333-6244 Tamara Howe3415 Lobortis. AvenueRocky Mount WA 48580(655) 840-6139

Answer 1

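You are not matching the parentheses: your regular expression only finds numbers of the form 111 222-3333, while the numbers in your sample look like (111) 222-3333. I think you can escape the parentheses in the pattern, as in the session below.

Working example: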

>>> import re
>>> text = "Keaton UnderwoodAp #636-8082 Arcu AvenueThiensville Maryland 19587(564) 908-6970 Keegan BlairAp #761-2515 Egestas. Rd.Manitowoc TN 07528(577) 333-6244 Tamara Howe3415 Lobortis. AvenueRocky Mount WA 48580(655) 840-6139"
>>> re.findall(r"\(\d{3}\) \d{3}-\d{4}", text)
['(564) 908-6970', '(577) 333-6244', '(655) 840-6139']
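
Since the question asks for the 10 digits themselves, a slightly more tolerant variant may be useful. The sketch below is only an illustration and not part of the original answer: the names PHONE_RE and extract_phone_numbers are hypothetical, and it assumes that every number is written as (AAA) BBB-CCCC, with or without whitespace after the closing parenthesis.

```python
import re

# Hypothetical pattern (an assumption, not from the original answer):
# \( and \) match literal parentheses, \s* allows zero or more
# whitespace characters after ")", and the three groups capture the digits.
PHONE_RE = re.compile(r"\((\d{3})\)\s*(\d{3})-(\d{4})")


def extract_phone_numbers(text):
    """Return every phone number found in text as a 10-digit string."""
    # With several groups, findall() yields one tuple of groups per match.
    return ["".join(groups) for groups in PHONE_RE.findall(text)]


sample = (
    "Maryland 19587(564) 908-6970 Keegan BlairAp #761-2515 "
    "Egestas. Rd.Manitowoc TN 07528(577)333-6244"
)
print(extract_phone_numbers(sample))
```

Here the address fragment #761-2515 is skipped because it has no parenthesised area code, and (577)333-6244 is still picked up even though it has no space after the parenthesis.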
