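I have an HTML file containing a large number of relative href links, such as:
<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a><br/>
The file also contains many other http and ftp links.
I need an output txt file like this: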
14/02/08: station1_140208.txt
14/02/09: station1_140209.txt
14/02/10: station1_140210.txt
14/02/11: station1_140211.txt
14/02/12: station1_140212.txt
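I tried writing this myself, but it will take me a long time to get used to Python regular expressions. I can open the source file and write the result back to disk; what I can't work out is the specific regular expression to apply in between.
I need your help with the regular expression. Thanks.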
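Answer 0 (score: 2)
I know this isn't exactly what you asked for, but I thought I'd show a way to convert the dates in your link text into the format shown in your desired output example (yy/mm/dd). I use BeautifulSoup to read the elements from the HTML.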
from bs4 import BeautifulSoup
import datetime as dt
import re
html = '<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a><br/>'
p = re.compile(r'.*/station1_\d+\.txt')
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a warning
a_tags = soup.find_all('a', {"href": p})
>>> print(a_tags)  # a list of all <a> tags in the html whose href matches the pattern
[<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a>]
names = [str(a.get('href')).split('/')[-1] for a in a_tags]  # str() because the values are unicode in Python 2
dates = [dt.datetime.strptime(str(a.text), '%A, %B %d, %Y') for a in a_tags]  # %d is the day of the month
names and dates are built with list comprehensions, and strptime creates datetime objects from the date strings:
>>> print(names)  # a list of all file names from the hrefs
['station1_140208.txt']
>>> print(dates)  # a list of all dates as datetime objects
[datetime.datetime(2014, 2, 8, 0, 0)]
toFileData = ["{0}: {1}".format(d.strftime('%y/%m/%d'), n) for d, n in zip(dates, names)]  # zip pairs each date with its own file name
strftime reformats the date into the format from your example:
>>> print(toFileData)
['14/02/08: station1_140208.txt']
Then write the entries in toFileData to a file.
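A minimal sketch of that last step, assuming toFileData is a list of strings like the one built above. The helper name write_lines and the file name output.txt are placeholders for illustration, not part of the original answer.

```python
import os
import tempfile

# Stand-in for the toFileData list built above (sample values only).
to_file_data = [
    "14/02/08: station1_140208.txt",
    "14/02/09: station1_140209.txt",
]


def write_lines(path, lines):
    """Write each entry on its own line."""
    with open(path, "w") as fh:
        for line in lines:
            fh.write(line + "\n")


# Hypothetical output path; a temp directory keeps the example free of side effects.
out_path = os.path.join(tempfile.mkdtemp(), "output.txt")
write_lines(out_path, to_file_data)
```

In a real script the path would simply be wherever the output file should live.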
For information on the methods used in the code above, such as soup.find_all() and a.get(), I suggest looking at the BeautifulSoup documentation linked at the top. Hope this helps.
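If installing BeautifulSoup is not an option, roughly the same link filtering can be sketched with the standard library's html.parser module in Python 3. This is an alternative technique, not what the answer above uses. The class name DatedLinkParser, the prefix argument, and the ftp://example.com URL are made up for the example.

```python
from html.parser import HTMLParser


class DatedLinkParser(HTMLParser):
    """Collect file names from <a> tags whose href starts with a given prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith(self.prefix):
            # Keep only the file name, like a.get('href').split('/')[-1].
            self.names.append(href.split("/")[-1])


sample = (
    '<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a><br/>'
    '<a href="ftp://example.com/pub/readme.txt">mirror</a><br/>'
)
parser = DatedLinkParser("data/self/dated/")
parser.feed(sample)
print(parser.names)
```

Because the check is on the relative prefix, the http and ftp links mentioned in the question are skipped without any extra rule.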
Answer 1 (score: 0)
pattern = r'href="data/self/dated/([^"]*)"[^>]*>([\s\S]*?)</a>'
Test:
import re
s = """
<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a>
<br/>
<a href="data/self/dated/station1_1402010.txt">Saturday, February 10, 2014</a>
<br/>
<a href="data/self/dated/station1_1402012.txt">Saturday, February 12, 2014</a>
<br/>
"""
pattern = r'href="data/self/dated/([^"]*)"[^>]*>([\s\S]*?)</a>'  # raw string so \s and \S reach the regex engine unchanged
re.findall(pattern,s)
Output:
[('station1_140208.txt', 'Saturday, February 08, 2014'), ('station1_1402010.txt', 'Saturday, February 10, 2014'), ('station1_1402012.txt', 'Saturday, February 12, 2014')]
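To get from matches like these to the exact lines requested in the question, one option is to read the date out of the file name itself rather than the link text. The sketch below assumes every file name is station1_ followed by six digits in yymmdd order and a .txt extension. The names link_re and format_links and the http://example.com link are illustrative only.

```python
import re

html = '''
<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a><br/>
<a href="http://example.com/other.txt">unrelated</a><br/>
<a href="data/self/dated/station1_140209.txt">Sunday, February 09, 2014</a><br/>
'''

# Group 1 is the whole file name; groups 2-4 are the yy, mm and dd digit pairs.
link_re = re.compile(
    r'href="data/self/dated/(station1_(\d{2})(\d{2})(\d{2})\.txt)"'
)


def format_links(text):
    """Return one 'yy/mm/dd: filename' string per matching link."""
    return [
        "{0}/{1}/{2}: {3}".format(yy, mm, dd, name)
        for name, yy, mm, dd in link_re.findall(text)
    ]


lines = format_links(html)
print("\n".join(lines))
```

Since the pattern is anchored on the data/self/dated/ prefix, absolute http and ftp links never match. File names with a different digit count, such as station1_1402010.txt in the test string above, would also be skipped under this assumption.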