POSIX regex to match link <a> tags in HTML

Time: 2017-03-11 01:50:20

Tags: c regex posix

I'm using the regex.h library for my C program.

I need to download all files whose links are stored in <a> tags in HTML data. So my first task is to extract the contents of each tag's "href" attribute.

I use this address to practice: http://students.iitk.ac.in/programmingclub/course/lectures/

Its HTML content contains many tags, such as

<a href="1.%20Introduction%20to%20C%20language%20and%20Linux.pdf">
<a href="1.%20Introduction%20to%20C%20language%20and%20Linux.ppt">
<a href="1.%20Introduction%20to%20C%20language%20and%20Linux.pptx">
...

I wrote a regex string to extract the contents of the "href" attribute:
char regex[] = "href=\"([a-zA-Z0-9%.,]*\\.[a-zA-Z0-9]*{1,4})\"";

What I expect from the regex (I can handle the full match and the group match myself):

1.%20Introduction%20to%20C%20language%20and%20Linux.pdf
1.%20Introduction%20to%20C%20language%20and%20Linux.ppt
1.%20Introduction%20to%20C%20language%20and%20Linux.pptx
...

What I get is only the first link (I only care about the group match):

1.%20Introduction%20to%20C%20language%20and%20Linux.pdf

Have a nice day, and thank you very much.

P.S. I pass REG_EXTENDED to regcomp().
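
For reference, a single regexec() call reports only the leftmost match in the string it is given, which would explain getting just the first link. A common way to collect every match is to call regexec() in a loop, each time starting from the end of the previous match. The following is a minimal sketch of that idea, not a tested solution for the page above. The helper name extract_hrefs, the MAX_LINKS and MAX_LINK_LEN limits, and the fixed-size output array are assumptions made for illustration. The pattern is the one from the question with "*{1,4}" reduced to "{1,4}", since POSIX leaves adjacent repetition operators undefined in extended regular expressions.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

#define MAX_LINKS 64      /* hypothetical limit on stored links */
#define MAX_LINK_LEN 256  /* hypothetical limit on one href value */

/* Hypothetical helper (not from the original post): copies each captured
 * href value into out[] and returns how many were found, or -1 if the
 * pattern fails to compile. */
static int extract_hrefs(const char *html, char out[][MAX_LINK_LEN], int max_out)
{
    /* Pattern from the question, with a single {1,4} interval. */
    static const char pattern[] =
        "href=\"([a-zA-Z0-9%.,]*\\.[a-zA-Z0-9]{1,4})\"";
    regex_t re;
    regmatch_t m[2];            /* m[0] = whole match, m[1] = capture group */
    const char *cursor = html;  /* where the next search starts */
    int count = 0;

    assert(html != NULL && out != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* Search again from the end of the previous match until regexec()
     * stops finding anything or the output array is full. */
    while (count < max_out && regexec(&re, cursor, 2, m, 0) == 0) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len >= MAX_LINK_LEN)
            len = MAX_LINK_LEN - 1;  /* truncate overly long values */
        memcpy(out[count], cursor + m[1].rm_so, len);
        out[count][len] = '\0';
        count++;
        cursor += m[0].rm_eo;        /* offsets are relative to cursor */
    }

    regfree(&re);
    return count;
}
```

A caller would declare something like char links[MAX_LINKS][MAX_LINK_LEN], pass the downloaded HTML as a NUL-terminated string, and then iterate over the returned count. Because every match contains at least the text href=", the cursor always moves forward and the loop terminates.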

0 answers:

No answers