Question

I'm using the regex.h library for my C program.

I need to download all files whose links are stored in <a> tags in HTML data, so my first task is to extract the contents of the "href" attribute.

I use this address to practice: http://students.iitk.ac.in/programmingclub/course/lectures/

Its HTML content contains many tags such as

<a href="1.%20Introduction%20to%20C%20language%20and%20Linux.pdf">
<a href="1.%20Introduction%20to%20C%20language%20and%20Linux.ppt">
<a href="1.%20Introduction%20to%20C%20language%20and%20Linux.pptx">
...

I wrote a regex string to extract the contents of the "href" attribute:

char regex[] = "href=\"([a-zA-Z0-9%.,]*\\.[a-zA-Z0-9]*{1,4})\"";

What I expect from the regex (I can handle the full match and the group match myself):

1.%20Introduction%20to%20C%20language%20and%20Linux.pdf
1.%20Introduction%20to%20C%20language%20and%20Linux.ppt
1.%20Introduction%20to%20C%20language%20and%20Linux.pptx
...

What I receive is only the first link (I only care about the group match):

1.%20Introduction%20to%20C%20language%20and%20Linux.pdf

Have a nice day, and thank you very much.

PS: I use REG_EXTENDED with regcomp().
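
A possible explanation, offered as a guess since the post does not show the calling code: regexec() reports only the first match in the string it is given, so collecting every link means calling it in a loop and restarting the search just past the previous match. The sketch below assumes the whole page is already in a NUL-terminated buffer. The names extract_hrefs and MAX_LINK_LEN are hypothetical, and the pattern ends in {1,4} rather than the *{1,4} of the original, because stacking two repetition operators is undefined in POSIX extended regular expressions.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

#define MAX_LINK_LEN 256 /* hypothetical limit on the length of one link */

/*
 * Hypothetical helper: copies the value of every href="..." that matches
 * the pattern into out[], storing at most max_links entries.
 * Returns the number of links stored, or -1 if the pattern fails to compile.
 */
int extract_hrefs(const char *html, char out[][MAX_LINK_LEN], int max_links)
{
    /* Same idea as the regex in the question, with a single {1,4} interval. */
    static const char pattern[] =
        "href=\"([a-zA-Z0-9%.,]*\\.[a-zA-Z0-9]{1,4})\"";
    regex_t re;
    regmatch_t m[2]; /* m[0] = whole match, m[1] = capture group */
    const char *cursor = html;
    int count = 0;

    assert(html != NULL && out != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* Each regexec() call finds only the first match at or after cursor. */
    while (count < max_links && regexec(&re, cursor, 2, m, 0) == 0) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);

        if (len >= MAX_LINK_LEN)
            len = MAX_LINK_LEN - 1; /* truncate overly long links */
        memcpy(out[count], cursor + m[1].rm_so, len);
        out[count][len] = '\0';
        count++;

        /* Offsets are relative to cursor, so step past the whole match. */
        cursor += m[0].rm_eo;
    }

    regfree(&re);
    return count;
}
```

Calling extract_hrefs(page, links, 64) with char links[64][MAX_LINK_LEN] would fill links with one entry per matching tag. The whole match always contains at least href="", so the cursor moves forward on every iteration and the loop terminates.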

POSIX regex to match links in HTML <a> tags

0 answers: