Unable to extract specific data with a regular expression

Time: 2016-06-20 19:01:32

Tags: regex python-2.7 web-scraping

So this is my regex:

re.findall(r'(?<=data-author=")(.*)(?=" data-author-fullname)', string)

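I am trying to extract the username, in this case "zCourge_iDX", but for some reason my regex matches everything up to the next instance of "data-author-fullname". I can include more information if necessary.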

"zCourge_iDX" data-author-fullname="t2_6ups9" ><p class="parent"><a name="d4gqsup"></a></p><div class="midcol unvoted" ><div class="arrow up login-required access-required" data-event-action="upvote" role="button" aria-label="upvote" tabindex="0" ></div><div class="arrow down login-required access-required" data-event-action="downvote" role="button" aria-label="downvote" tabindex="0" ></div></div><div class="entry unvoted"><p class="tagline"><a href="javascript:void(0)" class="expand" onclick="return togglecomment(this)">[–]</a><a href="https://www.reddit.com/user/zCourge_iDX" class="author may-blank id-t2_6ups9" >zCourge_iDX</a><span class="userattrs"></span>&#32;<span class="score dislikes">0 points</span><span class="score unvoted">1 point</span><span class="score likes">2 points</span>&#32;<time title="Mon Jun 20 15:50:56 2016 UTC" datetime="2016-06-20T15:50:56+00:00" class="live-timestamp">12 minutes ago</time>&nbsp;<a href="javascript:void(0)" class="numchildren" onclick="return togglecomment(this)">(2 children)</a></p><form action="#" class="usertext warn-on-unload" onsubmit="return post_form(this, 'editusertext')" id="form-t1_d4gqsupk6x"><input type="hidden" name="thing_id" value="t1_d4gqsup"/><div class="usertext-body may-blank-within md-container "><div class="md"><p>Have you seen the box office reports?</p>
</div>
</div></form><ul class="flat-list buttons"><li class="first"><a href="https://www.reddit.com/r/movies/comments/4oygzu/warcraft_is_now_the_biggest_video_game_movie_of/d4gqsup" data-event-action="permalink" class="bylink" rel="nofollow" >permalink</a></li><li><a href="javascript:void(0)" data-comment="/r/movies/comments/4oygzu/warcraft_is_now_the_biggest_video_game_movie_of/d4gqsup" data-media="www.redditmedia.com" data-link="/r/movies/comments/4oygzu/warcraft_is_now_the_biggest_video_game_movie_of/" data-root="false" data-title="Warcraft is now the biggest video game movie of all-time" class="embed-comment" >embed</a></li><li class="comment-save-button save-button"><a href="javascript:void(0)">save</a></li><li><a href="#d4gqo0n" data-event-action="parent" class="bylink" rel="nofollow" >parent</a></li><li class="report-button"><a href="javascript:void(0)" class="reportbtn access-required" data-event-action="report">report</a></li><li class="give-gold-button"><a href="/gold?goldtype=gift&months=1&thing=t1_d4gqsup" title="give reddit gold in appreciation of this post." class="give-gold login-required access-required" data-event-action="gild" >give gold</a></li><li class="reply-button"><a class="access-required" href="javascript:void(0)" data-event-action="comment" onclick="return reply(this)">reply</a></li></ul><div class="reportform report-t1_d4gqsup"></div></div><div class="child"><div id="siteTable_t1_d4gqsup" class="sitetable listing"><div class=" thing id-t1_d4gqxdy noncollapsed &#32; comment " id="thing_t1_d4gqxdy" onclick="click_thing(this)" data-fullname="t1_d4gqxdy" data-type="comment" data-subreddit="movies" data-subreddit-fullname="t5_2qh3s" data-author="Serialdan </b>

1 answer:

Answer 0: (score: 0)

As Oren mentioned in the comments, it is unclear why you are using a lookbehind in your regex.

Try

 re.findall(r'.*"(.+?)"\s+data-author-fullname', string)
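To make the difference concrete, here is a minimal sketch in Python 3 (the question is tagged python-2.7, but these patterns should behave the same there). The `html` string is a hypothetical, shortened stand-in for the page in the question, and the `t2_example` value is made up for illustration. It compares the greedy pattern from the question, the pattern suggested above, and two variants anchored on `data-author="`, which are an assumption about what the asker wants rather than part of the original answer.

```python
import re

# Hypothetical, shortened sample with two comments on a single line.
# "t2_example" is a made-up placeholder, not a value from the real page.
html = (
    '<div data-author="zCourge_iDX" data-author-fullname="t2_6ups9" ></div>'
    '<div data-author="Serialdan" data-author-fullname="t2_example" ></div>'
)

# Pattern from the question: the greedy (.*) runs to the LAST
# '" data-author-fullname' on the line, so it swallows both comments.
greedy = re.findall(r'(?<=data-author=")(.*)(?=" data-author-fullname)', html)

# Pattern from the answer: the leading greedy .* means only the last
# username on a given line is returned.
answer_style = re.findall(r'.*"(.+?)"\s+data-author-fullname', html)

# Variant anchored on the attribute name, with a lazy quantifier.
lazy = re.findall(r'data-author="(.+?)"\s+data-author-fullname', html)

# Variant with a negated character class: the value cannot cross a quote.
negated = re.findall(r'data-author="([^"]+)"\s+data-author-fullname', html)

print(len(greedy))    # 1
print(answer_style)   # ['Serialdan']
print(lazy)           # ['zCourge_iDX', 'Serialdan']
print(negated)        # ['zCourge_iDX', 'Serialdan']
```

The negated character class is often the easier one to reason about, because the captured value can never run past the closing quote regardless of what follows.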

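The non-greedy match will capture the username, but I would still recommend parsing HTML with something other than regular expressions, such as mechanize or BeautifulSoup.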
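As a rough illustration of the parser-based approach, the sketch below uses the standard library's `html.parser.HTMLParser` (Python 3) instead of mechanize or BeautifulSoup, purely so that it runs without third-party packages. The `AuthorCollector` class name and the shortened `html` sample are hypothetical, and `t2_example` is a made-up placeholder.

```python
from html.parser import HTMLParser


class AuthorCollector(HTMLParser):
    """Collects the value of every data-author attribute it encounters."""

    def __init__(self):
        super().__init__()
        self.authors = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current start tag.
        for name, value in attrs:
            if name == "data-author" and value:
                self.authors.append(value)


# Hypothetical, shortened sample; the real page has many more attributes.
html = (
    '<div class="thing" data-author="zCourge_iDX" data-author-fullname="t2_6ups9">'
    '<p>Have you seen the box office reports?</p></div>'
    '<div class="thing" data-author="Serialdan" data-author-fullname="t2_example"></div>'
)

collector = AuthorCollector()
collector.feed(html)
collector.close()
print(collector.authors)  # ['zCourge_iDX', 'Serialdan']
```

Because the parser works on attributes rather than raw text, it does not depend on attribute order or on the exact whitespace between `data-author` and `data-author-fullname`.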