I'm scraping a local archive of techcrunch.com. I'm using a regex to grab the heading of every article, but my output is always just the last occurrence.
    def extractNews():
        selection = listbox.curselection()
        if selection == (0,):
            # Read the webpage:
            response = urlopen("file:///E:/University/IFB104/InternetArchive/Archives/Sun,%20October%201st,%202017.html")
            html = response.read()
            match = findall((r'<h2 class="post-title"><a href="(.*?)".*>(.*)</a></h2>'), str(html)) # use [-2] for position after )
            if match:
                for link, title in match:
                    variable = "%s" % (title)
                print(variable)
The current output is
Heetch raises $12 million to reboot its ridesharing service
which is the last heading on the entire webpage.
Each article block on the page uses the same markup for its heading:
<h2 class="post-title"><a href="https://web.archive.org/web/20171001000310/https://techcrunch.com/2017/09/29/heetch-raises-12-million-to-reboot-its-ride-sharing-service/" data-omni-sm="gbl_river_headline,20">Heetch raises $12 million to reboot its ridesharing service</a></h2>
I cannot see why it keeps returning this last match. I have run the pattern through sites such as https://regex101.com/, which tells me that I only have one match, and it is not the one my program outputs. Any help would be greatly appreciated.
EDIT: If anyone knows a way to write each matched result SEPARATELY, between its own <h1></h1> tags, when writing to a .html file, it would mean a lot :) I am not sure if this is right, but I think you use [-#] for the position/match being referred to?
Answer 0 (score: 0)
The regex is fine, but your problem is in the loop here.
    if match:
        for link, title in match:
            variable = "%s" % (title)
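Each iteration overwrites your variable. That is why you only see the value from the last iteration of the loop.
You could do something along these lines: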
    if match:
        variableList = []
        for link, title in match:
            variable = "%s" % (title)
            # Keep every title instead of overwriting the previous one
            variableList.append(variable)
        print(variableList)
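Since regex101 reported only a single match, it may also be worth checking whether the greedy `.*` parts of the pattern are swallowing several headings at once. Assuming Python 3, `str(html)` on the bytes returned by `read()` keeps the whole page on one line, so a greedy `.*` can run from the first heading to the last. The snippet below is a minimal sketch of that effect; the sample HTML, the `example.com` links, and the alternative `lazy` pattern are made up for illustration and are not taken from your page.

```python
import re

# Two headings on a single line (hypothetical sample data, not the real page).
html = (
    '<h2 class="post-title"><a href="https://example.com/a" data-x="1">First title</a></h2>'
    '<p>filler</p>'
    '<h2 class="post-title"><a href="https://example.com/b" data-x="2">Second title</a></h2>'
)

# Pattern from the question: both ".*" parts are greedy.
greedy = r'<h2 class="post-title"><a href="(.*?)".*>(.*)</a></h2>'

# Assumed alternative: "[^>]*" stays inside the <a> tag and "(.*?)" stops
# at the first closing </a></h2> instead of the last one.
lazy = r'<h2 class="post-title"><a href="(.*?)"[^>]*>(.*?)</a></h2>'

greedy_matches = re.findall(greedy, html)
lazy_matches = re.findall(lazy, html)

print(len(greedy_matches), greedy_matches)  # one match: first link, last title
print(len(lazy_matches), lazy_matches)      # one match per heading
```

If the greedy pattern gives you one match pairing the first link with the last title, that would explain the output you are seeing regardless of how the loop is written.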
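Also, in general, I would advise against using regex to parse HTML (per the famous answer).
If you are not already familiar with BeautifulSoup, you should be. Here is a non-regex solution that uses BeautifulSoup to dig out all the h2 post titles from your HTML page.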
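    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    # Collect the text of every <h2 class="post-title"> heading
    titles = [h2.get_text(strip=True) for h2 in soup.find_all('h2', {'class': 'post-title'})]
    print(titles)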
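As for the EDIT about writing each title between its own <h1></h1> tags: once the titles are in a list, one possible approach is to build one heading per title and write them all to a file. This is only a sketch using the standard library; the function name `titles_to_html`, the output file name `headlines.html`, and the sample titles are assumptions made for illustration.

```python
from html import escape


def titles_to_html(titles):
    """Return a small HTML document with one <h1> element per title."""
    # escape() keeps characters such as & and < from breaking the markup.
    headings = "\n".join("<h1>{}</h1>".format(escape(t)) for t in titles)
    return "<!DOCTYPE html>\n<html>\n<body>\n{}\n</body>\n</html>\n".format(headings)


# Hypothetical list; in practice this would be the list of extracted titles.
titles = [
    "First title",
    "Heetch raises $12 million to reboot its ridesharing service",
]
page = titles_to_html(titles)

# Write the document to disk (the file name is an assumption).
with open("headlines.html", "w", encoding="utf-8") as f:
    f.write(page)
```

Iterating over the list this way avoids indexing matches by position (the `[-#]` idea), since every title gets its own heading automatically.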