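I'm currently taking a big data course, but I'm new to the subject. For an assignment, I want to find out which topics are discussed on the TripAdvisor forum about Amsterdam. I want to create a CSV file containing the topic, the author, and the number of replies for each topic. Some questions:
The topic title comes after onclick="setPID(34603)" and ends with </a>. I tried re.findall(r'onclick="setPID(34603)">(.*?)</a>', post) but it doesn't work. Here is my code: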
from urllib import request
import re
import csv

topiclist = []
metalist = []
req = request.Request('https://www.tripadvisor.com/ShowForum-g188590-i60-Amsterdam_North_Holland_Province.html', headers={'User-Agent': "Mozilla/5.0"})
tekst = request.urlopen(req).read()
tekst = tekst.decode(encoding="utf-8", errors="ignore").replace("\n", " ").replace("\t", " ")
topicsection = re.findall(r'<b><a(.*?)</div>', tekst)
topic = []
for post in topicsection:
    topic.append(re.findall(r'onclick="setPID(34603)">(.*?)</a>', post))
author = []
for post in topicsection:
    author.append(re.findall(r'<a href="/members-forums/.*?">(.*?)</a>', post))
replies = re.findall(r'<td class="reply rowentry.*?">(.*?)</td>', tekst)
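A likely reason the topic pattern above matches nothing: in a regular expression, `(` and `)` delimit a capture group, so `setPID(34603)` looks for the literal text `setPID34603` rather than `setPID(34603)`. The sketch below shows the difference on a made-up HTML fragment; the `sample` string is an assumption for illustration, not the real page markup.

```python
import re

# Hypothetical fragment shaped like the markup described in the question.
sample = '<b><a href="/ShowTopic-x.html" onclick="setPID(34603)">Canal cruise tips</a></b>'

# Unescaped: "(34603)" is treated as a capture group, so the pattern
# searches for the text "setPID34603", which is not in the fragment.
broken = re.findall(r'onclick="setPID(34603)">(.*?)</a>', sample)

# Escaped: "\(" and "\)" match literal parentheses.
fixed = re.findall(r'onclick="setPID\(34603\)">(.*?)</a>', sample)

print(broken)  # []
print(fixed)   # ['Canal cruise tips']
```

`re.escape("setPID(34603)")` produces the escaped form automatically when the literal text comes from a variable.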
Answer 0 (score: 3)
Don't use regular expressions to parse HTML. Use an HTML parser such as BeautifulSoup.
For example:
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.tripadvisor.com/ShowForum-g188590-i60-Amsterdam_North_Holland_Province.html")
soup = BeautifulSoup(r.content, "html.parser") #or another parser such as lxml
topics = soup.find_all("a", {'onclick': 'setPID(34603)'})
#do stuff
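One way the "do stuff" step might continue toward the CSV the question asks for is sketched below. It runs on a small inline HTML sample rather than the live page, and the table layout in `sample` is an assumption pieced together from the patterns in the question (the `onclick` attribute, the `/members-forums/` links, and the `reply rowentry` cells); the real page structure may differ, so the selectors would need checking against the actual markup.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical rows; the real forum page's markup may differ.
sample = """
<table>
  <tr>
    <td><b><a href="/ShowTopic-1.html" onclick="setPID(34603)">Canal cruise tips</a></b></td>
    <td><a href="/members-forums/alice">alice</a></td>
    <td class="reply rowentry">12</td>
  </tr>
  <tr>
    <td><b><a href="/ShowTopic-2.html" onclick="setPID(34603)">Museum passes</a></b></td>
    <td><a href="/members-forums/bob">bob</a></td>
    <td class="reply rowentry">3</td>
  </tr>
</table>
"""

soup = BeautifulSoup(sample, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    # Topic link: identified by its onclick attribute, as in the answer above.
    topic = tr.find("a", {"onclick": "setPID(34603)"})
    # Author link: any <a> whose href starts with /members-forums/.
    author = tr.find("a", href=lambda h: h and h.startswith("/members-forums/"))
    # Reply count: a <td> whose class list includes "reply".
    replies = tr.find("td", class_="reply")
    if topic and author and replies:
        rows.append([
            topic.get_text(strip=True),
            author.get_text(strip=True),
            replies.get_text(strip=True),
        ])

# Write to an in-memory buffer here; a real file works the same way.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["topic", "author", "replies"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a file instead of a string, the buffer can be swapped for `open("topics.csv", "w", newline="", encoding="utf-8")`, where the filename is only a placeholder.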