I want to save the movie reviews and movie titles from these two pages.
correct = 0
total = 0
with torch.no_grad():
    for data in test_loader:
        images, labels = data
        outputs = network(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels.float().argmax(dim=1)).sum().item()
print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))
This is what I get when I run this code and open the csv file.
https://movie.naver.com/movie/bi/mi/pointWriteFormList.nhn?code=~
https://movie.naver.com/movie/bi/mi/basic.nhn?code=~
How can I fix this code?
Answer 0 (score: 1)
One thing you can try in order to clean this up is to convert the result to a string first, and then apply constraints based on the HTML, like this:
title = str(soup.find('h3', 'h_movie'))
start = '" title="'
end = ' , 2018">'
newTitle = title[title.find(start)+len(start):title.rfind(end)]
Then try the same approach on the review section: narrow the result set down to where the reviews live, convert that to a string, and apply constraints to it.
That will give you clean data, ready to be added to a DataFrame.
Hope this helps put you on the right path!
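As a rough sketch, the same slicing idea applied to a review span might look like the following. The markup and the `start`/`end` markers here are hypothetical stand-ins; the real ones have to be read off the live Naver page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a Naver review span;
# the real start/end markers must be taken from the actual page HTML.
html = '<span id="_filtered_ment_0">  A great movie!  </span>'
soup = BeautifulSoup(html, "html.parser")

# Convert the tag to a string, then slice between the chosen markers
review = str(soup.find("span", id="_filtered_ment_0"))
start = '">'       # the text begins right after the opening tag closes
end = '</span>'    # and ends at the closing tag
clean = review[review.find(start) + len(start):review.rfind(end)].strip()
print(clean)
```

In practice `get_text()` is usually simpler than string slicing, but the slicing approach works when you need only a fragment of the tag's raw HTML.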
Answer 1 (score: 0)
It's clean now… just remove the tags, like this:
from bs4 import BeautifulSoup
from urllib.request import urlopen
#from selenium import webdriver
from urllib.request import urljoin
import pandas as pd
import requests
import re

#url_base = 'https://movie.naver.com/movie/bi/mi/pointWriteFormList.nhn?code=25917&type=after&page=1'
base_url = 'https://movie.naver.com/movie/bi/mi/pointWriteFormList.nhn?code='  # review page
base_url2 = 'https://movie.naver.com/movie/bi/mi/basic.nhn?code='  # movie title
pages = ['177374', '164102']
df = pd.DataFrame()

for n in pages:
    # Create the review and title URLs for this movie code
    url = base_url + n
    url2 = base_url2 + n

    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    reple = soup.find("span", {"id": re.compile("^_filtered_ment")}).getText()

    res2 = requests.get(url2)
    soup = BeautifulSoup(res2.text, "html.parser")
    title = soup.find('h3', 'h_movie')
    for a in title.find_all('a'):
        #print(a.text)
        title = a.text

    data = {'title': [title], 'reviewn': [reple]}
    df = df.append(pd.DataFrame(data))

df.to_csv('./title.csv', sep=',', encoding='utf-8-sig')
I added `import re` for the regex matching the `_filtered_ment_*` ids.
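One note on the snippet above: `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so on a recent pandas the row accumulation can be done with `pd.concat` instead. A minimal sketch with placeholder data (not the live scrape):

```python
import pandas as pd

# Placeholder rows standing in for the scraped (title, review) pairs
scraped = [
    {'title': ['Movie A'], 'reviewn': ['Great!']},
    {'title': ['Movie B'], 'reviewn': ['Not bad.']},
]

# Collect one small DataFrame per page, then concatenate once at the end
frames = [pd.DataFrame(data) for data in scraped]
df = pd.concat(frames, ignore_index=True)
print(df)
```

Building a list of frames and concatenating once is also faster than repeated appends, since each append copies the whole accumulated DataFrame.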