I'm trying to scrape https://m.the-numbers.com/market/2018/top-grossing-movies, specifically the table, into a CSV. I'm using Python and Beautiful Soup, but I'm new to this and would appreciate tips on any solution. What are some simple ways to go about this?
Thanks
Here is my latest attempt below...
from bs4 import BeautifulSoup
import requests
import csv
source = requests.get('https://m.the-numbers.com/market/2018/top-grossing-movies').text
soup = BeautifulSoup(source, 'lxml')
csv_file = open('cms_scrape.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['filmTitle', 'releaseDate', 'distributor', 'genre', 'gross', 'ticketsSold'])
for tbody in soup.find_all('a', class_='table-responsive'):
    filmTitle = tbody.tr.td.b.a.text
    print(filmTitle)
    csv_writer.writerow([filmTitle])
csv_file.close()
Answer 0 (score: 2)
Assuming you already have the value of source, you can do the following:
import pandas as pd
df = pd.read_html(source)[0]
df.to_csv('cms_scrape.csv', index=False)
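If you don't already have source, a minimal end-to-end sketch might look like the code below. It assumes the requests, pandas, and lxml packages are installed; the User-Agent header is an assumption added in case the site rejects the default requests client, and the grossing table is assumed to be the first table on the page.

import pandas as pd
import requests

url = 'https://m.the-numbers.com/market/2018/top-grossing-movies'

# Fetch the page HTML; the User-Agent header is an assumption, not required by the site per se.
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

# read_html returns a list of DataFrames, one per <table> in the HTML;
# here the desired table is assumed to be the first one.
df = pd.read_html(resp.text)[0]
df.to_csv('cms_scrape.csv', index=False)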
Answer 1 (score: 1)
Code like the one below will do the job.
import requests
from bs4 import BeautifulSoup
import csv
# Making get request
r = requests.get('https://m.the-numbers.com/market/2018/top-grossing-movies')
# Creating BeautifulSoup object
soup = BeautifulSoup(r.text, 'lxml')
# Localizing table from the BS object
table_soup = soup.find('div', id='page_filling_chart').find('div', class_='table-responsive').find('table')
# Iterating through all trs in the table except the first(header) and the last two(summary) rows
movies = []
for tr in table_soup.find_all('tr')[1:-2]:
    tds = tr.find_all('td')
    # Creating dict for each row and appending it to the movies list
    movies.append({
        'rank': tds[0].text.strip(),
        'movie': tds[1].text.strip(),
        'release_date': tds[2].text.strip(),
        'distributor': tds[3].text.strip(),
        'genre': tds[4].text.strip(),
        'gross': tds[5].text.strip(),
        'tickets_sold': tds[6].text.strip(),
    })
# Writing movies list of dicts to file using csv.DictWriter
with open('movies.csv', 'w', encoding='utf-8', newline='\n') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=movies[0].keys())
    writer.writeheader()
    writer.writerows(movies)
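To sanity-check the output, a small sketch like the one below reads the CSV back with csv.DictReader and prints the first few rows; it only assumes movies.csv was written by the code above.

import csv

# Read the CSV back and print a few rows to verify the scrape worked.
with open('movies.csv', encoding='utf-8', newline='') as f:
    rows = list(csv.DictReader(f))

print(len(rows), 'rows written')
for row in rows[:3]:
    print(row['rank'], row['movie'], row['gross'])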