How do I scrape this table using Python and Beautiful Soup?

Asked: 2019-10-28 22:44:53

Tags: python web-scraping beautifulsoup

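I'm trying to scrape https://m.the-numbers.com/market/2018/top-grossing-movies, specifically the table, into a CSV. I'm using Python and Beautiful Soup, but I'm new to this and would appreciate tips on any solution. What are some simple ways to approach this?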

Thanks

Here is my latest attempt:

from bs4 import BeautifulSoup
import requests
import csv

source = requests.get('https://m.the-numbers.com/market/2018/top-grossing-movies').text

soup = BeautifulSoup(source, 'lxml')

csv_file = open('cms_scrape.csv', 'w')

csv_writer = csv.writer(csv_file)
csv_writer.writerow(['filmTitle', 'releasDate', 'distributor', 'genre', 'gross', 'ticketsSold'])

for tbody in soup.find_all('a', class_='table-responsive'):

    filmTitle = tbody.tr.td.b.a.text
    print(filmTitle)

    csv_writer.writerow([filmTitle])

csv_file.close()

2 Answers:

Answer 0 (score: 2)

Assuming you already have the value of source, you can do the following:

import pandas as pd
df = pd.read_html(source)[0]
df.to_csv('cms_scrape.csv', index=False)
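
Recent pandas releases deprecate passing a raw HTML string to `read_html` and expect a file-like object instead. Here is a minimal sketch of the same idea. It assumes the table of interest is the first one in the HTML and that an HTML parser such as lxml is installed. The helper name `html_table_to_csv` and the sample HTML are made up for illustration and are not data from the site.

```python
from io import StringIO

import pandas as pd


def html_table_to_csv(html, csv_path, table_index=0):
    """Parse the tables in an HTML string and write one of them to CSV."""
    # read_html returns a list of DataFrames, one per <table> it finds.
    tables = pd.read_html(StringIO(html))
    df = tables[table_index]
    df.to_csv(csv_path, index=False)
    return df


# Placeholder HTML standing in for the real page so the sketch runs offline.
sample_html = """
<table>
  <tr><th>Rank</th><th>Movie</th><th>Gross</th></tr>
  <tr><td>1</td><td>Film A</td><td>$100</td></tr>
  <tr><td>2</td><td>Film B</td><td>$50</td></tr>
</table>
"""
df = html_table_to_csv(sample_html, 'cms_scrape.csv')
print(df)
```

With the real page you would pass `source` from the question in place of `sample_html`.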

Answer 1 (score: 1)

Something like the code below should do the job.


import requests
from bs4 import BeautifulSoup
import csv

# Making get request
r = requests.get('https://m.the-numbers.com/market/2018/top-grossing-movies')

# Creating BeautifulSoup object
soup = BeautifulSoup(r.text, 'lxml')

# Locating the table in the BeautifulSoup object
table_soup = soup.find('div', id='page_filling_chart').find('div', class_='table-responsive').find('table')

# Iterating through all rows in the table except the first (header) and the last two (summary) rows
movies = []
for tr in table_soup.find_all('tr')[1:-2]:
    tds = tr.find_all('td')

    # Creating dict for each row and appending it to the movies list
    movies.append({
        'rank': tds[0].text.strip(),
        'movie': tds[1].text.strip(),
        'release_date': tds[2].text.strip(),
        'distributor': tds[3].text.strip(),
        'genre': tds[4].text.strip(),
        'gross': tds[5].text.strip(),
        'tickets_sold': tds[6].text.strip(),
    })

# Writing movies list of dicts to file using csv.DictWriter
with open('movies.csv', 'w', encoding='utf-8', newline='') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=movies[0].keys())
    writer.writeheader()
    writer.writerows(movies)
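
The slice `[1:-2]` relies on the page having exactly one header row and two summary rows. A slightly more defensive sketch keeps only the rows that contain seven `td` cells. It assumes the same seven-column layout, and it assumes the summary rows have fewer cells, which has not been checked against the live page. The function name `extract_rows` and the sample HTML are illustrative placeholders, not data from the site.

```python
from bs4 import BeautifulSoup

FIELDS = ['rank', 'movie', 'release_date', 'distributor',
          'genre', 'gross', 'tickets_sold']


def extract_rows(html):
    """Return one dict per table row that has exactly len(FIELDS) <td> cells."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.find_all('tr'):
        tds = tr.find_all('td')
        # Header rows (<th> cells) and shorter summary rows are skipped here.
        if len(tds) != len(FIELDS):
            continue
        rows.append({name: td.get_text(strip=True)
                     for name, td in zip(FIELDS, tds)})
    return rows


# Placeholder HTML standing in for the real page so the sketch runs offline.
sample_html = """
<table>
  <tr><th>Rank</th><th>Movie</th><th>Release Date</th><th>Distributor</th>
      <th>Genre</th><th>Gross</th><th>Tickets Sold</th></tr>
  <tr><td>1</td><td><b><a href="#">Film A</a></b></td><td>Jan 1</td>
      <td>Studio X</td><td>Action</td><td>$1,000</td><td>100</td></tr>
  <tr><td colspan="5">Total</td><td>$1,000</td><td>100</td></tr>
</table>
"""
movies = extract_rows(sample_html)
print(movies)
```

The resulting list of dicts can be passed to `csv.DictWriter` in the same way as `movies` above.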