Tweets scraped from wapo gone wrong

Posted: 2017-08-25 08:39:04

Tags: python csv web-scraping

Two problems:

  1. The goal is a 3-column csv of tweets with the headers date, time, and tweets. My attempt to extract the span text (the time) from each li duplicates it: the span's time also ends up inside the tweets column. This is my first week using python; I tried to .replace() the time in the tweets column with "", but I ended up deleting the time from both columns (see the sketch after this list).

  2. Getting the columns combined in the right order, i.e. keeping the data aligned as it appears on the page. The code I wrote produces 30,000 or 1,000 rows; the correct csv file should be about 520 rows.
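For problem 1, note that str.replace() takes an optional count argument; with count=1 only the first occurrence of the time string is removed from the tweet text, not every instance. A minimal sketch with made-up strings:

    # str.replace(old, new, count): count=1 removes only the first
    # occurrence (the strings below are made up for illustration)
    time = '9:46 AM'
    tweet = '9:46 AM - a tweet that also mentions 9:46 AM'
    cleaned = tweet.replace(time, '', 1).strip()
    print(cleaned)  # - a tweet that also mentions 9:46 AM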

My code:

    import bs4 as bs
    import requests, urllib.request, csv
    from urllib.request import urlopen

    sauce = urllib.request.urlopen('https://www.washingtonpost.com/graphics/politics/100-days-of-trump-tweets/?utm_term=.0c2052f6d858').read()
    soup = bs.BeautifulSoup(sauce, 'html.parser')

    lists = soup.find_all('li', class_='visible')
    dates = soup.find_all("li", attrs={"data-date": True})

    tweet_data = ['date, time, tweets']

    # problem 2: the dates are appended as rows of their own here...
    for li in dates[1:]:
        date = li['data-date']
        tweet_data.append([date])

    # ...and the time/tweet pairs as separate rows here, so the
    # columns never line up
    for item in lists[1:]:
        time = item.find_all('span', {"class": "gray"})[0].text
        # problem 1: item.text still contains the span's time text
        tweets = item.text
        tweet_data.append([time, tweets])

    with open('tweets_attempt_8.csv', 'w') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows(tweet_data)

2 Answers:

Answer 0 (score: 2):

Here is the code you need. I hope this answer works for you.

    import bs4 as bs
    import urllib2, csv
    import sys

    # Python 2 workaround so the utf-8 tweet text can be written to the csv
    reload(sys)
    sys.setdefaultencoding('utf-8')

    url = 'https://www.washingtonpost.com/graphics/politics/100-days-of-trump-tweets/?utm_term=.0c2052f6d858'

    # some sites reject requests that carry no User-Agent header
    sauce = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})
    con = urllib2.urlopen(sauce)
    data = con.read()
    soup = bs.BeautifulSoup(data, 'html.parser')

    lists = soup.find_all('li', class_='visible')
    dates = soup.find_all("li", attrs={"data-date": True})

    # the header must be a list of three fields; a bare string would be
    # written out one character per column
    tweet_data = [['date', 'time', 'tweets']]

    # zip pairs each date with its matching li, so every row stays aligned
    for li, item in zip(dates[1:], lists[1:]):
        date = li['data-date']
        time = item.find_all('span', {"class": "gray"})[0].text
        tweets = item.text
        tweet_data.append([date, time, tweets])

    with open('/tmp/tweets_attempt_8.csv', 'w') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows(tweet_data)

This produces the output in the format you asked for.
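The fix over the question's code is the single loop over zip(dates[1:], lists[1:]): zip pairs the two lists element by element and stops at the shorter one, so each date stays on the same row as its time and tweet. A tiny illustration with made-up data:

    # zip walks both lists in lockstep and stops at the shorter one
    dates = ['2017-01-20', '2017-01-21', '2017-01-22']
    tweets = ['first tweet', 'second tweet']
    for d, t in zip(dates, tweets):
        print('%s | %s' % (d, t))
    # 2017-01-20 | first tweet
    # 2017-01-21 | second tweet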

Answer 1 (score: 0):

Try this. There are 504 rows to parse in that page, and you'll get all of them in the csv output.

    import csv
    import requests
    from bs4 import BeautifulSoup

    with open('tweets_attempt_8.csv', 'w', newline='', encoding='utf8') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(['date', 'time', 'tweets'])

        # a User-Agent header keeps the request from being turned away
        sauce = requests.get('https://www.washingtonpost.com/graphics/politics/100-days-of-trump-tweets/?utm_term=.0c2052f6d858', headers={"User-Agent": "Existed"}).text
        soup = BeautifulSoup(sauce, "html.parser")

        for item in soup.select("li.pg-excerpt.visible"):
            date = item.get('data-date')
            time = item.select("span.gray")[0].text
            title = item.text.strip()
            # slice off the fixed-length prefix embedded at the start of the text
            print(date, time, title[10:])
            writer.writerow([date, time, title[10:]])
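
One caveat: the title[10:] slice assumes the prefix embedded in each li's text is always exactly ten characters. A less position-dependent alternative (a sketch, assuming the same page structure) is to detach the gray time span from the tree before reading the text:

    # sketch: remove the <span class="gray"> element so item.text
    # no longer contains the time at all (assumes the same li markup)
    for item in soup.select("li.pg-excerpt.visible"):
        span = item.select_one("span.gray")
        if span is not None:
            span.extract()            # detach the time span from the tree
        tweet = item.text.strip()     # remaining text is the tweet body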