pandas CSV reader produces wrong results

Time: 2020-03-30 02:59:20

Tags: pandas

I have a Python script that produces dates in the wrong format.

import csv
import urllib
import requests
import numpy as np
from urllib.request import urlopen
from matplotlib.dates import DateFormatter
import matplotlib.pyplot as plt 
import pandas as pd
import io

link = 'https://health-infobase.canada.ca/src/data/covidLive/covid19.csv'
s = requests.get(link).content
coviddata = pd.read_csv(io.StringIO(s.decode('utf-8')),
                        parse_dates=['date'],
                        index_col= ['date'],
                        na_values=['999.99'])
prinput = 'Quebec'
ispr = coviddata['prname'] == prinput
covidpr = coviddata[ispr]
print(covidpr)

The data it produces seems to have the dates scrambled, as shown below.

            pruid  prname prnameFR  ...  numtotal  numtoday  numtested
date                                ...
2020-01-03     24  Quebec   Québec  ...         1         1        NaN
2020-03-03     24  Quebec   Québec  ...         1         0        NaN
2020-05-03     24  Quebec   Québec  ...         2         1        NaN
2020-06-03     24  Quebec   Québec  ...         2         0        NaN
2020-07-03     24  Quebec   Québec  ...         2         0        NaN
2020-08-03     24  Quebec   Québec  ...         3         1        NaN
2020-09-03     24  Quebec   Québec  ...         4         1        NaN
2020-11-03     24  Quebec   Québec  ...         7         3        NaN
2020-12-03     24  Quebec   Québec  ...        13         6        NaN
2020-03-13     24  Quebec   Québec  ...        17         4        NaN
2020-03-14     24  Quebec   Québec  ...        17         0        NaN

By contrast, here is another snippet that works.

import csv
import urllib
import requests
from urllib.request import urlopen
from matplotlib.dates import DateFormatter
import matplotlib.pyplot as plt 
from datetime import datetime
link = 'https://health-infobase.canada.ca/src/data/covidLive/covid19.csv'

text = requests.get(link).text
lines = text.splitlines()
infile = csv.DictReader(lines)
prinput = input("Enter province(EN):")
xvalues=[]
yvalues=[]
for row in infile:
    # Keep only the rows for the requested province
    if row['prname'] == prinput:
        xvalues.append(row['date'])
        yvalues.append(row['numconf'])
        print(row['prname'], row['date'], row['numconf'])

It produces the correct dates:

Quebec 01-03-2020 1
Quebec 03-03-2020 1
Quebec 05-03-2020 2
Quebec 06-03-2020 2
Quebec 07-03-2020 2
Quebec 08-03-2020 3
Quebec 09-03-2020 4
Quebec 11-03-2020 7
Quebec 12-03-2020 13
Quebec 13-03-2020 17
Quebec 14-03-2020 17
Quebec 15-03-2020 24
Quebec 16-03-2020 39
Quebec 17-03-2020 50

What is wrong with the first script?

1 Answer:

Answer 0: (score: 0)

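Because the parse_dates argument is used, pandas interprets the 'date' column as datetime objects. This is useful for plotting the data over time or for resampling it over a given period. If you want to change the datetime format when printing the dataset, you can do so with the dt.strftime method of the datetime series. For example: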

# Import pandas
import pandas as pd

# Read in dataframe from url
covid_df = pd.read_csv("https://health-infobase.canada.ca/src/data/covidLive/covid19.csv", 
                       parse_dates=['date'], na_values=[999.99])

# Create new column date-str that's the string interpretation of the 'date' column
covid_df['date-str'] = covid_df['date'].dt.strftime("%d-%m-%Y")

# Show the top of the dataframe
covid_df.head()
"""
   pruid            prname              prnameFR       date  ...  numtotal  numtoday  numtested    date-str
0     35           Ontario               Ontario 2020-01-31  ...         3         3        NaN  31-01-2020
1     59  British Columbia  Colombie-Britannique 2020-01-31  ...         1         1        NaN  31-01-2020
2      1            Canada                Canada 2020-01-31  ...         4         4        NaN  31-01-2020
3     35           Ontario               Ontario 2020-08-02  ...         3         0        NaN  02-08-2020
4     59  British Columbia  Colombie-Britannique 2020-08-02  ...         4         3        NaN  02-08-2020
"""

# Show dtypes and properties of each column of the dataframe
covid_df.info()
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 302 entries, 0 to 301
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   pruid      302 non-null    int64         
 1   prname     302 non-null    object        
 2   prnameFR   302 non-null    object        
 3   date       302 non-null    datetime64[ns]
 4   numconf    302 non-null    int64         
 5   numprob    302 non-null    int64         
 6   numdeaths  302 non-null    int64         
 7   numtotal   302 non-null    int64         
 8   numtoday   302 non-null    int64         
 9   numtested  0 non-null      float64       
 10  date-str   302 non-null    object        
dtypes: datetime64[ns](1), float64(1), int64(6), object(3)
memory usage: 26.1+ KB
"""
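
One more point worth checking, judging only from the output in the question: the file appears to store dates day-first (dd-mm-YYYY), while pandas reads ambiguous dates such as 01-03-2020 month-first unless told otherwise. In that case dt.strftime only reformats values that were already parsed with day and month swapped. Below is a minimal sketch of two ways to make the parsing explicit. The sample rows and the name sample_csv are made up for illustration so the example runs without downloading the real file.

```python
import io

import pandas as pd

# Hypothetical rows in the same day-first layout as the question's output
# (not the real file contents).
sample_csv = """prname,date,numconf
Quebec,01-03-2020,1
Quebec,03-03-2020,1
Quebec,13-03-2020,17
"""

# Option 1: read the column as text, then parse it with an explicit format.
df = pd.read_csv(io.StringIO(sample_csv), dtype={"date": str})
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y")
df = df.set_index("date")

# Option 2: let read_csv parse the column, but tell it the day comes first.
df_dayfirst = pd.read_csv(io.StringIO(sample_csv),
                          parse_dates=["date"],
                          dayfirst=True)

print(df)
# Formatting back to dd-mm-YYYY now reproduces the strings in the file.
print(df.index.strftime("%d-%m-%Y").tolist())
```

With the explicit format, any value that does not match dd-mm-YYYY raises an error instead of being silently reinterpreted, which makes this kind of mix-up easier to spot.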