Question

I want to read data from a file into a DataFrame, but the file is in a special format. It contains many lines like these:

year = [1, 2, 3]

age = [4, 5, 6]

Here is a link to the file: https://github.com/cuongpiger/Py-for-ML-DS-DV/blob/master/Matplotlib/Chap6_data/dulieu_year_gap_pop_life.txt

Answer 1

If you need all the values, create a dictionary of Series, using ast.literal_eval to parse each list, and pass it to the DataFrame constructor:

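import ast

import pandas as pd

d = {}
with open('dulieu_year_gap_pop_life.txt') as file:
    for line in file:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # each line looks like "name = [v1, v2, ...]"
        h, data = line.split(' = ', 1)
        # wrapping the list in a Series lets pandas pad shorter columns with NaN
        d[h] = pd.Series(ast.literal_eval(data))

df = pd.DataFrame(d)
print(df)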
     year    pop       gdp_cap  life_exp  life_exp1950
0    1950   2.53    974.580338    43.828         28.80
1    1951   2.57   5937.029526    76.423         55.23
2    1952   2.62   6223.367465    72.301         43.08
3    1953   2.67   4797.231267    42.731         30.02
4    1954   2.71  12779.379640    75.320         62.48
..    ...    ...           ...       ...           ...
146  2096  10.81           NaN       NaN           NaN
147  2097  10.82           NaN       NaN           NaN
148  2098  10.83           NaN       NaN           NaN
149  2099  10.84           NaN       NaN           NaN
150  2100  10.85           NaN       NaN           NaN

[151 rows x 5 columns]
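
To see why wrapping each list in a Series matters, here is a minimal sketch on made-up data. The two-line sample and the names `sample` and `columns` are assumptions for illustration, not part of the real file. Because each Series carries its own index, the DataFrame constructor aligns them and fills the shorter column with NaN instead of raising an error:

```python
import ast
import io

import pandas as pd

# Hypothetical stand-in for the real file: same "name = [list]" layout,
# but with lists of different lengths.
sample = io.StringIO(
    "year = [1950, 1951, 1952]\n"
    "pop = [2.53, 2.57]\n"
)

columns = {}
for line in sample:
    line = line.strip()
    if not line:
        continue  # ignore blank lines
    name, _, values = line.partition(" = ")
    columns[name] = pd.Series(ast.literal_eval(values))

# Series are aligned on their index, so the missing third value
# of "pop" becomes NaN.
df = pd.DataFrame(columns)
print(df)
```

Passing the raw lists instead of Series would fail here, since a dict of plain lists must have equal lengths.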

Answer 2

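Because the lists in the input file have different lengths, they cannot be placed directly into a single DataFrame as plain lists. For the first two lists, which have the same length, the following approach works: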

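import pandas as pd
import requests

url = 'https://raw.githubusercontent.com/cuongpiger/Py-for-ML-DS-DV/master/Matplotlib/Chap6_data/dulieu_year_gap_pop_life.txt'
response = requests.get(url)
a = response.content.decode('utf-8')
df = pd.DataFrame()
# only the first two lines (year and pop) hold lists of equal length
for i in a.splitlines()[:2]:
    # i.split()[0] is the column name; the tokens after '=' are the values (kept as strings)
    df[i.split()[0]] = [x.replace(']','').replace('[','').replace(',','') for x in i.split()[2:]]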

df
Out: 
     year    pop
0    1950   2.53
1    1951   2.57
2    1952   2.62
3    1953   2.67
4    1954   2.71
..    ...    ...
146  2096  10.81
147  2097  10.82
148  2098  10.83
149  2099  10.84
150  2100  10.85
[151 rows x 2 columns]
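
One thing to watch with this string-splitting approach is that the resulting columns hold strings, not numbers. A minimal sketch of converting them, assuming a made-up two-line sample (`text` is hypothetical) in which items are separated by a comma and a space, as the split above requires:

```python
import pandas as pd

# Hypothetical two-line sample in the same layout as the file.
text = "year = [1950, 1951, 1952]\npop = [2.53, 2.57, 2.62]"

df = pd.DataFrame()
for line in text.splitlines()[:2]:
    tokens = line.split()
    # tokens[0] is the column name, tokens[1] is "=", the rest are list items
    df[tokens[0]] = [t.strip("[],") for t in tokens[2:]]

# The values are still strings at this point; convert each column.
df = df.apply(pd.to_numeric)
print(df.dtypes)
```

After the conversion, arithmetic and plotting on the columns behave as expected.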

Answer 3

With the help of regular expressions:

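import ast
import re

import pandas as pd

# Empty DataFrame
df = pd.DataFrame()

with open('dulieu_year_gap_pop_life.txt', 'r') as file:
    for line in file:
        group = re.match(r'(.*) = (.*)', line)
        if group is None:
            continue  # skip lines that are not "name = [list]"
        # ast.literal_eval parses the list without running arbitrary code, unlike eval
        df[group[1]] = pd.Series(ast.literal_eval(group[2]))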
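
When the file's contents are not fully trusted, ast.literal_eval is a safer way to parse the right-hand side than eval, because it accepts only Python literals. A small sketch, in which both input lines and the names `parsed` and `rejected` are made up for illustration:

```python
import ast
import re

# Hypothetical lines: the first is well-formed, the second carries an
# expression that is not a plain literal.
lines = [
    "year = [1950, 1951, 1952]",
    "pop = __import__('os').getcwd()",
]

parsed = {}
rejected = []
for line in lines:
    match = re.match(r"(\w+) = (.*)", line)
    if match is None:
        continue  # not a "name = value" line
    name, raw = match[1], match[2]
    try:
        parsed[name] = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # literal_eval refuses function calls and other non-literals
        rejected.append(name)

print(parsed)
print(rejected)
```

With eval, the second line would have been executed as code; here it is only recorded as rejected.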

How do I read a .txt file that contains arrays?
