How to read an irregular text file as a DataFrame in pandas

Time: 2016-07-06 16:55:24

Tags: python parsing pandas

My file contains text in the following format:
<string1> <string2> "some text as a paragraph" .
<string1> <string2> "some text as a paragraph" .
<string1> <string2> "some text as a paragraph" .
<string1> <string2> "some text as a paragraph" .

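string1 and string2 contain no spaces, and each of them is followed by exactly one space. The text inside the double quotes also contains single spaces.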

I can't use pd.read_csv() with sep = " " directly, because then the paragraph is split into an irregular number of columns.

Is there a way to parse a file like this into a DataFrame? Perhaps something using regular expressions.

Thanks

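Below are the first 4 lines of the data, on which pd.read_csv(file_name, sep = " ") works, followed by lines from the same data on which the same code does not. I know I could read this as input with rdflib and go from there, but the reason I am using pandas is that I only need to do very basic adding/replacing of columns here.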

<http://dbpedia.org/resource/Animalia_(book)> <http://www.w3.org/2000/01/rdf-schema#comment> "Animalia is an illustrated children's book by Graeme Base. It was originally published in 1986, followed by a tenth anniversary edition in 1996, and a 25th anniversary edition in 2012. Over three million copies have been sold.   A special numbered and signed anniversary edition was also published in 1996, with an embossed gold jacket."@en .
<http://dbpedia.org/resource/Assistive_technology> <http://www.w3.org/2000/01/rdf-schema#comment> "Assistive technology is an umbrella term that includes assistive, adaptive, and rehabilitative devices for people with disabilities and also includes the process used in selecting, locating, and using them. Assistive technology promotes greater independence by enabling people to perform tasks that they were formerly unable to accomplish, or had great difficulty accomplishing, by providing enhancements to, or changing methods of interacting with, the technology needed to accomplish such tasks."@en .
<http://dbpedia.org/resource/A> <http://www.w3.org/2000/01/rdf-schema#comment> "A (named a /ˈeɪ/, plural aes) is the 1st letter and the first vowel in the ISO basic Latin alphabet. It is similar to the Ancient Greek letter alpha, from which it derives.  The upper-case version consists of the two slanting sides of a triangle, crossed in the middle by a horizontal bar. The lower-case version can be written in two forms: the double-storey a and single-storey ɑ. The latter is commonly used in handwriting and fonts based on it, especially fonts intended to be read by children."@en .
<http://dbpedia.org/resource/Aristotle> <http://www.w3.org/2000/01/rdf-schema#comment> "Aristotle (/ˈærɪˌstɒtəl/; Greek: Ἀριστοτέλης [aristotélɛːs], Aristotélēs; 384 – 322 BC) was a Greek philosopher and scientist born in the Macedonian city of Stagira, Chalkidice, on the northern periphery of Classical Greece. His father, Nicomachus, died when Aristotle was a child, whereafter Proxenus of Atarneus became his guardian. At eighteen, he joined Plato's Academy in Athens and remained there until the age of thirty-seven (c. 347 BC)."@en .

The following lines give an irregular read:

<http://dbpedia.org/resource/Big_Sounds_of_the_Drags> <http://www.w3.org/2000/01/rdf-schema#comment> "Big Sounds of the Drags is the second album by electronic music producer Junkie XL.\"Check Your Basic Groove\" has an unusual introduction. This portion begins with the sounds of various farm animals (cows for example), then more layers of sound effects are added (including a supercar) until the song segues to the music."@en .
<http://dbpedia.org/resource/Sydney_Roosters_Juniors> <http://www.w3.org/2000/01/rdf-schema#comment> "The Sydney Roosters Juniors is officially known as the Eastern Suburbs District Junior Rugby League. It is an affiliation of junior clubs in the Eastern Suburbs area, covering the Woollahra and Waverley local government areas (LGAs), the northern parts of the Randwick LGA and also the eastern areas of the City of Sydney LGA."@en .
<http://dbpedia.org/resource/A_Shot_at_Glory> <http://www.w3.org/2000/01/rdf-schema#comment> "A Shot at Glory is a film by Michael Corrente produced in 1999 and released in 2001, starring Robert Duvall and the Scottish football player Ally McCoist. It had limited commercial and critical success. The film features the fictional Scottish football club Kilnockie, as they attempt to reach their first Scottish Cup Final. The final game is against Rangers."@en .
<http://dbpedia.org/resource/Kumar_Ponnambalam> <http://www.w3.org/2000/01/rdf-schema#comment> "Kumar Ponnambalam (August 12, 1940 – January 5, 2000) was a prominent defence lawyer and a controversial minority Tamil nationalist politician from Sri Lanka. He was shot dead by unknown gunmen immediately after a suspected LTTE suicide bomb attack against the then president Chandrika Kumaratunga."@en .
<http://dbpedia.org/resource/Amalia_Mendoza> <http://www.w3.org/2000/01/rdf-schema#comment> "Amalia Mendoza García (10 July 1923 – 11 June 2001), nicknamed La Tariácuri, was a Mexican singer and actress. \"Échame a mi la culpa\" and \"Amarga navidad\" were some of her greatest hits."@en .

3 answers:

Answer 0: (score: 1)

read_csv() with the backslash as the escape character actually worked for me on both data samples:

df = pd.read_csv("input.txt", sep=" ", header=None, escapechar="\\").iloc[:, :-1]
print(df)

The column slicing is only there to drop the last column, which contains nothing but the dots.
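A minimal, self-contained sketch of the same idea, run against an in-memory sample instead of a file. The example.org URLs and the column names subject, predicate and text are made up for illustration. As far as I can tell, pandas' default parser keeps characters that follow the closing quote as part of the field, so a suffix such as @en would end up inside the text column; that is worth checking on your own data.

```python
import io

import pandas as pd

# Two made-up lines in the same shape as the question's data; the second
# one contains backslash-escaped double quotes inside the paragraph.
sample = io.StringIO(
    '<http://example.org/a> <http://example.org/comment> "Plain text with spaces."@en .\n'
    '<http://example.org/b> <http://example.org/comment> "Has \\"quoted\\" words."@en .\n'
)

# escapechar="\\" makes the parser treat \" as a literal quote rather than
# as the end of the quoted field.
df = pd.read_csv(sample, sep=" ", header=None, escapechar="\\")

# Drop the trailing column that holds only the terminating dots.
df = df.iloc[:, :-1]

# Hypothetical column names, chosen here for readability.
df.columns = ["subject", "predicate", "text"]
print(df)
```

Without escapechar, the escaped quotes would end the quoted field early and the paragraph would spill over into extra columns, which matches the irregular reads described in the question.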

Answer 1: (score: 0)

This works even on the lines you flagged as giving irregular reads.

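import re
import pandas as pd

col1 = []
col2 = []
col3 = []
with open('input.txt', 'r') as f:
    for line in f:
        # Capture the two <...> tokens and the quoted text; the last group is
        # greedy, so it runs up to the final double quote on the line.
        m = re.match(r'^<(.*)> <(.*)> "(.*)"', line)
        if m is None:
            continue  # skip blank or malformed lines instead of raising AttributeError
        g = m.groups()
        col1.append(g[0])
        col2.append(g[1])
        col3.append(g[2])

df = pd.DataFrame({'col1': col1, 'col2': col2, 'col3': col3})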
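A possible variation on the regex above, offered as a sketch rather than a full parser for this format: it uses a stricter pattern that steps over backslash-escaped quotes and captures the optional language tag (the @en suffix) in its own column. The pattern, the function name parse_lines, the column names and the example.org URLs are all assumptions made for illustration.

```python
import re

import pandas as pd

# subject, predicate, quoted text (allowing \" escapes), optional @lang tag,
# then the terminating dot.
TRIPLE = re.compile(
    r'^<([^>]*)> <([^>]*)> "((?:[^"\\]|\\.)*)"(?:@([A-Za-z0-9-]+))?\s*\.\s*$'
)


def parse_lines(lines):
    """Turn an iterable of lines into a DataFrame, skipping non-matching lines."""
    rows = []
    for line in lines:
        m = TRIPLE.match(line)
        if m is None:
            continue  # blank or unexpected line
        subject, predicate, text, lang = m.groups()
        rows.append(
            {"subject": subject, "predicate": predicate, "text": text, "lang": lang}
        )
    return pd.DataFrame(rows, columns=["subject", "predicate", "text", "lang"])


# Made-up sample lines; raw strings keep the backslashes as they would
# appear in the file.
sample = [
    r'<http://example.org/a> <http://example.org/comment> "Plain text."@en .',
    r'<http://example.org/b> <http://example.org/comment> "Has \"quoted\" words." .',
]
df = parse_lines(sample)
print(df)
```

The text column keeps the backslashes exactly as they appear in the input; unescaping them would be a separate step. To read from a file, pass the open file object to parse_lines in place of the list.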

Answer 2: (score: 0)

Am I missing something? Why not just use split()?

# -*- coding: utf-8 -*-
sample = """\
<string1> <string2> "some text as a paragraph" .
<string1> <string2> "some text as a paragraph" .
<string1> <string2> "some text as a paragraph" .
<string1> <string2> "some text as a paragraph" .""".splitlines()

sample = """\
<http://dbpedia.org/resource/Big_Sounds_of_the_Drags> <http://www.w3.org/2000/01/rdf-schema#comment> "Big Sounds of the Drags is the second album by electronic music producer Junkie XL.\"Check Your Basic Groove\" has an unusual introduction. This portion begins with the sounds of various farm animals (cows for example), then more layers of sound effects are added (including a supercar) until the song segues to the music."@en .
<http://dbpedia.org/resource/Sydney_Roosters_Juniors> <http://www.w3.org/2000/01/rdf-schema#comment> "The Sydney Roosters Juniors is officially known as the Eastern Suburbs District Junior Rugby League. It is an affiliation of junior clubs in the Eastern Suburbs area, covering the Woollahra and Waverley local government areas (LGAs), the northern parts of the Randwick LGA and also the eastern areas of the City of Sydney LGA."@en .
<http://dbpedia.org/resource/A_Shot_at_Glory> <http://www.w3.org/2000/01/rdf-schema#comment> "A Shot at Glory is a film by Michael Corrente produced in 1999 and released in 2001, starring Robert Duvall and the Scottish football player Ally McCoist. It had limited commercial and critical success. The film features the fictional Scottish football club Kilnockie, as they attempt to reach their first Scottish Cup Final. The final game is against Rangers."@en .
<http://dbpedia.org/resource/Kumar_Ponnambalam> <http://www.w3.org/2000/01/rdf-schema#comment> "Kumar Ponnambalam (August 12, 1940 – January 5, 2000) was a prominent defence lawyer and a controversial minority Tamil nationalist politician from Sri Lanka. He was shot dead by unknown gunmen immediately after a suspected LTTE suicide bomb attack against the then president Chandrika Kumaratunga."@en .
<http://dbpedia.org/resource/Amalia_Mendoza> <http://www.w3.org/2000/01/rdf-schema#comment> "Amalia Mendoza García (10 July 1923 – 11 June 2001), nicknamed La Tariácuri, was a Mexican singer and actress. \"Échame a mi la culpa\" and \"Amarga navidad\" were some of her greatest hits."@en .""".splitlines()

data = [s.split(None,2) for s in sample]

for d in data:
    print(d)

This gives:

['<string1>', '<string2>', '"some text as a paragraph" .']
['<string1>', '<string2>', '"some text as a paragraph" .']
['<string1>', '<string2>', '"some text as a paragraph" .']
['<string1>', '<string2>', '"some text as a paragraph" .']

['<http://dbpedia.org/resource/Big_Sounds_of_the_Drags>', '<http://www.w3.org/2000/01/rdf-schema#comment>', '"Big Sounds of the Drags is the second album by electronic music producer Junkie XL."Check Your Basic Groove" has an unusual introduction. This portion begins with the sounds of various farm animals (cows for example), then more layers of sound effects are added (including a supercar) until the song segues to the music."@en .']
['<http://dbpedia.org/resource/Sydney_Roosters_Juniors>', '<http://www.w3.org/2000/01/rdf-schema#comment>', '"The Sydney Roosters Juniors is officially known as the Eastern Suburbs District Junior Rugby League. It is an affiliation of junior clubs in the Eastern Suburbs area, covering the Woollahra and Waverley local government areas (LGAs), the northern parts of the Randwick LGA and also the eastern areas of the City of Sydney LGA."@en .']
['<http://dbpedia.org/resource/A_Shot_at_Glory>', '<http://www.w3.org/2000/01/rdf-schema#comment>', '"A Shot at Glory is a film by Michael Corrente produced in 1999 and released in 2001, starring Robert Duvall and the Scottish football player Ally McCoist. It had limited commercial and critical success. The film features the fictional Scottish football club Kilnockie, as they attempt to reach their first Scottish Cup Final. The final game is against Rangers."@en .']
['<http://dbpedia.org/resource/Kumar_Ponnambalam>', '<http://www.w3.org/2000/01/rdf-schema#comment>', '"Kumar Ponnambalam (August 12, 1940 \x96 January 5, 2000) was a prominent defence lawyer and a controversial minority Tamil nationalist politician from Sri Lanka. He was shot dead by unknown gunmen immediately after a suspected LTTE suicide bomb attack against the then president Chandrika Kumaratunga."@en .']
['<http://dbpedia.org/resource/Amalia_Mendoza>', '<http://www.w3.org/2000/01/rdf-schema#comment>', '"Amalia Mendoza Garc\xeda (10 July 1923 \x96 11 June 2001), nicknamed La Tari\xe1curi, was a Mexican singer and actress. "\xc9chame a mi la culpa" and "Amarga navidad" were some of her greatest hits."@en .']

To load the data from an input file, use:

with open('big_honking_file.dat') as sample:
    data = [s.split(None,2) for s in sample]

(This reads only one line at a time from the input file, so you do not end up with two copies of the whole dataset in memory, just one.)
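Since the question asks for a DataFrame, here is a rough sketch of how the split() result could be carried over into pandas. Note that with split(None, 2) the third piece still carries the trailing " ." and the newline, so the sketch strips them. The function name read_triples and the column names are hypothetical, and an in-memory sample stands in for the file.

```python
import io

import pandas as pd


def read_triples(lines):
    """Split each line into three fields and return them as a DataFrame."""
    rows = []
    for line in lines:
        parts = line.split(None, 2)  # split on whitespace, at most twice
        if len(parts) != 3:
            continue  # skip blank or short lines
        subject, predicate, rest = parts
        rest = rest.rstrip()
        if rest.endswith("."):
            rest = rest[:-1].rstrip()  # drop the terminating dot
        rows.append((subject, predicate, rest))
    return pd.DataFrame(rows, columns=["subject", "predicate", "object"])


# In-memory stand-in for the input file.
sample = io.StringIO(
    '<string1> <string2> "some text as a paragraph" .\n'
    '<string3> <string4> "another paragraph, with spaces"@en .\n'
)
df = read_triples(sample)
print(df)
```

The third column keeps the surrounding quotes and any @en suffix; whether to strip those depends on what the later column operations need.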

To make row-wise operations on this list easier, take a look at littletable (https://pypi.python.org/pypi/littletable/); it may be more lightweight than using pandas.

from littletable import Table
data = Table()
with open('sample.txt') as sample:
    data.insert_many(s.split(None,2) for s in sample)