How do I extract the lines of a text file that contain a matching phrase?

Asked: 2016-05-31 08:31:54

Tags: python python-2.7

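I have a text corpus made up of many lines of sentences. I want to extract the lines that contain any of my keywords.

I wrote a simple Python script, but I get no output at all.

My Python script: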

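corpus = []

# Load the search phrases, one per line, without the trailing newline.
with open('CatList2.text') as f:
    for line in f:
        corpus.append(line.rstrip())

# Append every line of Test.text that contains a phrase to the file 'Text'.
with open('Test.text') as f1:
    with open('Text', 'a') as f2:
        for line in f1.readlines():
            for phrase in corpus:
                if phrase in line:
                    f2.write(line)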

Here is a sample of wiki.en.text:

Alluvium (from the Latin, alluvius, from alluere, "to wash against") is loose, unconsolidated (not cemented together into a solid rock) soil or sediments, which has been eroded, reshaped by water in some form, and redeposited in a non-marine setting
Geoarchaeology is a multi-disciplinary approach which uses the techniques and subject matter of geography, geology and other Earth sciences to examine topics which inform archaeological knowledge and thought. Geoarchaeologists study the natural physical processes that affect archaeological sites such as geomorphology, the formation of sites through geological processes and the effects on buried sites and artifacts post-deposition. Geoarchaeologists' work frequently involves studying soil and sediments as well as other geographical concepts to contribute an archaeological study. Geoarchaeologists may also use computer cartography, geographic information systems (GIS) and digital elevation models (DEM) in combination with disciplines from human and social sciences and earth sciences.[1] Geoarchaeology is important to society because it informs archaeologists about the geomorphology of the soil, sediments and the rocks on the buried sites and artifacts they're researching on. By doing this we are able locate ancient cities and artifacts and estimate by the quality of soil how "prehistoric" they really are.
A Geopark is a unified area that advances the protection and use of geological heritage in a sustainable way, and promotes the economic well-being of the people who live there.[1] There are Global Geoparks and National Geoparks.
Spatial analysis or spatial statistics includes any of the formal techniques which study entities using their topological, geometric, or geographic properties. Spatial analysis includes a variety of techniques, many still in their early development, using different analytic approaches and applied in fields as diverse as astronomy, with its studies of the placement of galaxies in the cosmos, to chip fabrication engineering, with its use of "place and route" algorithms to build complex wiring structures. In a more restricted sense, spatial analysis is the technique applied to structures at the human scale, most notably in the analysis of geographic data.
Spatial mismatch is the mismatch between where low-income households reside and suitable job opportunities. In its original formulation (see below) and in subsequent research, it has mostly been understood as a phenomenon affecting African-Americans, as a result of residential segregation, economic restructuring, and the suburbanization of employment.
Distance decay is a geographical term which describes the effect of distance on cultural or spatial interactions. The distance decay effect states that the interaction between two locales declines as the distance between them increases. Once the distance is outside of the two locales' activity space, their interactions begin to decrease.
Cold is the presence of low temperature, especially in the atmosphere.[4] In common usage, cold is often a subjective perception. A lower bound to temperature is absolute zero, defined as 0.00 °K on the Kelvin scale, an absolute thermodynamic temperature scale. This corresponds to −273.15 °C on the Celsius scale, −459.67 °F on the Fahrenheit scale, and 0.00 °R on the Rankine scale.

My CatList contains my search phrases, like this:

Alluvium
Anatopism

The result I want is:

Alluvium (from the Latin, alluvius, from alluere, "to wash against") is loose, unconsolidated (not cemented together into a solid rock) soil or sediments, which has been eroded, reshaped by water in some form, and redeposited in a non-marine setting

because Alluvium is the only phrase in CatList that also appears in wiki.en.text.

I don't know why I can't get any result. Please help. Thanks.

Strangely, I got this error:

Traceback (most recent call last):
  File "JRTry.py", line 2, in <module>
    phrases = open("Test.text").readLines()
AttributeError: 'file' object has no attribute 'readLines'

I looked at the question "Error while using '<file>.readlines()' function" and changed my loop to for line in f1.readlines(): but it still gives me an error. Any ideas?

1 Answer:

Answer 0: (score: 0)

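The problem is that when you read the keywords from the file, you also get the newline character at the end of each one.

You can remove it with rstrip (see this SO post).

In the Python interpreter: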

>>> a = []
>>> with open("test") as f:
...     for line in f:
...         a.append(line)
... 
>>> a
['foo\n'] #see that there's a newline?

Instead, use:

a.append(line.rstrip()) #this removes the trailing newline (and any other trailing whitespace)
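Putting this together, here is a minimal sketch of the whole task rather than a definitive fix. The helper names load_phrases and extract_matching_lines are made up for illustration, the demo writes shortened sample data to temporary files, and the output file is opened in write mode rather than the append mode used in the question. It skips blank phrase lines, because an empty string is a substring of every line and would make every line match. It also iterates over the file object directly, which avoids readlines entirely; note that the method name is all lowercase, so readLines raises the AttributeError shown in the question.

```python
import io
import os
import tempfile


def load_phrases(path):
    """Read one search phrase per line, dropping newlines and blank lines."""
    with io.open(path, encoding="utf-8") as f:
        # strip() removes the newline; empty results are skipped because
        # '' in line is always True and would match everything.
        return [line.strip() for line in f if line.strip()]


def extract_matching_lines(phrases, src_path, dst_path):
    """Write each line of src_path that contains any phrase to dst_path."""
    matched = 0
    with io.open(src_path, encoding="utf-8") as src, \
            io.open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:  # iterate the file directly; no readlines() needed
            # any() stops at the first hit, so a line is written only once
            # even if several phrases occur in it.
            if any(phrase in line for phrase in phrases):
                dst.write(line)
                matched += 1
    return matched


# Demo with shortened, illustrative sample data in a temporary directory.
workdir = tempfile.mkdtemp()
phrase_file = os.path.join(workdir, "CatList2.text")
corpus_file = os.path.join(workdir, "Test.text")
out_file = os.path.join(workdir, "Text")

with io.open(phrase_file, "w", encoding="utf-8") as f:
    f.write(u"Alluvium\nAnatopism\n\n")  # note the trailing blank line
with io.open(corpus_file, "w", encoding="utf-8") as f:
    f.write(u"Alluvium is loose, unconsolidated soil or sediments.\n"
            u"A Geopark is a unified area.\n"
            u"Cold is the presence of low temperature.\n")

phrases = load_phrases(phrase_file)
count = extract_matching_lines(phrases, corpus_file, out_file)
with io.open(out_file, encoding="utf-8") as f:
    result = f.read()

print(count)
print(result)
```

The match is a case-sensitive substring test, as in the question's script, so "alluvium" in lowercase would not be found by the phrase "Alluvium".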