Question

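I'm fairly new to Python and have a question about how to extract and organize certain words from a text file. For example, I made a text file to illustrate: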

5.8 Sunny 01/23/2016 Seattle Washington
25.7 Cloudy 03/04/2016 Chicago Illinois
7 Snowy 12/20/2016 Tacoma Washington
3 Windy 04/5/2016 Los Angeles California

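In this case, I want to print only the date, the weather condition, and the state, ignoring the city and the number, and organize the output by state. How would I go about doing this?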

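My first thought was to use .split(' '), but I don't think that will work, because the last line has 6 words while the others have 5. I also thought about creating a set so that the data could be organized by state. I'm still a bit confused about the process. Thanks.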

Edit: This is what I have so far. It does return the specific words I want.

file = open('word.txt')
for line in file:
    weather = line.split(' ')[1]
    date = line.split(' ')[2]
    state = line.split(' ')[-1]


print(weather)
print(date)
print(state)

Edit 2: Here is my attempt at organizing the data. However, it doesn't quite work.

file = open('word.txt')
for line in file:
    weather = line.split(' ')[1]
    date = line.split(' ')[2]
    state = line.split(' ')[-1]


    setlist1 = []
    setlist2 = []

    if state == state:
        setlist2.append(state)
        setlist1.append(date)
        setlist1.append(weather)
        setlist2.append(setlist1)

    print(setlist2)

Answer 1

I would use a regular expression. You can use named groups, which let you access each captured field in a clear, concise way.

Here is an example:

test.py:

#!/usr/bin/env python3
import re

# One named group per field. Raw strings avoid invalid-escape warnings.
pattern = r'^(?P<value>[0-9.]+) '
pattern += r'(?P<weather>[a-zA-Z]+) '
pattern += r'(?P<date>[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}) '
pattern += r'(?P<location>[a-zA-Z ]+)$'
matches = []
regex = re.compile(pattern)

# Collect one match object per line of the input file.
with open('text', 'r') as fh:
    for line in fh:
        matches.append(regex.match(line))

Using the sample data:

$ charlie on macbook in ~
❯❯ cat text
5.8 Sunny 01/23/2016 Seattle Washington
25.7 Cloudy 03/04/2016 Chicago Illinois
7 Snowy 12/20/2016 Tacoma Washington
3 Windy 04/5/2016 Los Angeles California

When run interactively, you can see that it matches every test case.

$ charlie on macbook in ~
❯❯ python3 -i test.py
>>> for match in matches:
...   print(match.groups())
...
('5.8', 'Sunny', '01/23/2016', 'Seattle Washington')
('25.7', 'Cloudy', '03/04/2016', 'Chicago Illinois')
('7', 'Snowy', '12/20/2016', 'Tacoma Washington')
('3', 'Windy', '04/5/2016', 'Los Angeles California')
>>>
>>> for group in ('value', 'weather', 'date', 'location'):
...   print('match[{}]: {}'.format(group, matches[0].group(group)))
...
match[value]: 5.8
match[weather]: Sunny
match[date]: 01/23/2016
match[location]: Seattle Washington
>>>
>>> for group in ('value', 'weather', 'date', 'location'):
...   print('match[{}]: {}'.format(group, matches[1].group(group)))
...
match[value]: 25.7
match[weather]: Cloudy
match[date]: 03/04/2016
match[location]: Chicago Illinois
>>>
>>> for group in ('value', 'weather', 'date', 'location'):
...   print('match[{}]: {}'.format(group, matches[2].group(group)))
...
match[value]: 7
match[weather]: Snowy
match[date]: 12/20/2016
match[location]: Tacoma Washington
>>>
>>> for group in ('value', 'weather', 'date', 'location'):
...   print('match[{}]: {}'.format(group, matches[3].group(group)))
...
match[value]: 3
match[weather]: Windy
match[date]: 04/5/2016
match[location]: Los Angeles California
>>>
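
As a small aside, each match object can also be converted to a plain dictionary with groupdict(), which may be easier to pass around than the match itself. The sketch below assumes the same pattern as above; the names sample_lines and records are made up for illustration, and the lines are inlined so it runs without the text file.

```python
import re

# Same pattern as in test.py, written as adjacent raw strings.
regex = re.compile(
    r'^(?P<value>[0-9.]+) '
    r'(?P<weather>[a-zA-Z]+) '
    r'(?P<date>[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}) '
    r'(?P<location>[a-zA-Z ]+)$'
)

# Hypothetical inline sample so the sketch does not need a file.
sample_lines = [
    '5.8 Sunny 01/23/2016 Seattle Washington',
    '3 Windy 04/5/2016 Los Angeles California',
]

records = []
for line in sample_lines:
    match = regex.match(line)
    if match is not None:  # skip lines that do not fit the pattern
        records.append(match.groupdict())

for record in records:
    print(record['date'], record['weather'], record['location'])
```

The None check matters with real files: regex.match returns None for a blank or malformed line, and calling a method on it would raise an error.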

From here, you can easily organize the data however you want. Suppose you want to collect all the data from sunny days.

If we add more lines to the file to give it more data, and add a function that lets us print the data by group, we can do a better analysis:

~/text:

5.8 Sunny 01/23/2016 Seattle Washington
25.7 Cloudy 03/04/2016 Chicago Illinois
7 Snowy 12/20/2016 Tacoma Washington
3 Windy 04/5/2016 Los Angeles California
31.3 Sunny 04/25/2016 Chicago Illinois
1.3 Sunny 04/25/2016 Seattle Washington
13 Sunny 04/25/2016 Indianapolis Indiana
33 Sunny 04/25/2016 Buffalo New York
1.3 Sunny 04/5/2016 Chicago Illinois
3.3 Sunny 04/25/2016 Tacoma Washington
1.2 Sunny 07/5/2016 Madison Wisconsin
31 Sunny 08/25/2016 Milwaukee Wisconsin
35 Sunny 08/29/2016 Chicago Illinois
5.1 Sunny 11/2/2016 Chicago Illinois
4 Sunny 11/6/2016 Sanwich Illinois
9 Sunny 11/16/2016 Portland Oregons
7 Sunny 11/29/2016 Washington DC
3.2 Sunny 12/10/2016 St Louis Missouri
3.5 Sunny 12/25/2016 Flint Michigan
4.7 Sunny 12/29/2016 Detroit Michigan

~/test.py:

#!/usr/bin/env python3
import re

GROUPS = ('value', 'date', 'weather', 'location')

def print_data(matches, group):
    """Print every field except `group`, one centered column per field."""
    # sorted() fixes the column order (date, location, value when grouping
    # by weather); iterating a bare set would give an arbitrary order.
    local_groups = sorted(set(GROUPS) - {group})
    print('Group: {}'.format(group))
    print('-'*80)
    line_structure = '{0:^25}|{1:^25}|{2:^25}'
    for match in matches:
        data = [
            match.group(local_groups[0]),
            match.group(local_groups[1]),
            match.group(local_groups[2])
        ]
        print(line_structure.format(*data))

# Raw strings avoid invalid-escape warnings.
pattern = r'^(?P<value>[0-9.]+) '
pattern += r'(?P<weather>[a-zA-Z]+) '
pattern += r'(?P<date>[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}) '
pattern += r'(?P<location>[a-zA-Z ]+)$'
matches = []
regex = re.compile(pattern)

with open('text', 'r') as fh:
    for line in fh:
        matches.append(regex.match(line))

# Keep only the lines whose weather field is "Sunny" (case-insensitive).
sunny_matches = []
for match in matches:
    if match.group('weather').lower() == 'sunny':
        sunny_matches.append(match)

print('Printing sunny weather:')
print('{}\n'.format('='*50))
print_data(sunny_matches, 'weather')

If we run it, we get the following output:

Printing sunny weather:
==================================================

Group: weather
--------------------------------------------------------------------------------
       01/23/2016        |   Seattle Washington    |           5.8
       04/25/2016        |    Chicago Illinois     |          31.3
       04/25/2016        |   Seattle Washington    |           1.3
       04/25/2016        |  Indianapolis Indiana   |           13
       04/25/2016        |    Buffalo New York     |           33
        04/5/2016        |    Chicago Illinois     |           1.3
       04/25/2016        |    Tacoma Washington    |           3.3
        07/5/2016        |    Madison Wisconsin    |           1.2
       08/25/2016        |   Milwaukee Wisconsin   |           31
       08/29/2016        |    Chicago Illinois     |           35
        11/2/2016        |    Chicago Illinois     |           5.1
        11/6/2016        |    Sanwich Illinois     |            4
       11/16/2016        |    Portland Oregons     |            9
       11/29/2016        |      Washington DC      |            7
       12/10/2016        |    St Louis Missouri    |           3.2
       12/25/2016        |     Flint Michigan      |           3.5
       12/29/2016        |    Detroit Michigan     |           4.7
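
Since the question asks for output organized by state, the same match objects could also be ordered by state rather than filtered by weather. This is only a sketch: state_of is a hypothetical helper that assumes the state is the last word of the location, so a two-word state such as "New York" would come out as "York". The sample lines are inlined so it runs without the text file.

```python
import re

regex = re.compile(
    r'^(?P<value>[0-9.]+) (?P<weather>[a-zA-Z]+) '
    r'(?P<date>[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}) (?P<location>[a-zA-Z ]+)$'
)

# Hypothetical inline sample (the four lines from the question).
sample_lines = [
    '5.8 Sunny 01/23/2016 Seattle Washington',
    '25.7 Cloudy 03/04/2016 Chicago Illinois',
    '7 Snowy 12/20/2016 Tacoma Washington',
    '3 Windy 04/5/2016 Los Angeles California',
]

def state_of(match):
    """Hypothetical helper: treat the last word of the location as the state."""
    return match.group('location').split()[-1]

# Drop lines that did not match, then sort by state (sorted() is stable,
# so lines for the same state keep their file order).
matches = [m for m in map(regex.match, sample_lines) if m is not None]
rows = [
    (state_of(m), m.group('date'), m.group('weather'))
    for m in sorted(matches, key=state_of)
]

for state, date, weather in rows:
    print('{:<12} {:<12} {}'.format(state, date, weather))
```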

Answer 2

Instead of calling split three times, call it once and store the result in a variable:

file = open('word.txt')
for line in file:
    res = line.split()
    weather = res[1]
    date = res[2]
    state = res[-1]
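
To go one step further and organize the results by state, one option is to collect them in a dictionary keyed by state. The following is a rough sketch, not a definitive solution: by_state and sample_lines are names invented here, the lines are inlined instead of read from word.txt, and the state is assumed to be the single last word of each line.

```python
from collections import defaultdict

# Hypothetical inline sample so the sketch runs without word.txt.
sample_lines = [
    '5.8 Sunny 01/23/2016 Seattle Washington',
    '25.7 Cloudy 03/04/2016 Chicago Illinois',
    '7 Snowy 12/20/2016 Tacoma Washington',
    '3 Windy 04/5/2016 Los Angeles California',
]

# Maps each state to a list of (date, weather) pairs.
by_state = defaultdict(list)
for line in sample_lines:
    res = line.split()
    if len(res) < 5:  # skip blank or malformed lines
        continue
    weather, date, state = res[1], res[2], res[-1]
    by_state[state].append((date, weather))

# Print the states alphabetically, with their entries indented below.
for state in sorted(by_state):
    print(state)
    for date, weather in by_state[state]:
        print('   ', date, weather)
```

Indexing from the end with res[-1] is what makes the "Los Angeles" line work even though it has six words.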

Answer 3

You're on the right track. It may be easier to organize the data into individual dictionaries, one per line.

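import operator

# Pick the weather, date, and state fields (indexes 1, 2, and -1) in one call.
get_data = operator.itemgetter(1, 2, -1)
result = []
with open('file.txt') as f:
    for line in f:
        d = {}
        line = line.strip()
        if not line:  # skip blank lines
            continue
        # Split only once; calling .split() on the resulting list would fail.
        weather, date, state = get_data(line.split())
        d['weather'] = weather
        d['date'] = date
        d['state'] = state
        result.append(d)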

Or, if you want to keep the city, just split each line at most three times:

import operator
get_data = operator.itemgetter(1, 2, -1)
result = []
with open('file.txt') as f:
    for line in f:
        d = {}
        line = line.strip()
        # At most three splits: [value, weather, date, 'City State'].
        line = line.split(maxsplit=3)
        # Note that `city` holds the city and the state together.
        weather, date, city = get_data(line)
        d['weather'] = weather
        d['date'] = date
        d['city'] = city
        result.append(d)
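
To make the effect of maxsplit concrete, here is a short sketch on the six-word line from the question. The follow-up rsplit call is an addition of mine, not part of the answer above, and it assumes the state is a single word at the end of the line.

```python
line = '3 Windy 04/5/2016 Los Angeles California\n'

# Three splits from the left leave the whole location in the last item.
parts = line.strip().split(maxsplit=3)
print(parts)

# One split from the right separates the state from the city.
city, state = parts[-1].rsplit(maxsplit=1)
print(city, '|', state)
```

With this approach a multi-word city such as "Los Angeles" stays intact, although a multi-word state would still be cut at its last word.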

Organizing and printing certain words from a text file in Python
