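I'm fairly new to Python and have some questions about how to extract and organize certain words from a text file. For example, I made this text file to illustrate: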
5.8 Sunny 01/23/2016 Seattle Washington
25.7 Cloudy 03/04/2016 Chicago Illinois
7 Snowy 12/20/2016 Tacoma Washington
3 Windy 04/5/2016 Los Angeles California
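In this case, I want to print only the date, the weather condition, and the state, ignoring the city and the number, and I want the output organized by state. How would I do that?
My first thought was to use .split(' '), but I don't think that will work on its own, because the last line has 6 words while the others have 5. I also thought about creating a set so that I could organize by state. I'm still a bit confused about the process. Thanks.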
Edit: this is what I have so far. It does return the specific words I want.
file = open('word.txt')
for line in file:
    weather = line.split(' ')[1]
    date = line.split(' ')[2]
    state = line.split(' ')[-1]
    print(weather)
    print(date)
    print(state)
Edit 2: here is my attempt at the organizing step. It doesn't quite work, though.
file = open('word.txt')
for line in file:
    weather = line.split(' ')[1]
    date = line.split(' ')[2]
    state = line.split(' ')[-1]
    setlist1 = []
    setlist2 = []
    if state == state:
        setlist2.append(state)
        setlist1.append(date)
        setlist1.append(weather)
        setlist2.append(setlist1)
    print(setlist2)
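Answer 0 (score: 1)
I would use a regular expression. With named groups, you can access each captured field by name, which keeps the code clear and concise.
Here is an example: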
#!/usr/bin/env python3
import re

# Raw strings keep the backslash-free character classes unambiguous.
pattern = r'^(?P<value>[0-9.]+) '
pattern += r'(?P<weather>[a-zA-Z]+) '
pattern += r'(?P<date>[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}) '
pattern += r'(?P<location>[a-zA-Z ]+)$'

matches = []
regex = re.compile(pattern)
with open('text', 'r') as fh:
    for line in fh:
        # One match object per line; location holds city and state together.
        matches.append(regex.match(line))
With the sample data:
$ charlie on macbook in ~
❯❯ cat text
5.8 Sunny 01/23/2016 Seattle Washington
25.7 Cloudy 03/04/2016 Chicago Illinois
7 Snowy 12/20/2016 Tacoma Washington
3 Windy 04/5/2016 Los Angeles California
When you run it interactively, you can see that it matches every test case.
$ charlie on macbook in ~
❯❯ python3 -i test.py
>>> for match in matches:
...     print(match.groups())
...
('5.8', 'Sunny', '01/23/2016', 'Seattle Washington')
('25.7', 'Cloudy', '03/04/2016', 'Chicago Illinois')
('7', 'Snowy', '12/20/2016', 'Tacoma Washington')
('3', 'Windy', '04/5/2016', 'Los Angeles California')
>>>
>>> for group in ('value', 'weather', 'date', 'location'):
...     print('match[{}]: {}'.format(group, matches[0].group(group)))
...
match[value]: 5.8
match[weather]: Sunny
match[date]: 01/23/2016
match[location]: Seattle Washington
>>>
>>> for group in ('value', 'weather', 'date', 'location'):
...     print('match[{}]: {}'.format(group, matches[1].group(group)))
...
match[value]: 25.7
match[weather]: Cloudy
match[date]: 03/04/2016
match[location]: Chicago Illinois
>>>
>>> for group in ('value', 'weather', 'date', 'location'):
...     print('match[{}]: {}'.format(group, matches[2].group(group)))
...
match[value]: 7
match[weather]: Snowy
match[date]: 12/20/2016
match[location]: Tacoma Washington
>>>
>>> for group in ('value', 'weather', 'date', 'location'):
...     print('match[{}]: {}'.format(group, matches[3].group(group)))
...
match[value]: 3
match[weather]: Windy
match[date]: 04/5/2016
match[location]: Los Angeles California
>>>
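As a possible shortcut for the per-group loops above, a match object's groupdict() method returns all named groups at once as a dictionary. This is only a minimal sketch: it assumes the same pattern, uses one sample line inline instead of reading the file, and the names line and record are made up for illustration.

```python
import re

# Same named-group pattern as in the answer, written as raw strings.
pattern = re.compile(
    r'^(?P<value>[0-9.]+) '
    r'(?P<weather>[a-zA-Z]+) '
    r'(?P<date>[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}) '
    r'(?P<location>[a-zA-Z ]+)$'
)

# One line from the sample data (an inline stand-in for the file).
line = '3 Windy 04/5/2016 Los Angeles California\n'

match = pattern.match(line)
# match() returns None when a line does not fit the pattern, so check first.
record = match.groupdict() if match else None
print(record)
```

Checking for None before calling groupdict() matters if the file might contain blank or malformed lines.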
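From here, you can easily organize the data however you like. Suppose you want to collect all the data for sunny days.
If we add more lines to the file to give it more data, and add a function that prints the data by group, we can do a better analysis: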
5.8 Sunny 01/23/2016 Seattle Washington
25.7 Cloudy 03/04/2016 Chicago Illinois
7 Snowy 12/20/2016 Tacoma Washington
3 Windy 04/5/2016 Los Angeles California
31.3 Sunny 04/25/2016 Chicago Illinois
1.3 Sunny 04/25/2016 Seattle Washington
13 Sunny 04/25/2016 Indianapolis Indiana
33 Sunny 04/25/2016 Buffalo New York
1.3 Sunny 04/5/2016 Chicago Illinois
3.3 Sunny 04/25/2016 Tacoma Washington
1.2 Sunny 07/5/2016 Madison Wisconsin
31 Sunny 08/25/2016 Milwaukee Wisconsin
35 Sunny 08/29/2016 Chicago Illinois
5.1 Sunny 11/2/2016 Chicago Illinois
4 Sunny 11/6/2016 Sanwich Illinois
9 Sunny 11/16/2016 Portland Oregons
7 Sunny 11/29/2016 Washington DC
3.2 Sunny 12/10/2016 St Louis Missouri
3.5 Sunny 12/25/2016 Flint Michigan
4.7 Sunny 12/29/2016 Detroit Michigan
#!/usr/bin/env python3
import re

GROUPS = ('value', 'date', 'weather', 'location')

def print_data(matches, group):
    # Print every field except the one the data is grouped by.
    # sorted() keeps the column order stable (set order is not guaranteed).
    local_groups = sorted(set(GROUPS) - {group})
    print('Group: {}'.format(group))
    print('-' * 80)
    line_structure = '{0:^25}|{1:^25}|{2:^25}'
    for match in matches:
        data = [
            match.group(local_groups[0]),
            match.group(local_groups[1]),
            match.group(local_groups[2])
        ]
        print(line_structure.format(*data))

pattern = r'^(?P<value>[0-9.]+) '
pattern += r'(?P<weather>[a-zA-Z]+) '
pattern += r'(?P<date>[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}) '
pattern += r'(?P<location>[a-zA-Z ]+)$'

matches = []
regex = re.compile(pattern)
with open('text', 'r') as fh:
    for line in fh:
        matches.append(regex.match(line))

# Keep only the lines whose weather field is "Sunny" (case-insensitive).
sunny_matches = []
for match in matches:
    if match.group('weather').lower() == 'sunny':
        sunny_matches.append(match)

print('Printing sunny weather:')
print('{}\n'.format('=' * 50))
print_data(sunny_matches, 'weather')
If we run it, we get the following output:
Printing sunny weather:
==================================================
Group: weather
--------------------------------------------------------------------------------
01/23/2016 | Seattle Washington | 5.8
04/25/2016 | Chicago Illinois | 31.3
04/25/2016 | Seattle Washington | 1.3
04/25/2016 | Indianapolis Indiana | 13
04/25/2016 | Buffalo New York | 33
04/5/2016 | Chicago Illinois | 1.3
04/25/2016 | Tacoma Washington | 3.3
07/5/2016 | Madison Wisconsin | 1.2
08/25/2016 | Milwaukee Wisconsin | 31
08/29/2016 | Chicago Illinois | 35
11/2/2016 | Chicago Illinois | 5.1
11/6/2016 | Sanwich Illinois | 4
11/16/2016 | Portland Oregons | 9
11/29/2016 | Washington DC | 7
12/10/2016 | St Louis Missouri | 3.2
12/25/2016 | Flint Michigan | 3.5
12/29/2016 | Detroit Michigan | 4.7
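The same idea extends beyond sunny days: instead of filtering for one value, the matches could be bucketed by any field. The sketch below groups by weather with collections.defaultdict. It is an illustration under assumptions, not part of the script above: the sample lines are inlined rather than read from the file, and by_weather is a made-up name.

```python
import re
from collections import defaultdict

pattern = re.compile(
    r'^(?P<value>[0-9.]+) '
    r'(?P<weather>[a-zA-Z]+) '
    r'(?P<date>[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}) '
    r'(?P<location>[a-zA-Z ]+)$'
)

# A few lines from the sample data, inlined for the sketch.
lines = [
    '5.8 Sunny 01/23/2016 Seattle Washington',
    '25.7 Cloudy 03/04/2016 Chicago Illinois',
    '7 Snowy 12/20/2016 Tacoma Washington',
    '31.3 Sunny 04/25/2016 Chicago Illinois',
]

# Map each weather condition (lowercased) to the records that have it.
by_weather = defaultdict(list)
for line in lines:
    match = pattern.match(line)
    if match is None:
        continue  # skip lines that do not fit the pattern
    by_weather[match.group('weather').lower()].append(match.groupdict())

for weather, records in sorted(by_weather.items()):
    print(weather, [r['date'] for r in records])
```

Grouping by state this way would need the state separated from the city first, since the location group captures both together.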
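Answer 1 (score: 0)
Instead of calling split three times, call it once and store the result in a variable:
file = open('word.txt')
for line in file:
    # Split once; with no argument, split() also drops the trailing newline.
    res = line.split()
    weather = res[1]
    date = res[2]
    state = res[-1]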
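One detail that may be worth seeing in isolation: split() with no argument splits on any run of whitespace and discards the trailing newline, whereas split(' ') leaves the newline attached to the last word. A small sketch using one sample line inline (the name with_space is made up for illustration):

```python
# One line as it would be read from the file, newline included.
line = '5.8 Sunny 01/23/2016 Seattle Washington\n'

# split(' ') splits only on single spaces, so the newline stays on the last item.
with_space = line.split(' ')
# split() with no argument splits on any whitespace and drops the newline.
res = line.split()

weather, date, state = res[1], res[2], res[-1]
print(repr(with_space[-1]))
print(repr(state))
```

This is why comparing or grouping on the state is more reliable after a plain split() than after split(' ').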
Answer 2 (score: 0)
You're on the right track. It may be easier to organize the data into one dictionary per line.
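import operator

# Pick the weather (index 1), date (index 2), and state (last item) from a split line.
get_data = operator.itemgetter(1, 2, -1)
result = []
with open('file.txt') as f:
    for line in f:
        d = {}
        line = line.strip()
        # Split once here; splitting an already-split list would raise AttributeError.
        weather, date, state = get_data(line.split())
        d['weather'] = weather
        d['date'] = date
        d['state'] = state
        result.append(d)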
Or, if you want to keep the city, split each line at most three times (maxsplit=3):
import operator

get_data = operator.itemgetter(1, 2, -1)
result = []
with open('file.txt') as f:
    for line in f:
        d = {}
        line = line.strip()
        line = line.split(maxsplit=3)
        # The last item is everything after the date, e.g. 'Seattle Washington'.
        weather, date, city = get_data(line)
        d['weather'] = weather
        d['date'] = date
        d['city'] = city
        result.append(d)
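To get from per-line data to the grouping by state that the question asks for, one option is collections.defaultdict. This is a sketch under stated assumptions: the sample lines are inlined instead of read from file.txt, the name by_state is made up for illustration, and the state is taken to be the last word of each line, which would mislabel multi-word states such as New York.

```python
import operator
from collections import defaultdict

# Weather (index 1), date (index 2), and the last word, assumed to be the state.
get_data = operator.itemgetter(1, 2, -1)

# The sample lines from the question, inlined for the sketch.
lines = [
    '5.8 Sunny 01/23/2016 Seattle Washington',
    '25.7 Cloudy 03/04/2016 Chicago Illinois',
    '7 Snowy 12/20/2016 Tacoma Washington',
    '3 Windy 04/5/2016 Los Angeles California',
]

# Map each state to the list of date/weather entries recorded for it.
by_state = defaultdict(list)
for line in lines:
    weather, date, state = get_data(line.split())
    by_state[state].append({'date': date, 'weather': weather})

for state in sorted(by_state):
    print(state)
    for entry in by_state[state]:
        print('   ', entry['date'], entry['weather'])
```

A dictionary keyed by state avoids the comparison logic attempted in the question: each line is simply appended to the list for its own state.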