Text file 1 has the following format:
'WORD': 1
'MULTIPLE WORDS': 1
'WORD': 2
etc.
That is, each line is a word (or several words) followed by a colon and a number.
Text file 2 has the following format:
'WORD'
'WORD'
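etc.
I need to extract the single words from file 1 (i.e., only WORD, not MULTIPLE WORDS) and, if they match a word in file 2, return the word from file 1 together with its value.
My code functions poorly: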
import re

def GetCounts(file1, file2):
    target_contents = open(file1).readlines()  # file 1 as list --> 'WORD': n
    match_me_contents = open(file2).readlines()  # file 2 as list --> 'WORD'
    ls_stripped = [x.strip('\n') for x in match_me_contents]  # get rid of newlines
    match_me_as_regex = re.compile("|".join(ls_stripped))
    for line in target_contents:
        first_column = line.split(':')[0]  # get the first item in line.split
        number = line.split(':')[1]  # get the number associated with the word
        if len(first_column.split()) == 1:  # get single word, no multiple words
            # Does the word from target contents match the word
            # from match_me contents? If so, return the line from
            # target_contents
            if re.findall(match_me_as_regex, first_column):
                print first_column, number
# OUTPUT: WORD, n
# WORD, n
# etc.
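Because of the regex, the output is unreliable. For example, the code will return 'asset, 2', because re.findall() matches the 'set' from match_me inside it. I need to match target_word against whole words in match_me, to stop the wrong output caused by partial regex matches.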
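The partial-match problem described above can be reproduced in a few lines. This is only a sketch: the word list is made-up sample data, and loose_hit and strict_hit are hypothetical helper names, not part of the code above. The anchored variant is one possible way to force whole-word matching while staying with regular expressions.

```python
import re

# Hypothetical sample data standing in for the contents of file 2.
match_words = ["set", "word"]

# Unanchored alternation, as in the code above: matches anywhere in the string.
loose = re.compile("|".join(match_words))

# Anchored alternation: the whole string must be exactly one of the words.
# re.escape guards against regex metacharacters inside the words.
strict = re.compile(
    r"^(?:" + "|".join(re.escape(w) for w in match_words) + r")\Z"
)


def loose_hit(candidate):
    """Return True if any listed word occurs anywhere inside candidate."""
    return bool(loose.findall(candidate))


def strict_hit(candidate):
    """Return True only if candidate is exactly one of the listed words."""
    return strict.match(candidate) is not None
```

With this data, loose_hit("asset") is true because 'set' occurs inside 'asset', while strict_hit("asset") is false and strict_hit("set") is true.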
Answer 0 (score: 2)
If file2 is not very large, slurp its words into a set:
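# Strip the quotes on both sides so that 'WORD' in file1 compares equal to 'WORD' in file2.
file2 = set(word.strip("'") for word in open("file2").read().split())
for line in open("file1"):
    if line.split(":")[0].strip("'") in file2:
        print line,  # trailing comma: the line already ends with a newline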
Answer 1 (score: 1)
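I assume that by "functions poorly" you mean speed-wise? Because I tested it and it does work.
You can make it more efficient by building a set of the words in file2: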
word_set = set(ls_stripped)
Then, instead of calling findall, you check whether the word is in the set:
in_set = just_word in word_set
It is also cleaner than the regex.
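A minimal sketch of the set-based check, to show that it also rules out partial matches. The sample pairs and the function name whole_word_matches are invented for illustration, and the words are assumed to be already stripped of newlines and quotes.

```python
def whole_word_matches(pairs, wanted):
    """Return the (word, count) pairs whose word is exactly in wanted."""
    wanted_set = set(wanted)  # average O(1) membership test per word
    return [(word, count) for word, count in pairs if word in wanted_set]


# Hypothetical sample data: (word, count) pairs from file 1, words from file 2.
pairs = [("asset", 2), ("set", 5), ("word", 1)]
wanted = ["set", "word"]

result = whole_word_matches(pairs, wanted)
```

Here 'asset' is dropped even though it contains 'set', because set membership compares whole strings.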
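Answer 2 (score: 1)
This looks like it may simply be a special case of grep. If file2 is essentially a list of patterns, and the output format is the same as that of file1, then you can do: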
grep -wf file2 file1
-w tells grep to match whole words only.
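Answer 3 (score: 0)
Here is how I would do it. I don't have a Python interpreter at hand, so there may be a few typos.
One of the main things to remember when coming to Python (especially from Perl) is that regular expressions are usually a bad idea: string methods are powerful and very fast.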
def GetCounts(file1, file2):
    data = {}
    for line in open(file1):
        try:
            word, n = line.rsplit(':', 1)
            n = int(n.strip())
        except ValueError:  # no colon on the line, or the count is not a number
            # some kind of input error, go to next line
            continue
        if word[0] == word[-1] == "'":
            word = word[1:-1]
        data[word] = n
    for line in open(file2):
        word = line.strip()
        if not word:  # skip blank lines, which would break word[0] below
            continue
        if word[0] == word[-1] == "'":
            word = word[1:-1]
        if word in data:
            print word, data[word]
Answer 4 (score: 0)
import re
from operator import methodcaller

# file1 and file2 hold the paths of the two input files.
re_target = re.compile(r"^'([a-z]+)': +(\d+)", re.M|re.I)
match_me_contents = open(file2).read().splitlines()
match_me_contents = set(map(methodcaller('strip', "'"), match_me_contents))
res = []
for match in re_target.finditer(open(file1).read()):
    word, value = match.groups()
    if word in match_me_contents:
        res.append((word, value))
Answer 5 (score: 0)
My two input files:
file1.txt:
'WORD': 1
'MULTIPLE WORDS': 1
'OTHER': 2
file2.txt:
'WORD'
'NONEXISTENT'
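If file2.txt is guaranteed not to have multiple words on a line, there is no need to explicitly filter them out of the first file. The membership test takes care of that: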
# Build a set of what words we can return a count for.
with open('file2.txt', 'r') as f:
    allowed_words = set(word.strip() for word in f)
# See which of them exist in the first file.
with open('file1.txt', 'r') as f:
    for line in f:
        word, count = line.strip().split(':')
        # This assumes that strings with a space (multiple words) do not exist in
        # the second file.
        if word in allowed_words:
            print word, count.strip()
Running it gives:
$ python extract.py
'WORD' 1
If file2.txt may contain multiple words, simply modify the test in the loop:
# Build a set of what words we can return a count for.
with open('file2.txt', 'r') as f:
    allowed_words = set(word.strip() for word in f)
# See which of them exist in the first file.
with open('file1.txt', 'r') as f:
    for line in f:
        word, count = line.strip().split(':')
        # This prevents multiple words from being selected.
        if word in allowed_words and ' ' not in word:
            print word, count.strip()
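Note that I did not bother stripping the quotes from the words. I am not sure whether that is necessary; it depends on whether the input is guaranteed to have them. Adding that would be trivial.
Another thing you should consider is case sensitivity. If lowercase and uppercase words should be treated as the same, convert all input to uppercase (or lowercase, it does not matter which) before doing any tests.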
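A possible sketch of both adjustments mentioned here, quote stripping and case folding, applied to lines that have already been read from the two files. The helper names normalize and matching_counts and the sample lines are made up for illustration, and the counts are assumed to be integers.

```python
def normalize(token):
    """Strip surrounding whitespace and single quotes, then lowercase."""
    return token.strip().strip("'").lower()


def matching_counts(file1_lines, file2_lines):
    """Return (word, count) for single words of file 1 that occur in file 2."""
    allowed = set(normalize(line) for line in file2_lines if line.strip())
    results = []
    for line in file1_lines:
        # Split on the last colon; sep is empty when the line has no colon.
        word, sep, count = line.rpartition(":")
        if not sep:
            continue
        key = normalize(word)
        if " " not in key and key in allowed:
            results.append((key, int(count)))  # int() tolerates whitespace
    return results


# Hypothetical sample lines, mixing upper and lower case on purpose.
file1_lines = ["'WORD': 1\n", "'MULTIPLE WORDS': 1\n", "'Other': 2\n"]
file2_lines = ["'word'\n", "'OTHER'\n", "'NONEXISTENT'\n"]

result = matching_counts(file1_lines, file2_lines)
```

Normalizing both files with the same function keeps the comparison symmetric, so it does not matter which file carries the quotes or the capitals.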
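Edit: it would probably be more efficient to remove the multiple-word entries from the set of allowed words, rather than checking every line of file1: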
# Build a set of what words we can return a count for.
with open('file2.txt', 'r') as f:
    allowed_words = set(word.strip() for word in f if ' ' not in word)
# See which of them exist in the first file.
with open('file1.txt', 'r') as f:
    for line in f:
        word, count = line.strip().split(':')
        # Check if the word is allowed.
        if word in allowed_words:
            print word, count.strip()
Answer 6 (score: 0)
Here is what I came up with:
def GetCounts(file1, file2):
    target_contents = open(file1).readlines()  # file 1 as list --> 'WORD': n
    match_me_contents = set(open(file2).read().split('\n'))  # file 2 as set --> 'WORD'
    for line in target_contents:
        word = line.split(': ')[0]  # get the first item in line.split
        if " " not in word:
            number = line.split(': ')[1]  # get the number associated with the word
            if word in match_me_contents:
                print word, number.strip()
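Edit: moved
number = line.split(': ')[1]
to after the check for whether the word contains a space, for efficiency, although the speed difference is minimal (almost certainly most of the time is spent checking whether the word is in the target set). Potential errors would only occur if the actual input is not in the format you provided.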
Answer 7 (score: 0)
Let's exploit the similarity between the file format and Python expression syntax:
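from ast import literal_eval

# Join the lines of file1 into one dict literal and parse it safely.
with open("file1") as f:
    word_values = literal_eval('{' + ','.join(line for line in f) + '}')
# Each line of file2 is a quoted string literal.
with open("file2") as f:
    expected_words = set(literal_eval(line) for line in f)
# Keep only the words that also appear in file2.
word_values = {k: v for (k, v) in word_values.items() if k in expected_words}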
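To try the idea without touching the file system, here is a small sketch that runs the same literal_eval trick on in-memory lines. The sample lines and the helper names parse_counts and parse_words are assumptions for illustration; the extra check for a space is added to honor the "single words only" requirement from the question.

```python
from ast import literal_eval

# Hypothetical sample lines in the two formats from the question.
file1_lines = ["'WORD': 1\n", "'MULTIPLE WORDS': 1\n", "'OTHER': 2\n"]
file2_lines = ["'WORD'\n", "'NONEXISTENT'\n"]


def parse_counts(lines):
    """Parse lines of the form 'WORD': n into a dict via one dict literal."""
    body = ",".join(line.strip() for line in lines if line.strip())
    return literal_eval("{" + body + "}")


def parse_words(lines):
    """Parse lines of the form 'WORD' into a set of unquoted strings."""
    return set(literal_eval(line.strip()) for line in lines if line.strip())


counts = parse_counts(file1_lines)
expected = parse_words(file2_lines)

# Keep single words (no space) that also appear in the second file.
selected = {k: v for k, v in counts.items() if k in expected and " " not in k}
```

One thing to keep in mind: because the result is a dict, a word that appears more than once in file 1 keeps only its last count.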