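I'm trying to build something similar to Google's text snippets. A Google snippet contains the highlighted keyword, and if the keyword does not appear near the beginning of the analyzed string, the text is nicely "shifted" so that the keyword stays in view.
For example:
Keyword: "nike"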
haystack string“lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor难怪耐克是最大的品牌之一在世界上不是lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor
should become this snippet:
... lorem ipsum dorlor it is no wonder that nike is one of the largest brands in the world is not lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ...
This is what I came up with:
keywordPosition = haystack.lower().index(keyword.lower())
snippetStart = keywordPosition - 100
snippetEnd = keywordPosition + 200
haystack = " ..." + haystack[snippetStart:snippetEnd] + " ..."
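Is there an elegant way in Python to adjust snippetStart and snippetEnd dynamically? In many cases the approach above will obviously raise an exception, because the haystack slice indices are out of range.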
Answer 0 (score: 2)
I created a small example with comments here.
http://pythonfiddle.com/google-snippet
haystack = "lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor it is no wonder that nike is one of the largest brands in the world is not lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor"
needle = "nike" # A needle that is not a token of the haystack (e.g. "nike342") yields an empty snippet
lookahead = 7 # Number of tokens to show before "nike"
tokens = haystack.split(" ") # Split string into a list of tokens
found_index = -1 # Index of the matching token. Initialize to -1 and assume it doesn't exist.
# Look the needle up among the tokens. tokens.index() raises ValueError when the needle is absent, so the lookup is done inside the try block.
try:
    found_index = tokens.index(needle)
    # Get the max of the found index minus the number of words to show before the needle, and 0
    found_index = max(found_index - lookahead, 0)
    # Create a sub list of the tokens from found_index to the end, then join those tokens back together with a space.
    snippet = " ".join(tokens[found_index:])
except ValueError:
    snippet = "" # No snippet or whatever error handling you are going to do
print(snippet)
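If you would rather stay with the character offsets from the question, the same max/min clamping idea can be applied there as well. The following is only a minimal sketch, not a tested drop-in: the function name make_snippet and the parameters before and after are made up for illustration, and the ellipsis is added only on a side that was actually cut off.

```python
def make_snippet(haystack, keyword, before=100, after=200):
    """Return a window of text around the first case-insensitive match.

    make_snippet, before and after are hypothetical names for this sketch.
    """
    # str.find returns -1 instead of raising when the keyword is absent.
    pos = haystack.lower().find(keyword.lower())
    if pos == -1:
        return ""
    # Clamp both ends so the window never leaves the string.
    start = max(pos - before, 0)
    end = min(pos + len(keyword) + after, len(haystack))
    # Only mark a side with an ellipsis if text was really dropped there.
    prefix = "... " if start > 0 else ""
    suffix = " ..." if end < len(haystack) else ""
    return prefix + haystack[start:end] + suffix


if __name__ == "__main__":
    text = "a" * 300 + " nike " + "b" * 300
    print(make_snippet(text, "NIKE", before=10, after=10))
```

Note that a Python slice with out-of-range indices does not raise by itself. The real hazards in the original attempt are that index() raises ValueError when the keyword is missing, and that a negative snippetStart counts from the end of the string, which silently produces the wrong window. Clamping with max() avoids the second problem.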