I have the indices of several words or terms extracted from a text file, for example:
position = 156
Text blob:
1 section react
The following serious adverse reactions are discussed in greater detail in other sections of the prescribing information:
* Peripheral Neuropathy [see Warnings and Precautions ( 5.1 ) ]
* Anaphylaxis and Infusion Reactions [see Warnings and Precautions ( 5.2 ) ]
* Hematologic Toxicities [see Warnings and Precautions ( 5.3 ) ]
* Serious Infections and Opportunistic Infections [see Warnings and Precautions ( 5.4 ) ]
* Tumor Lysis Syndrome [see Warnings and Precautions ( 5.5 ) ]
* Increased Toxicity in the Presence of Severe Renal Impairment [see Warnings and Precautions ( 5.6 ) ]
* Increased Toxicity in the Presence of Moderate or Severe Hepatic Impairment [see Warnings and Precautions ( 5.7 ) ]
The word is:
Peripheral Neuropathy
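So my questions are:
A) Given a position, how can I extract the sentence? For example:
in:
position = 156
out:
Peripheral Neuropathy .... <the full sentence>
B) Given a position, how can I extract the word at that position together with a window of 100 words?
So far, I have tried: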
content[5:-1]
and
def extract(start, text):
    return text[start:200+start]
extract(5,content)
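However, since I am using -1, it returns the whole text to me. Is there another way to accomplish this task?
Note that content is a list containing the text I am working with.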
Answer 0: (score: 1)
words = sum(map(str.split, content), [])
sentence = ' '.join(words[position-1:]).split('.')[0] + '.'
words = sum(map(str.split, content), [])
hundredtokens = ' '.join(words[position-1:position+100]) + '.'
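The two one-liners above can be wrapped into small helpers. This is only a minimal sketch: it assumes position is a 1-based word index (as the position-1 slice suggests) and that content is a list of strings. The names flatten_words, sentence_from and window_from are made up for illustration.

```python
from itertools import chain

def flatten_words(content):
    """Split every string in content on whitespace and flatten into one list."""
    return list(chain.from_iterable(s.split() for s in content))

def sentence_from(content, position):
    """Words from the 1-based word index up to the first full stop."""
    words = flatten_words(content)
    return ' '.join(words[position - 1:]).split('.')[0] + '.'

def window_from(content, position, size=100):
    """The word at the 1-based index plus up to size - 1 following words."""
    words = flatten_words(content)
    return ' '.join(words[position - 1:position - 1 + size])

content = ["one two three. four", "five six."]
print(sentence_from(content, 2))        # two three.
print(window_from(content, 4, size=2))  # four five
```

Unlike the snippet above, window_from returns at most size words and does not append a full stop. chain.from_iterable is used instead of sum(..., []) because it avoids repeatedly copying the growing list.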
Answer 1: (score: 1)
Another solution is to use a regular expression.
>>> mystr = """The effort, led by Shoukhrat Mitalipov of Oregon Health and Science University, involved changing the DNA of a large number of one-cell embryos with the gene-editing technique CRISPR, according to people familiar with the scientific results. Until now, American scientists have watched with a combination of awe, envy, and some alarm as scientists elsewhere were first to explore the controversial practice. To date, three previous reports of editing human embryos were all published by scientists in China."""
>>> import re
>>> match = re.search(r'^(?:\S+\s+){5}([^.]*\.)', mystr)
>>> match.group(1)
'Mitalipov of Oregon Health and Science University, involved changing the DNA of a large number of one-cell embryos with the gene-editing technique CRISPR, according to people familiar with the scientific results.'
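The number of words to skip can be made a parameter. The following is a rough sketch; the function name sentence_after and the sample string are assumptions for illustration only.

```python
import re

def sentence_after(text, n_skip):
    """Skip n_skip whitespace-separated words, then return the text up to the next full stop."""
    pattern = r'^(?:\S+\s+){%d}([^.]*\.)' % n_skip
    m = re.search(pattern, text)
    # re.search returns None when there are too few words or no full stop
    return m.group(1) if m else None

sample = "Alpha beta gamma delta. Epsilon zeta."
print(sentence_after(sample, 2))   # gamma delta.
print(sentence_after(sample, 10))  # None
```

Note that [^.]*\. stops at the first full stop of any kind, so in text such as the section numbers "( 5.1 )" from the question the match would be cut off inside the number.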
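Assuming what you have is a list of the words in the string, here is another solution:
newstr = ""
words = mystr.split(' ')
word_iter = iter(words[5:])
# each word is appended with a trailing space, so test for '. ' rather than '.'
while not newstr.endswith('. '):
    newstr += next(word_iter) + ' '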
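If no word after the starting point ends in a full stop, next(word_iter) eventually raises StopIteration. A for loop sidesteps that by simply stopping at the end of the list. This is a sketch under that assumption; words_until_period is a hypothetical helper name.

```python
def words_until_period(words, start):
    """Join words[start:] up to and including the first word that ends in '.'."""
    picked = []
    for word in words[start:]:
        picked.append(word)
        if word.endswith('.'):
            break
    return ' '.join(picked)

words = "a b c. d e".split(' ')
print(words_until_period(words, 1))  # b c.
print(words_until_period(words, 3))  # d e
```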
Haha, going by my understanding of the text in your post, here is yet another solution. I used this as the input:
mystr = """* Peripheral Neuropathy [see Warnings and Precautions ( 5.1 ) ]
* Anaphylaxis and Infusion Reactions [see Warnings and Precautions ( 5.2 ) ]
* Hematologic Toxicities [see Warnings and Precautions ( 5.3 ) ]
* Serious Infections and Opportunistic Infections [see Warnings and Precautions ( 5.4 ) ]
* Tumor Lysis Syndrome [see Warnings and Precautions ( 5.5 ) ]
* Increased Toxicity in the Presence of Severe Renal Impairment [see Warnings and Precautions ( 5.6 ) ]
* Increased Toxicity in the Presence of Moderate or Severe Hepatic Impairment [see Warnings and Precautions ( 5.7 ) ]
"""
First, we use a regular expression to get the fifth word in the string.
target_word = re.findall(r'\w+', mystr)[4]
Then we get its index in the string:
word_index = mystr.index(target_word)
Then we create an iterator over the characters from that index onward:
word_iter = iter(mystr[word_index:])
Then loop until the end of the line:
newstr = ""
while not newstr.endswith('\n'):
    newstr += next(word_iter)
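The character-by-character loop can also be replaced by slicing between newline positions. The sketch below assumes the position is a character offset into a single string, which the question does not confirm; rest_of_line and line_at are hypothetical helper names.

```python
def rest_of_line(text, start):
    """Return text from start up to, but not including, the next newline."""
    end = text.find('\n', start)
    if end == -1:          # no newline after start: take everything that is left
        end = len(text)
    return text[start:end]

def line_at(text, pos):
    """Return the whole line that contains character offset pos."""
    begin = text.rfind('\n', 0, pos) + 1   # rfind gives -1 on the first line, so begin is 0
    return rest_of_line(text, begin)

sample = "* first item\n* second item\n"
print(rest_of_line(sample, 2))  # first item
print(line_at(sample, 18))      # * second item
```

If content is a list of lines, as in the question, it would first have to be joined, for example with '\n'.join(content), before character offsets make sense.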