This post shows how to get the dependency parse of a chunk of text in CoNLL format using spaCy's tokenizer. Here is the solution:
import spacy
nlp_en = spacy.load('en')
doc = nlp_en(u'Bob bought the pizza to Alice')
for sent in doc.sents:
    for i, word in enumerate(sent):
        if word.head == word:
            head_idx = 0
        else:
            head_idx = word.head.i - sent[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s" % (
            i + 1,  # There's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_,  # Fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_  # Relation
        ))
It outputs this block:
1 Bob bob NNP PERSON 2 nsubj
2 bought buy VBD 0 ROOT
3 the the DT 4 det
4 pizza pizza NN 2 dobj
5 to to IN 2 dative
6 Alice alice NNP PERSON 5 pobj
I would like to get the same output without using doc.sents. I have my own sentence splitter, and I would like to use it and then give spaCy one sentence at a time to get the POS, NER, and dependencies.
How can I get the POS, NER, and dependency parse of a single sentence in CoNLL format with spaCy, without using spaCy's sentence splitter?
Answer 0 (score: 1)
A Document in spaCy is iterable, and the documentation states that it iterates over Tokens:
 |  __iter__(...)
 |      Iterate over `Token` objects, from which the annotations can be
 |      easily accessed. This is the main way of accessing `Token` objects,
 |      which are the main way annotations are accessed from Python. If faster-
 |      than-Python speeds are required, you can instead access the annotations
 |      as a numpy array, or access the underlying C data directly from Cython.
 |
 |      EXAMPLE:
 |          >>> for token in doc
Therefore, I believe you would just have to make a Document for each of your split sentences and run the same per-token loop over it.
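A minimal sketch of that per-sentence approach, assuming your own splitter yields plain strings and using the question's old-style 'en' model name (the helper and function names are illustrative, not spaCy API):

```python
def conll_head_index(token_i, head_i, sent_start_i):
    """Map spaCy's absolute head index to a 1-based, sentence-relative
    CoNLL head index; the root token (its own head) gets 0."""
    if head_i == token_i:
        return 0
    return head_i - sent_start_i + 1


def print_conll(doc):
    """Print one CoNLL-style line per token of a single-sentence Doc."""
    start = doc[0].i  # 0 when each sentence is parsed as its own Doc
    for i, word in enumerate(doc):
        head_idx = conll_head_index(word.i, word.head.i, start)
        print("%d\t%s\t%s\t%s\t%s\t%d\t%s" % (
            i + 1, word, word.lemma_, word.tag_,
            word.ent_type_, head_idx, word.dep_))


def parse_sentences(sentences):
    """Feed pre-split sentences to spaCy one at a time (import deferred
    so the pure helpers above work even where spaCy is not installed)."""
    import spacy
    nlp = spacy.load('en')  # same model name as in the question
    for sentence in sentences:
        print_conll(nlp(sentence))
        print()  # CoNLL separates sentences with a blank line
```

Checking the helper against the question's output: "Bob" (index 0) heads to "bought" (index 1), giving head index 2, and "bought", being its own head, gets 0.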
Of course, following the CoNLL format, you would have to print a newline after each sentence.
Answer 1 (score: 0)
This post is about users running into unexpected sentence breaks when using spaCy's sentence boundary detection. One of the solutions the spaCy developers proposed (as mentioned in the post) is to add the flexibility to define your own sentence-boundary detection rules. The problem is that in spaCy, sentence boundaries are resolved together with dependency parsing, not before it. So I don't believe this is something spaCy currently supports, although it may be in the near future.
Answer 2 (score: 0)
@ashu's answer is partly right: by design, spaCy couples dependency parsing and sentence boundary detection tightly. However, there is a simple sentencizer:
https://spacy.io/api/sentencizer
It seems the sentencizer uses only punctuation (not the perfect approach). But since such a sentencizer exists, you can create a custom one with your own rules, and it will certainly affect the sentence boundaries.
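As a sketch of the kind of rule a custom sentencizer could apply, here is a pure-Python boundary function (the semicolon rule and the function name are illustrative, not spaCy API; in spaCy, such a rule would set `token.is_sent_start` inside a pipeline component added before the parser):

```python
def sentence_start_flags(tokens):
    """Illustrative custom rule: a sentence starts at the first token
    and after every semicolon."""
    flags = [False] * len(tokens)
    if tokens:
        flags[0] = True
    for i, tok in enumerate(tokens[:-1]):
        if tok == ";":
            flags[i + 1] = True
    return flags
```

In a real custom component you would compute flags like these over the `Doc` and assign them to each token's `is_sent_start` before parsing, which is how your own boundaries end up respected downstream.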