Splitting strings by language of text

Time: 2018-02-22 22:43:23

Tags: r text nlp

I am working with a set of text documents (each text file is stored as one string). Most are in English, but some are in Spanish, and some repeat the same information in both English and Spanish. I used the cld3 package (an R implementation of Chrome's language detection) to estimate the language(s) contained in each string in the corpus. My goal is to process every string that contains both English and Spanish text so that the English part is kept and the Spanish part is removed.

Here are three examples of the strings I am working with:

mixed.language.strings <- c(
  "Department of Cultural Affairs and Special Events: Today will be the First Annual Mariachi and Folklorico Festival! Local groups begin at 1:00pm and world renowned headliners start at 3:00pm. It will be located in Millennium Park. Invite your friends, family, and neighbors to participate in this FREE event! \nEnjoy the weather on this beautiful Sunday! \n ************** \n Departamento de Asuntos Culturales y Eventos Especiales: Hoy será el Primer Festival Anual de Mariachi y Balet Folklórico! Los grupos locales comienzan a las 1:00 pm y los grupos de renombre mundial empiezan a las 3:00 pm. Será en el Millennium Park. Inviten a su familia, amigo@s, y vecin@s a este evento completamente GRATIS!",
  "Call or walk into our office for information on the Emergency Heating Repair Program which provides eligible low-income, owner-occupied homes grants for a new heating system.\n\nLlame o visite nuestra oficina para más información sobre un programa de la Ciudad ofreciendo dinero hacía la reparación o instalación de sistemas de calefacción. Dueños de casa de ingresos bajos son elegibles. \n\n 3476 S. Archer Ave. \n (773) 523-8250",
  "Join me and other local elected officials for a workshop on appealing your property taxes. Homes in West & South townships of Cook County are currently eligible to appeal. See flier for more info, or call my office at 773-523-8250.\n\nLos invito a un taller sobre el proceso de apalear sus impuestos de propiedad. Hogares en los West y East “townships” del Condado de Cook son elegibles ahora para apalear sus impuestos. Por favor refiéranse al volante añadido a este mensaje, o llame mi oficina al 773-523-8250, para más información."
)

As far as I can tell, cld3 can estimate the languages contained in a string, but it cannot extract parts of a string by language.

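Is there a different package in R that can identify which parts of each string are in each language, and split the string in two accordingly?

Thanks! Sorry if this is unclear; this is my first post.

2 Answers:

Answer 0: (score: 1)

Here is an implementation of the approach I suggested in the comments. It works if you expect a newline between the languages in each string. (That is the case in all of your examples. If it does not hold in general, you could try splitting on newlines, periods, exclamation marks, and question marks.)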

library('cld2')

list.of.strings <- strsplit(mixed.language.strings, '\n')
ExtractEnglishSubstrings <- function(string.vector) {
  return(string.vector[which(detect_language(string.vector) == 'en')])
}

lapply(list.of.strings, ExtractEnglishSubstrings)

This outputs

[[1]]
[1] "Department of Cultural Affairs and Special Events: Today will be the First Annual Mariachi and Folklorico Festival! Local groups begin at 1:00pm and world renowned headliners start at 3:00pm. It will be located in Millennium Park. Invite your friends, family, and neighbors to participate in this FREE event! "
[2] "Enjoy the weather on this beautiful Sunday! "                                                                                                                                                                                                                                                                         

[[2]]
[1] "Call or walk into our office for information on the Emergency Heating Repair Program which provides eligible low-income, owner-occupied homes grants for a new heating system."

[[3]]
[1] "Join me and other local elected officials for a workshop on appealing your property taxes. Homes in West & South townships of Cook County are currently eligible to appeal. See flier for more info, or call my office at 773-523-8250."

If you would rather stitch the substrings back together and return a vector instead of a list of vectors, this modification should do it...

ExtractEnglishSubstrings <- function(string.vector) {
  english.vector <- string.vector[which(detect_language(string.vector) == 'en')]
  reassembled.string <- paste0(english.vector, collapse=' ')
  return(reassembled.string)
}

unlist(lapply(list.of.strings, ExtractEnglishSubstrings))

which returns

[1] "Department of Cultural Affairs and Special Events: Today will be the First Annual Mariachi and Folklorico Festival! Local groups begin at 1:00pm and world renowned headliners start at 3:00pm. It will be located in Millennium Park. Invite your friends, family, and neighbors to participate in this FREE event!  Enjoy the weather on this beautiful Sunday! "
[2] "Call or walk into our office for information on the Emergency Heating Repair Program which provides eligible low-income, owner-occupied homes grants for a new heating system."                                                                                                                                                                                    
[3] "Join me and other local elected officials for a workshop on appealing your property taxes. Homes in West & South townships of Cook County are currently eligible to appeal. See flier for more info, or call my office at 773-523-8250."   
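
The parenthetical suggestion at the start of this answer (splitting on sentence-ending punctuation as well as newlines) could be sketched roughly as follows. This is only an illustration under assumptions: the helper name SplitIntoSentences and the regular expression are not from the original answer, and the pattern will also split after abbreviations such as "S." or "Ave.", so it would likely need tuning for real data.

```r
# Hypothetical helper (not part of the original answer): split one string into
# sentence-like pieces at newlines and after '.', '!' or '?' followed by whitespace.
SplitIntoSentences <- function(string) {
  pieces <- unlist(strsplit(string, '(?<=[.!?])\\s+|\\n+', perl = TRUE))
  # Trim surrounding whitespace and drop pieces that end up empty.
  pieces <- trimws(pieces)
  return(pieces[nzchar(pieces)])
}

# Made-up example string (illustrative only).
example.string <- 'Join us on Sunday! Doors open at noon.\nLos invito a un taller. Por favor llame a mi oficina.'
SplitIntoSentences(example.string)
```

Each piece could then be passed to detect_language in the same way as the newline-separated pieces above. Keep in mind that shorter pieces give the detector less text to work with.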

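Answer 1: (score: 0)

For anyone else dealing with a similar problem, I made a few modifications to the excellent code provided by HarlandMason.

The code below keeps substrings, like the code HarlandMason provided, with two changes: (1) instead of taking a vector of strings as input, it takes a single string and outputs a single string; (2) it lets you specify the language of the substrings you want to keep (in the format expected by the cld2::detect_language function).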

KeepSubstrings.bylanguage <- function(string, language) {
  string.vector <- unlist(strsplit(string, '\n'))
  cut.vector <- string.vector[which(cld2::detect_language(string.vector) == language)]
  reassembled.string <- paste0(cut.vector, collapse=' ')
  return(reassembled.string)
}

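The code below is similar, but it removes substrings in a given language instead of keeping only substrings in a given language (useful if some substrings are very short, which can trip up the detect_language function, and you would rather err on the side of keeping substrings whose language the function cannot determine):

RemoveSubstrings.bylanguage <- function(string, language) {
  string.vector <- unlist(strsplit(string, '\n'))
  detected <- cld2::detect_language(string.vector)
  # Drop only lines detected as `language`; lines where detection returned NA
  # (undetermined) are kept, which a bare which(detected != language) would not do.
  cut.vector <- string.vector[is.na(detected) | detected != language]
  reassembled.string <- paste0(cut.vector, collapse=' ')
  return(reassembled.string)
}

As suggested in the previous answer, both of these functions can be applied to a vector of strings using lapply.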
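
The point about applying these functions to a vector of strings can be illustrated roughly as follows. This sketch assumes the cld2 package is installed. The example.documents vector is made up for illustration, and vapply is used in place of lapply only so that the result comes back as a character vector rather than a list.

```r
# The keep function from this answer, repeated so the block runs on its own.
KeepSubstrings.bylanguage <- function(string, language) {
  string.vector <- unlist(strsplit(string, '\n'))
  cut.vector <- string.vector[which(cld2::detect_language(string.vector) == language)]
  return(paste0(cut.vector, collapse = ' '))
}

# Made-up bilingual documents (illustrative only, not from the question).
example.documents <- c(
  'Call or walk into our office for information about the heating repair program.\nLlame o visite nuestra oficina para obtener detalles sobre el programa.',
  'Join me and other local officials for a workshop on property taxes.\nLos invito a un taller sobre los impuestos de propiedad.'
)

# One output string per input string; `language` is passed through to the function.
english.only <- vapply(example.documents, KeepSubstrings.bylanguage,
                       character(1), language = 'en', USE.NAMES = FALSE)
english.only
```

The same call pattern should work for RemoveSubstrings.bylanguage, for example with language = 'es' to strip the Spanish lines.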