Deep learning word2vec for small text

Time: 2016-03-16 19:03:08

Tags: nlp text-mining word2vec

I am using word2vec in R, from [here][1].

My data comes from a csv file. Here is my data:

net
abap
access
account management
accounting
active directory
agile methodologies
agile project management
ajax
algorithms
analysis
android
android development
angularjs
ant
apache
asp
asp net
banking
bb
bpmn
budgets
business analysis
business development
business intelligence
business planning
business process
business process design
business strategy
c
change management
channel partners
cisco technologies
cloud computing
cms
competitive analysis
computer hardware
computer science
consulting
contract negotiation
corporate communications
crm
css
customer service
cvs
data analysis
data center
data migration
data warehousing
database design
databases
db
design patterns
direct sales
drupal
eclipse
ecommerce
economics
editing
ejb
english
enterprise architecture
enterprise software
erp
european union
event management
finance
financial analysis
firewalls
forecasting
french
git
hardware
help desk support
hibernate
html
human resources
iis
incident management
integration
it management
it service management
it strategy
itil
java
java enterprise edition
javascript
jboss application server
jdbc
jee
jira
jms
joomla
jpa
jquery
jsf
json
jsp
junit
key account management
leadership
linux
management
management consulting
market research
marketing
marketing communications
marketing strategy
matlab
maven
microsoft excel
microsoft exchange
microsoft office
microsoft sql server
microsoft word
mobile applications
mobile devices
ms project
mysql
negotiation
netbeans
network administration
network security
networking
new business development
object oriented design
oop
operating systems
oracle
oracle applications
oracle sql
outsourcing
photoshop
php
plsql
pmo
pmp
postgresql
powerpoint
presales
problem solving
product development
product management
product marketing
program management
programming
project management
project planning
project portfolio
public relations
public speaking
python
quality assurance
requirements analysis
requirements gathering
research
rest
retail
risk management
rup
saas
sales
sales management
sales operations
sap
sap erp
sap r
scrum
security
selenium
seo
servers
servlets
sharepoint
shell scripting
soa
soap
social media
social media marketing
social networking
software design
software development
software documentation
software engineering
software installation
software project
software quality
solution architecture
solution selling
spring
spring framework
spss
sql
sql server
startups
strategic planning
strategy
struts
subversion
system administration
systems analysis
tcpip
teaching
team building
team leadership
team management
teamwork
technical support
telecommunications
testing
tomcat
training
troubleshooting
tsql
uml
unix
unix shell scripting
user acceptance testing
vb net
virtualization
visio
visual basic
visual studio
vmware
voip
vpn
web applications
web design
web development
web services
weblogic
windows
windows server
wordpress
xml
xslt

I want to extract sets of text so that I can classify the words. I am using the following code with word2vec.

library(wordVectors)
model = train_word2vec("C:/Users/Desktop/input.csv",output="C:/Users/Desktop/output.vectors",threads = 3,vectors = 100,window=12)
nearest_to(model,model[["bussiness"]])

I expect to see the words nearest to business, since I can see from the input file that they are present, but nearest_to only returns NA:

> nearest_to(model,model[["bussiness"]])
<NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 
  NA   NA   NA   NA   NA   NA   NA   NA   NA   NA 

What can I do to fix the problem in my code?

  [1]: https://github.com/bmschmidt/wordVectors

1 answer:

Answer 0: (score: 1)

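Look at the definitions of the parameters you are passing. window = 12 makes no sense when your lines contain at most two or three words. In general, you will not get anything out of word2vec with as little text as you have provided here. You need a metric and a resource that do not depend on co-occurrence. Use WordNet or Roget's Thesaurus. Take a look at this (it might be useful...).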
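
To make the point about line length concrete, here is a minimal base R sketch that inspects a small sample of the input before any training. The `skills` vector is only an excerpt copied from the data in the question, and the variable names are made up for illustration. It assumes each line of the file is treated as its own sentence, so a context window can never reach past the end of a line. It also checks the two spellings involved: the query in the question uses "bussiness", which does not appear in the input, while "business" does. Whether that alone explains the NA output is not confirmed by the thread.

```r
# A small sample of lines copied from the input shown in the question.
skills <- c("business analysis", "business development",
            "business intelligence", "business planning",
            "business process", "business process design",
            "business strategy", "new business development",
            "java", "java enterprise edition", "sql", "sql server")

# Split each line into whitespace-separated tokens.
tokens_per_line <- strsplit(skills, "\\s+")

# Number of words on each line: an upper bound on the context
# available per line, under the one-line-per-sentence assumption.
line_lengths <- lengths(tokens_per_line)
print(max(line_lengths))

# Frequency of every token in the sample.
token_counts <- table(unlist(tokens_per_line))
print(sort(token_counts, decreasing = TRUE))

# The spelling used in the query versus the spelling present in the data.
print("bussiness" %in% names(token_counts))
print("business" %in% names(token_counts))
```

Running the same counts on the full file shows how many tokens occur only once. As far as I know, train_word2vec also has a min_count argument (default 5, if I remember correctly) that drops rarer words from the vocabulary, so it is worth confirming that before relying on it.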