Cluster-Driven Model for Better Word and Text Embedding

This is the source code for reproducing the results reported in the paper 'Cluster-Driven Model for Better Word and Text Embedding', which was accepted at ECAI 2016 (http://ebooks.iospress.nl/volumearticle/44747). Publication of the paper is not the end: we will continue updating the code and other material related to the paper in the coming months.

The project includes the following parts:

(1) The first part covers sentiment classification tasks. Introducing global information improves results by over 4 percent. Our experimental results also shed some light on the Paragraph Vector (PV) models [Le and Mikolov, 2014]. We show that the superiority of PV comes from the global information that the paragraph vector introduces indirectly, rather than from the way paragraph vectors are trained. (PV trains paragraph vectors in a predictive way, the same as the models in the word2vec toolkit.)

Run go.sh to observe the effectiveness of global information on sentiment analysis:
chmod +x go.sh
sudo ./go.sh

(2) The paper provides a theoretical analysis that unveils the relationships among PV-DBOW (a variant of PV), the term-document SPPMI (shifted positive pointwise mutual information) matrix, and our centric model. These relationships are also noted in [Sun et al, 2015]. However, we are the first to use the SPPMI matrix to obtain text representations. The experimental results show that our novel co-occurrence matrix can still produce near state-of-the-art results on sentiment classification. After downloading the datasets, run the following:

cd svd
sudo ./svd.sh
sudo ./svd_rt2k.sh
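
To make the term-document SPPMI idea concrete, here is a minimal Python sketch. It is not the code run by the scripts in the svd folder. The function names, the toy count matrix, the shift value, and the choice to weight the left singular vectors by the square root of the singular values are all assumptions made for illustration.

```python
import numpy as np


def sppmi_term_document(counts, shift=1.0):
    """Return max(PMI(doc, term) - log(shift), 0) for a (n_docs, n_terms) count matrix."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    doc_totals = counts.sum(axis=1, keepdims=True)
    term_totals = counts.sum(axis=0, keepdims=True)
    # Count expected in each cell if documents and terms were independent.
    expected = doc_totals * term_totals / total
    sppmi = np.zeros_like(counts)
    nonzero = counts > 0  # zero counts keep an SPPMI of 0
    sppmi[nonzero] = np.log(counts[nonzero] / expected[nonzero]) - np.log(shift)
    return np.maximum(sppmi, 0.0)


def document_vectors(sppmi, dim):
    """Factorize the SPPMI matrix with a truncated SVD and return one row per document."""
    u, s, _ = np.linalg.svd(sppmi, full_matrices=False)
    # Assumption: scale by sqrt of the singular values; other weightings are possible.
    return u[:, :dim] * np.sqrt(s[:dim])


# Hypothetical toy corpus: 3 documents, 4 terms.
counts = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
])
sppmi = sppmi_term_document(counts, shift=1.0)
doc_vecs = document_vectors(sppmi, dim=2)
print(sppmi.round(3))
print(doc_vecs.shape)
```

The resulting document vectors could then be fed to a linear classifier for sentiment classification. A larger shift zeroes out more weak associations and makes the matrix sparser.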

(3) The third part covers word analogy tasks. Experimental results show that global information significantly improves accuracy on semantic questions.
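
For readers unfamiliar with the task, the sketch below shows how an analogy question "a is to b as c is to ?" is commonly answered with word vectors: take the word whose normalized vector is closest to b - a + c, excluding the three question words. The function name and the toy embeddings are hypothetical, and we assume this standard procedure is the one meant here; it is not taken from this repository's code.

```python
import numpy as np


def solve_analogy(embeddings, a, b, c):
    """Answer 'a : b :: c : ?' with the nearest cosine neighbour of b - a + c."""
    words = list(embeddings)
    matrix = np.array([embeddings[w] for w in words], dtype=float)
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)  # unit-length rows
    index = {w: i for i, w in enumerate(words)}
    target = matrix[index[b]] - matrix[index[a]] + matrix[index[c]]
    # Dot products with unit-length rows rank candidates by cosine similarity.
    scores = matrix @ target
    for w in (a, b, c):
        scores[index[w]] = -np.inf  # the question words may not be the answer
    return words[int(np.argmax(scores))]


# Hypothetical toy embeddings, chosen by hand for illustration only.
toy = {
    "man": [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [-1.0, 0.1, 0.0],
}
answer = solve_analogy(toy, "man", "woman", "king")
print(answer)
```

Accuracy on an analogy benchmark is then the fraction of questions for which the returned word matches the expected one.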

### Contact us

Zhe Zhao, Renmin University of China.
Thanks to Shen Li for his help and contributions to this project.
If you have any questions about the project, please contact me: 1152543959@qq.com

### Acknowledgements

We thank Grégoire Mesnil et al. for their Paragraph Vector implementation and Tomas Mikolov et al. for their word2vec implementation. Their work taught us a lot!

### References

[Le and Mikolov, 2014] Distributed representations of sentences and documents
[Sun et al, 2015] Learning word representations by jointly modeling syntagmatic and paradigmatic relations
