RusVectōrēs: A Yearly Report

RusVectōrēs is a web service for distributional semantics created at the School of Linguistics. The service allows users to play with distributional semantic models (a.k.a. word embeddings) right in the browser. Last year we added lots of new features, and we are happy to share the news with you.

Frameworks and algorithms for training word embedding models (word2vec, GloVe, fastText and others) have revolutionized NLP in recent years. These approaches enable computers to learn approximations of word meaning from co-occurrence statistics calculated on large text corpora. With RusVectōrēs, one can try out distributional semantic models trained on various corpora for the Russian language. Alternatively, one can download them and work locally on one's own machine.
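To make the idea of "meaning from co-occurrence" more concrete, here is a minimal sketch in plain Python. It is not how the RusVectōrēs models are trained (they use the algorithms named above); it only counts neighbours in a small window and compares the resulting count vectors with cosine similarity. The function names, the window size and the toy corpus are our assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            start = max(0, i - window)
            end = min(len(sent), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][sent[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy lemmatized corpus (hypothetical data, far too small for real use).
toy_corpus = [
    ["кошка", "пить", "молоко"],
    ["собака", "пить", "вода"],
    ["кошка", "ловить", "мышь"],
    ["собака", "ловить", "мяч"],
    ["машина", "ехать", "дорога"],
]

vectors = cooccurrence_vectors(toy_corpus)
cat_dog = cosine(vectors["кошка"], vectors["собака"])
cat_car = cosine(vectors["кошка"], vectors["машина"])
```

In this toy corpus «кошка» and «собака» share the contexts «пить» and «ловить», so their similarity is higher than that of «кошка» and «машина», which share none. Real models replace raw counts with dense vectors learned from billions of words, but the intuition is the same.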

To walk the user through our web service, we prepared a screencast with a brief explanation of the main features (in Russian):

RusVectōrēs can be used to explore the possibilities of distributional semantics, to promptly check linguistic hypotheses, or as a classroom tool.

In 2016, the service improved in the following ways:

Our new address is http://rusvectores.org/en. The old address http://ling.go.mail.ru/dsm is still alive, but it is better to use the new one.
The models have been retrained on updated corpora: the news corpus includes texts up to November 2016, the Wikipedia dump has been updated to the same date, and the texts from the Russian National Corpus have been extracted more completely.
All corpora have been processed with a language identifier. Thus, stray sentences in Ukrainian, Belarusian or Kazakh have been filtered out.
Previously, our models used part-of-speech tags in the Mystem/RNC standard. To facilitate multilingual comparison of results, we now use Universal PoS Tags. Thus, «модель_S» has become «модель_NOUN». You can still enter a query without any PoS tag at all: RusVectōrēs will detect it automatically.
Two-word entities with high collocation strength (measured by PMI) were glued together with «::» and now have their own representations (vectors). As a result, our models now contain some bigrams, for example, «боб::дилан_NOUN».
The performance of all models was evaluated on the widely known SimLex999 and Google Analogies test sets.
As you type a query, there are now adaptive hints with the words actually present in the models. If there are no hints, it does not always mean that there is no such word in the models: it is still possible that the word is simply not frequent enough to appear as a hint.
It is now possible to get the similarity measure for a word pair through the API. You can also get the results of queries for associates in two formats: JSON and CSV. Learn more on the «About» page!
Many small bugs and errors were fixed. We have possibly added some new errors, but we will eventually fix them too.
The framework behind RusVectōrēs has been released on GitHub as an open-source project, WebVectors. This means that you can now create your own web service similar to RusVectōrēs for the models and languages you are interested in. You can look at an example of such a service for Norwegian and English. In April, we are going to present WebVectors at the demo session of EACL-2017. If you happen to visit it, we would be happy to hear your opinion in person!
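For readers curious about the bigram gluing mentioned in the list above, here is a minimal sketch of the general technique: compute pointwise mutual information (PMI) for adjacent word pairs and join the strong ones with «::». This is an illustration under our own assumptions, not the exact RusVectōrēs preprocessing code. The function name `glue_bigrams`, the `min_count` and `threshold` values and the toy corpus are hypothetical, and PoS tags are omitted for brevity.

```python
import math
from collections import Counter


def glue_bigrams(sentences, min_count=2, threshold=1.0):
    """Join adjacent word pairs whose PMI is at least `threshold` with '::'.

    `min_count` and `threshold` are illustrative values, not the settings
    used for the published models.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def pmi(pair):
        # PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) )
        first, second = pair
        p_pair = bigrams[pair] / n_bi
        p_first = unigrams[first] / n_uni
        p_second = unigrams[second] / n_uni
        return math.log2(p_pair / (p_first * p_second))

    strong = {
        pair
        for pair, count in bigrams.items()
        if count >= min_count and pmi(pair) >= threshold
    }

    glued = []
    for sent in sentences:
        out = []
        i = 0
        while i < len(sent):
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) in strong:
                out.append(sent[i] + "::" + sent[i + 1])
                i += 2  # skip the second half of the glued pair
            else:
                out.append(sent[i])
                i += 1
        glued.append(out)
    return glued


# Toy lemmatized corpus (hypothetical data).
corpus = [
    ["боб", "дилан", "петь", "песня"],
    ["боб", "дилан", "получить", "премия"],
    ["певец", "петь", "новый", "песня"],
    ["писатель", "получить", "новый", "премия"],
]

glued_corpus = glue_bigrams(corpus)
```

Here only «боб дилан» occurs often enough and with high enough PMI to be glued, so the first sentence becomes «боб::дилан», «петь», «песня». The `min_count` filter matters in practice because PMI tends to overrate rare pairs.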