EMERGING MARKETS INFORMATION SERVICE
Emerging Markets Information Service (EMIS) delivers deep, rich company and industry information, alongside the relevant proprietary and multi-source news, research and analytics that allow professionals to make profitable decisions faster.
This single resource of hard-to-get information covers more than 100 emerging markets, includes company profiles and financials from more than 1.3 million listed and private companies, offers single company and industry analysis, and delivers proprietary and multi-source news and research from over 9,000 publications, all delivered via an easy-to-use interface.
The challenge
The EMIS team came to us with a legacy commercial search installation that suffered from a poor user experience, poor relevancy, and a high cost of ownership. They wanted to scale for more dynamic and personalised content, and needed a backend that would future-proof their growth.
The result
Internet Securities Inc. (ISI, a subsidiary of Euromoney Institutional Investor) and Artirix are proud to announce the launch of a new platform for the Emerging Markets Information Service (EMIS, www.securities.com). The result is a rich, highly accurate and personalised new application that helps EMIS subscribers discover important news and research more effectively.
The upgraded solution runs on Artirix’s scalable cloud platform. It provides dynamic search and personalised content pages per user, with greatly improved quality in all languages, including Mandarin Chinese. The backend covers over 1.2 terabytes of full text and structured metadata, and is future-proofed for growth. In addition, Artirix’s service model has allowed ISI to reduce costs and operational risk.
The solution
In summary, the Artirix platform for this service supports 200 million financial news articles, which equates to a 1.2TB index in Elasticsearch, the core search technology in our platform.
It serves free-text queries in any of 15 languages, combined with structured filters, and the user interface uses around 10 facets.
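As a rough illustration of a free-text query combined with structured filters and facets, the sketch below builds an Elasticsearch request body as a plain Python dictionary. The function name, the per-language field naming (`body_en`, `body_pt`, and so on), and the filter and facet field names are all assumptions made for the example, not the project's schema. It also uses the `bool` query and `aggs` syntax of current Elasticsearch releases, which is newer than the 2013 deployment described here.

```python
def build_search_body(text, language, filters, facet_fields, size=10):
    """Build a search request body: free text + structured filters + facets.

    All field names are hypothetical; adapt them to the real index mapping.
    """
    return {
        "size": size,
        "query": {
            "bool": {
                # Free-text part, scored, against a per-language text field.
                "must": [{"match": {f"body_{language}": text}}],
                # Structured filters, not scored.
                "filter": [
                    {"terms": {field: values}}
                    for field, values in filters.items()
                ],
            }
        },
        # One terms aggregation per facet shown in the user interface.
        "aggs": {field: {"terms": {"field": field}} for field in facet_fields},
    }


body = build_search_body(
    "central bank rate decision",
    "en",
    filters={"country": ["BR", "MX"]},
    facet_fields=["country", "industry", "source"],
)
```

The resulting dictionary can be serialised to JSON and sent to a search endpoint; keeping filters out of the scored part of the query is what lets facets and filters be combined freely without affecting relevance ranking.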
Under the hood of our platform we combined several technologies to support this scale:
- Artirix document processing – to analyse, augment and normalize inbound data
- MongoDB – for storage alongside Elasticsearch
- RabbitMQ – to manage the data flow into two separate clusters, together with a custom river for Elasticsearch
- Artirix Cottontail – to apply index status updates received from RabbitMQ to the documents in MongoDB
- Elasticsearch – for the core search system, with some customizations for analysis and query extensions
- Customer Query API – to translate textual queries into Elasticsearch DSL
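To give a feel for the last item, translating a textual query into Elasticsearch DSL, here is a minimal sketch. The query syntax it accepts (quoted phrases, `field:value` filters, bare words) and the `body` field name are assumptions chosen for illustration; the syntax of the actual Query API is not described here.

```python
import re

# One token is a "quoted phrase", a field:value pair, or a bare word.
TOKEN = re.compile(r'"([^"]+)"|(\w+):(\S+)|(\S+)')


def translate(query):
    """Translate a small textual query language into a bool query."""
    must, filters, words = [], [], []
    for phrase, field, value, word in TOKEN.findall(query):
        if phrase:
            # Quoted text must match as a phrase.
            must.append({"match_phrase": {"body": phrase}})
        elif field:
            # field:value becomes an unscored structured filter.
            filters.append({"term": {field: value}})
        else:
            words.append(word)
    if words:
        # Remaining bare words form one ordinary full-text match.
        must.append({"match": {"body": " ".join(words)}})
    return {"query": {"bool": {"must": must, "filter": filters}}}


dsl = translate('"interest rate" inflation country:BR')
```

A real translator would also need escaping rules, boolean operators and error handling; this only shows the general shape of the mapping from text to DSL.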
BENCHMARKING & HOSTING
To ensure the service met the expected data volume and query load, we benchmarked extensively on Amazon EC2, testing various instance types against the number of shards per index to find the best balance of cost and performance. In the end we settled on SSD-backed instances from Amazon.
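One simple way to turn such benchmark runs into a decision is to pick the cheapest configuration that still meets a latency target. The sketch below assumes that selection rule; the record fields, the instance labels and every number in it are placeholders for illustration, not measurements from this project.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchmarkRun:
    instance_type: str       # hypothetical label, not a real EC2 type name
    shards_per_index: int
    p95_latency_ms: float
    hourly_cost_usd: float


def cheapest_within_target(runs, max_p95_latency_ms) -> Optional[BenchmarkRun]:
    """Return the lowest-cost run whose p95 latency meets the target."""
    eligible = [r for r in runs if r.p95_latency_ms <= max_p95_latency_ms]
    return min(eligible, key=lambda r: r.hourly_cost_usd, default=None)


# Placeholder values only.
runs = [
    BenchmarkRun("type-a", 5, 900.0, 1.0),
    BenchmarkRun("type-a", 10, 450.0, 1.0),
    BenchmarkRun("type-b", 10, 200.0, 3.0),
]
best = cheapest_within_target(runs, max_p95_latency_ms=500.0)
```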
APIS
The service is accessed via REST APIs: an index API for adding and deleting documents, and a query API for searching.
ARCHITECTURE
Parts of the platform architecture were inspired by talks at an Elasticsearch London Meetup. The architecture is best understood by following a single document through the system:
MongoDB is used as the canonical data store, and holds both the document data and the current status of each document. Documents added to the system are stored in a dual-region MongoDB replica set as soon as possible after they are received by the Artirix Index API. They are then pushed onto a RabbitMQ queue, from which they are received by the custom RabbitMQ river running in Elasticsearch. This river indexes the documents and posts a status message about each document back to RabbitMQ. Finally, the message is picked up by Artirix Cottontail, which updates the status of the document in MongoDB.
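The document flow above can be walked through with an in-memory simulation. In this sketch a dictionary stands in for MongoDB, `queue.Queue` objects stand in for the RabbitMQ queues, and another dictionary stands in for Elasticsearch; the class, method and status names are assumptions made for the example, and none of the real client libraries are used.

```python
import queue


class PipelineSketch:
    """In-memory stand-ins for the stores and queues in the document flow."""

    def __init__(self):
        self.mongo = {}                    # canonical store: data + status
        self.index_queue = queue.Queue()   # documents waiting for the river
        self.status_queue = queue.Queue()  # status messages from the river
        self.search_index = {}             # stand-in for Elasticsearch

    def index_api_add(self, doc_id, body):
        # Step 1: persist first, then queue the document for indexing.
        self.mongo[doc_id] = {"body": body, "status": "received"}
        self.index_queue.put({"id": doc_id, "body": body})

    def river_step(self):
        # Step 2: the river indexes the document and reports back.
        message = self.index_queue.get_nowait()
        self.search_index[message["id"]] = message["body"]
        self.status_queue.put({"id": message["id"], "status": "indexed"})

    def cottontail_step(self):
        # Step 3: the status message is written back to the canonical store.
        message = self.status_queue.get_nowait()
        self.mongo[message["id"]]["status"] = message["status"]


pipeline = PipelineSketch()
pipeline.index_api_add("doc-1", "Central bank holds rates")
status_after_add = pipeline.mongo["doc-1"]["status"]
pipeline.river_step()
pipeline.cottontail_step()
status_after_index = pipeline.mongo["doc-1"]["status"]
```

Because the canonical store records the status separately from the search index, a document that was stored but never confirmed as indexed can be found and replayed.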
We run two independent Elasticsearch clusters. This allows one to be rebuilt in the case of disaster while the other serves search requests.
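A minimal sketch of how search traffic could be routed while one cluster is being rebuilt is shown below. The router class and the cluster names are hypothetical; how requests are actually switched between the two clusters is not described here.

```python
class ClusterRouter:
    """Track which clusters can serve searches and pick one."""

    def __init__(self, cluster_names):
        self.serving = {name: True for name in cluster_names}

    def start_rebuild(self, name):
        # Take the cluster out of rotation while it is rebuilt.
        self.serving[name] = False

    def finish_rebuild(self, name):
        self.serving[name] = True

    def pick(self):
        # Return the first cluster that is currently serving.
        for name, is_serving in self.serving.items():
            if is_serving:
                return name
        raise RuntimeError("no cluster is available to serve searches")


router = ClusterRouter(["cluster-a", "cluster-b"])
router.start_rebuild("cluster-a")
serving_during_rebuild = router.pick()
router.finish_rebuild("cluster-a")
serving_after_rebuild = router.pick()
```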
ELASTICSEARCH
The documents contain data in 15 languages. These are handled by the built-in language analyzers in Elasticsearch. In addition, we use a custom analysis step to index each word in both its stemmed and unstemmed form. This makes it possible, for example, to require exact word matches when performing phrase searches, while still using stemming at other times.
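The custom analysis step itself is not shown here. A comparable effect can be approximated with stock Elasticsearch features by indexing the same text twice through multi-fields: once with a stemming language analyzer and once with the non-stemming `standard` analyzer. The sketch below assumes that approach, a single English field named `body`, and a convention that double-quoted input means a phrase search; all three are illustrative choices.

```python
# Index definition: "body" is stemmed, "body.exact" keeps the original words.
INDEX_DEFINITION = {
    "mappings": {
        "properties": {
            "body": {
                "type": "text",
                "analyzer": "english",
                "fields": {
                    "exact": {"type": "text", "analyzer": "standard"}
                },
            }
        }
    }
}


def text_clause(user_input):
    """Use the unstemmed field for quoted phrases, the stemmed one otherwise."""
    stripped = user_input.strip()
    is_quoted = (
        len(stripped) >= 2
        and stripped.startswith('"')
        and stripped.endswith('"')
    )
    if is_quoted:
        return {"match_phrase": {"body.exact": stripped[1:-1]}}
    return {"match": {"body": stripped}}


phrase_clause = text_clause('"rising rates"')
loose_clause = text_clause("rising rates")
```

With this layout a phrase search for "rising rates" does not match "rise rate", while the unquoted search still benefits from stemming.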
We also built a custom plugin to allow the use of wildcards in span queries.
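The plugin is not reproduced here. Current Elasticsearch releases provide a built-in `span_multi` query that wraps a wildcard query so it can take part in span queries, and the sketch below builds a `span_near` query on that stock feature rather than on the custom plugin. The helper name and the `body` field are assumptions for the example.

```python
def span_near_with_wildcards(field, terms, slop=0):
    """Build a span_near query in which any term may contain * or ?."""
    clauses = []
    for term in terms:
        if "*" in term or "?" in term:
            # Wildcard terms are wrapped so they can act as span clauses.
            clauses.append(
                {"span_multi": {"match": {"wildcard": {field: {"value": term}}}}}
            )
        else:
            clauses.append({"span_term": {field: term}})
    return {"span_near": {"clauses": clauses, "slop": slop, "in_order": True}}


span_query = span_near_with_wildcards("body", ["emerging", "mark*"])
```

Here the query matches "emerging" immediately followed by any word starting with "mark", such as "market" or "markets".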
VOLUMES
- Documents: 204m
- MongoDB Index Size: 1.2TB
- Elasticsearch Index Size: 1.2TB
Year: 2013
Services
- Software Development
- Data + Content Search
- Support + Maintenance