Enterprise Search

Enterprise Search using ElasticSearch

A typical Software as a Service (SaaS) platform holds a lot of data specific to each enterprise that uses the platform. More loosely connected data gets stored as new features and modules are added to the platform. As features evolve, users need to search across modules and across both structured and unstructured data, such as PDFs and Word documents. “Enterprise search” describes the ability to search for such data within an enterprise. In the SaaS platform I worked on, each module implemented its own search using SQL queries, which led to a fragmented User Experience (UX) for customers. In this document, I go over the architecture and implementation I led to enable ‘Enterprise Search’ for the platform.

Technology Evaluation:

The following search technologies were evaluated:

Azure Cognitive Search
ElasticSearch
SQL full-text search

The evaluation criteria included scalability, performance, price, API support, and maintenance and operations effort. Based on these criteria, ElasticSearch was selected as the search technology for the solution.

Data Flow Diagram:

The figure below shows the architectural flow of the solution.

The critical architecture components are:

Search Data Ingestion: This service receives data from all the modules. The data is sent via a REST API, and the JSON data format follows a well-defined ‘contract.’ The service then publishes the data as an ElasticSearch document in the appropriate index.
Search Service: User-entered search text from the UI is sent to the Search Service. The service converts the text to a query and sends it to ElasticSearch. The returned result is formatted for display in the UI. The service further aggregates the results around defined attributes.
Mapping library: One of the requirements was to support Customer Managed Keys (CMK) for encryption of data at rest. This requires supporting multiple ES clusters in one environment. To this end, a mapping was maintained in the SQL server between the tenantID and the cluster hosting the data for that tenant. The library held a cache of this mapping and provided it to the Data Ingestion Service and the Search Service.
Kafka message broker and modules: The various modules that provide services to users store their data in an SQL server. The Search Data Ingestion service offers a library that defines the REST contract (schema) for data ingestion. All the modules use this library to publish their data to a Kafka topic. The consumer in the Data Ingestion Service then processes the data, determines the destination ElasticSearch (ES) cluster and index, and pushes the data into ES.
Onboarding Service: When tenants buy new modules and features, the Onboarding Service handles the request. This service (in addition to other onboarding functions) updates the tenantID to ES cluster assignment in the database and sends an event to the modules to start ingesting data into ES.
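To make the interplay between the ingestion contract and the mapping library concrete, here is a minimal sketch in Python. The field names, the index naming scheme and the cluster URLs are all hypothetical placeholders chosen for illustration; the platform's actual contract is not reproduced here.

```python
from dataclasses import dataclass, asdict


# Hypothetical ingestion contract: these field names are illustrative only.
@dataclass(frozen=True)
class SearchDocument:
    tenant_id: str
    object_type: str
    object_id: str
    title: str
    body: str = ""


class ClusterMapping:
    """In-memory cache of tenantID -> ES cluster, as loaded from the SQL mapping table."""

    def __init__(self, rows):
        # rows: iterable of (tenant_id, cluster_url) pairs
        self._by_tenant = dict(rows)

    def cluster_for(self, tenant_id):
        try:
            return self._by_tenant[tenant_id]
        except KeyError:
            raise LookupError(f"tenant {tenant_id!r} is not onboarded for search") from None


def route(doc, mapping):
    """Return (cluster_url, index_name, doc_id, source) for one ingested document."""
    cluster = mapping.cluster_for(doc.tenant_id)
    index = f"search-{doc.object_type.lower()}"  # assumed index-per-object-type naming
    doc_id = f"{doc.tenant_id}:{doc.object_id}"  # assumed document id scheme
    return cluster, index, doc_id, asdict(doc)


# Placeholder URLs: one shared cluster and one dedicated cluster for a CMK tenant.
mapping = ClusterMapping([
    ("t1", "https://es-shared.example.internal:9200"),
    ("t2", "https://es-cmk-t2.example.internal:9200"),
])
cluster, index, doc_id, source = route(SearchDocument("t2", "Invoice", "42", "Q3 invoice"), mapping)
```

Both the ingestion consumer and the Search Service would consult the same cached mapping, so a tenant's reads and writes always land on the same cluster.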

Architecture and Technical details:

Access Control: Role-Based Access Control (RBAC) is built into most SaaS platforms by assigning privileges (permissions) to roles and assigning roles to users or groups. Translating the same controls to ElasticSearch can be a challenge, as dynamic JOIN queries are not supported in ES. To apply the same controls that are present on the SQL server data and the APIs accessing it, I came up with a two-step approach:

Document Ingestion: As part of ingestion, each document stores fields that identify its access controls. TenantID identifies the enterprise, and groups within a tenant can be identified by OrganizationID. Further, for each type (class) of object that is ingested, an ObjectType is stored. If the SaaS platform has well-defined roles, the document can also store the roles that can access it. In some cases where documents are directly assigned to the end user, a parent-child relationship (queried in ES with has_child and has_parent) can be created at the time of ingestion.
Search Request: The backend API that handles the search request applies the standard controls that are applied to any API. In addition, the organization hierarchy, object type permissions and user context are specified in the search query as filter criteria. The restricted query results provide the same controls as are present in the SQL server.

ES provides a collection of mechanisms for access control. However, these may not translate directly to the business need; in such cases, the two-step approach works well.
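As an illustration of the second step, the sketch below builds an ES query body in which the user's text is scored normally while the access controls are applied as non-scoring filters. The field names (tenant_id, organization_id, object_type, allowed_roles) are assumptions standing in for whatever the ingestion step actually stored, and are assumed to be keyword fields.

```python
def build_search_query(text, tenant_id, org_ids, object_types, roles):
    """Combine the user's search text with access-control filters.

    All field names here are hypothetical; they mirror the attributes
    stored on each document at ingestion time.
    """
    filters = [
        {"term": {"tenant_id": tenant_id}},                  # restrict to the enterprise
        {"terms": {"organization_id": sorted(org_ids)}},     # organizations the user can see
        {"terms": {"object_type": sorted(object_types)}},    # object types the user may read
        {"terms": {"allowed_roles": sorted(roles)}},         # document lists at least one user role
    ]
    return {
        "query": {
            "bool": {
                "must": [
                    {"multi_match": {"query": text, "fields": ["title", "summary", "description"]}}
                ],
                "filter": filters,
            }
        }
    }


query = build_search_query(
    "quarterly report",
    tenant_id="t1",
    org_ids={"org-7", "org-3"},
    object_types={"Document"},
    roles={"viewer"},
)
```

Putting the restrictions in the filter clause rather than in must keeps them out of relevance scoring, so the access controls narrow the result set without changing the ranking.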

Initial Migration and Ongoing Updates: Publishing data to the ES cluster happens under two scenarios:

Initial migration: When the search feature is enabled for an existing tenant, the various modules publish their existing data to a Kafka topic set up for initial migration. This operation doesn’t apply to new tenants, as they have no data to migrate.
Ongoing updates: All tenants that have Search enabled will have ongoing updates that need to be published to the ES cluster. A separate Kafka topic is set up for ongoing updates. All the data that is sent is required to follow the REST contract provided by the data ingestion library.

Concurrency issues can arise during updates to ES clusters. This is addressed in two ways. First, each publish to the ES cluster is a full document update that replaces the previous version of the document. Second, the ES-provided ‘optimistic locking’ feature is used. The feature is defined in [6] as: “Instead of acquiring a lock every time, you tell ElasticSearch what version of the document you expect to find. If the document didn’t change in the meantime, your operation succeeds, lock free. If something did change in the document and it has a newer version, ElasticSearch will signal it to you so you can deal with it appropriately.”

In our application we used the epoch value as the external version number. The epoch time of each publish was set at the time the user published the document. With external versioning, ES accepts a write only if its version number is higher than the one already stored, so out-of-order updates to a document are rejected.
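The behaviour can be sketched with a small in-memory stand-in for an index. This is not the ES client; it only mimics the external-version rule (a write is accepted only when its version is strictly greater than the stored one). Against a real cluster, the equivalent request would pass `version=<epoch>` and `version_type=external` on the index call. The class and function names are hypothetical.

```python
class VersionConflict(Exception):
    """Raised when an incoming version is not newer than the stored one."""


class FakeIndex:
    """Tiny in-memory model of an index that enforces external versioning."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, source, version):
        current = self._docs.get(doc_id)
        if current is not None and version <= current[0]:
            raise VersionConflict(f"{doc_id}: version {version} <= stored {current[0]}")
        # Full-document replace: the previous source is discarded entirely.
        self._docs[doc_id] = (version, source)

    def get(self, doc_id):
        return self._docs[doc_id][1]


def publish(index, doc_id, source, published_at_ms):
    """Publish a document using its publish time (epoch millis) as the version.

    Returns True if the write was applied, False if it was stale.
    """
    try:
        index.index(doc_id, source, version=published_at_ms)
        return True
    except VersionConflict:
        return False


idx = FakeIndex()
newer_applied = publish(idx, "doc-1", {"title": "second edit"}, published_at_ms=2000)
stale_applied = publish(idx, "doc-1", {"title": "first edit"}, published_at_ms=1000)
```

Here the later edit arrives first and is stored; the earlier edit then arrives late and is dropped, so the index ends up holding the most recent publish regardless of delivery order.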

Multi-Language Support: Most SaaS applications span multiple countries and support multiple languages. There are several approaches to implementing multi-language support, as listed in [7]. In this implementation, I chose to use a separate field for each language within an index. For our application, search would primarily be performed on three fields: Title, Summary and Description. This increases the number of fields; however, most of the language-specific fields are empty for a given document, so the impact on performance is minimal.

ES supports using an ‘ingest processor’ for automatic detection and copying of the data to the appropriate language field. However, in our application, we decided to use the platform language setting field as the default language of the user entered text and copy that data to the appropriate field title_{langcode} as part of data ingestion. ES would use the appropriate analyzer specified for the field to create the ‘inverted index’ necessary for search. As part of user search, the ‘search service’ would query ES against the title_{langcode} field and provide the results to the user.
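A minimal sketch of the field-per-language scheme follows. The three language codes are an assumption for illustration; "english", "french" and "german" are built-in ES language analyzers. The helper names are hypothetical.

```python
# Assumed set of supported languages, mapped to built-in ES analyzers.
ANALYZERS = {"en": "english", "fr": "french", "de": "german"}
SEARCH_FIELDS = ("title", "summary", "description")


def build_mappings():
    """Index mappings with one analyzed text field per (field, language) pair."""
    props = {}
    for field in SEARCH_FIELDS:
        for lang, analyzer in ANALYZERS.items():
            props[f"{field}_{lang}"] = {"type": "text", "analyzer": analyzer}
    return {"mappings": {"properties": props}}


def localize(doc, lang):
    """Move user-entered text into the fields for the platform language setting."""
    if lang not in ANALYZERS:
        raise ValueError(f"unsupported language code: {lang!r}")
    out = {k: v for k, v in doc.items() if k not in SEARCH_FIELDS}
    for field in SEARCH_FIELDS:
        if field in doc:
            out[f"{field}_{lang}"] = doc[field]
    return out


def build_query(text, lang):
    """Search only the fields that match the user's language."""
    return {"query": {"multi_match": {"query": text, "fields": [f"{f}_{lang}" for f in SEARCH_FIELDS]}}}


mappings = build_mappings()
localized = localize({"id": 1, "title": "Bonjour"}, "fr")
query = build_query("rapport", "fr")
```

Because the language is decided at ingestion from the platform setting, the same code is used to pick the field at write time and at query time, which keeps the analyzer consistent on both sides.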

Encryption of Data at Rest and in Transit: Data encryption at rest is implemented by using a set of encryption keys shared between all nodes of an ElasticSearch cluster [8]. The encryption key is provided at the time of deployment, and all nodes in a given ES cluster must share the same key. At the time of writing, ES doesn’t have a mechanism for a SaaS tenant to use their own keys, i.e. ‘customer managed key’ support is not present.

ES supports enabling TLS for encrypting traffic. This can be enabled via configuration by setting up key stores and configuring either a trusted SSL certificate or a custom Certificate Authority (CA) [9].

Provisioning & Operations:

Provisioning: ES is available as a resource in Microsoft Azure and can be provisioned via Terraform or other scripting tools. As part of the provisioning, ES credentials, initial keys for encryption, certificates and the Kibana dashboard need to be set up. A private link between the application and ES can be established to provide additional security for data in transit.
Index aliasing: As a system evolves and data storage needs increase, there will be a need to ‘reindex’, either to adjust the number of shards for the index or to move the index to an ES cluster with more capacity. Applications therefore access the data via an index alias [11] rather than the index name. When a new index is created and the current index data is migrated over to it, the alias is pointed at the new index and the application seamlessly starts accessing the data from the new index.
Monitoring: Monitoring the health of the search cluster can be done by another ES cluster. All the monitoring data and logs from the search cluster can be sent to the monitoring ES cluster. Kibana provides easy visualization of the collected monitoring data.
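For the index aliasing item above, a reindex followed by an alias switch can be sketched as two request bodies: one for the ES `_reindex` endpoint and one for the `_aliases` endpoint. The index and alias names are hypothetical, and the helper functions only build the JSON payloads; sending them is left to whichever client is in use.

```python
def reindex_body(old_index, new_index):
    """Body for POST /_reindex: copy all documents from the old index to the new one."""
    return {"source": {"index": old_index}, "dest": {"index": new_index}}


def alias_swap_body(alias, old_index, new_index):
    """Body for POST /_aliases: repoint the alias in a single request.

    Both actions are applied together, so readers never see the alias
    pointing at no index.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: the application only ever queries "search-documents".
copy_request = reindex_body("search-documents-v1", "search-documents-v2")
swap_request = alias_swap_body("search-documents", "search-documents-v1", "search-documents-v2")
```

Since the application addresses only the alias, the change in shard count or backing index is invisible to the Search Service and the ingestion consumer.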

ES Cluster Cost Estimates: The initial deployment and ongoing operational cost of a solution can be a significant factor in the selection of the technology stack. For the solution I worked on, two form factors of the ES cluster were used, medium and large, at an annual cost of about $3000 and $5000 respectively. The total number of clusters for each environment varied based on the number of tenants and the use case.
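As a rough illustration of how the per-environment cost adds up, the snippet below multiplies cluster counts by the approximate annual prices quoted above. The cluster counts in the example are purely hypothetical; actual counts depended on tenants and use case.

```python
# Approximate annual prices per form factor, as quoted in the text above.
ANNUAL_COST_USD = {"medium": 3000, "large": 5000}


def annual_cost(cluster_counts):
    """Sum the annual cost for a dict of form factor -> number of clusters."""
    return sum(ANNUAL_COST_USD[size] * count for size, count in cluster_counts.items())


# Hypothetical environment: three medium clusters and one large cluster.
example_total = annual_cost({"medium": 3, "large": 1})
```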

Conclusion:

An enterprise search feature is seen as a ‘must have’ for SaaS platforms. It can improve the overall User Experience of the platform and make it easier for users to find related documents stored across modules. ElasticSearch provides an excellent framework for enabling enterprise search. In this article, I went over the design and architectural aspects that were required on the SaaS platform to support Enterprise Search: access control, data migration, language support, provisioning and operations support, and cost calculation. These architectural areas are fairly generic; with minor customization, the approach described in this article can work for most SaaS platforms.

Ramesh Krishnamurthy, PhD
