Localized Search Implementation with Elasticsearch

Is Elasticsearch a good fit for localized search? Let’s check. It is a search engine that provides most standard search features out of the box: fast search over indexed documents, relevance scoring based on configurable formulas, autocomplete, context suggestions, analyzer-based matching of localized text, and more.

In this post I discuss how to implement localized search for less widely supported languages, whether or not Elasticsearch (ES) ships an analyzer for them, and how to get good results (for starters), if not the best.

I will use Node.js and ES as the technical stack. Let’s define some standard types for our index schema. I consider three cases here:

  1. English Analyzer
  2. Hindi Analyzer (ships with ES; see: Language Analyzers)
  3. Standard Analyzer (use it if your language does not have a built-in analyzer in ES)


const standardTypes = {
  'KEYWORD_ASCII': {
    'type': 'keyword',
    'ignore_above': 256
  },
  'STANDARD_TEXT': {
    'type': 'text',
    'analyzer': 'standard'
  },
  'ENGLISH_TEXT': {
    'type': 'text',
    'analyzer': 'english'
  },
  'HINDI_TEXT': {
    'type': 'text',
    'analyzer': 'hindi'
  }
};
module.exports = standardTypes;

We need to define the schema so that it supports all the standard types. I have chosen three languages to demonstrate search: English, Hindi (native to India), and Telugu (a regional South Indian language with no default analyzer in ES).


'use strict';
…….
"descriptors": {
  "properties": {
    "english": {
      "properties": {
        "description": standardTypes.ENGLISH_TEXT,
        "name": standardTypes.ENGLISH_TEXT
      }
    },
    "hindi": {
      "properties": {
        "description": standardTypes.HINDI_TEXT,
        "name": standardTypes.HINDI_TEXT
      }
    },
    "telugu": {
      "properties": {
        "description": standardTypes.STANDARD_TEXT,
        "name": standardTypes.STANDARD_TEXT
      }
    }
  }
}
..
module.exports = schema;


We put Telugu under the standard analyzer because it is based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29, and works well for most languages. The simple analyzer is another option: it divides text at every character that is not a letter and lowercases the resulting terms.
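
One way to compare how the two analyzers split Telugu text is the ES `_analyze` API, which returns the tokens an analyzer produces for a piece of text. The sketch below only builds the request bodies; the helper name `buildAnalyzeRequests` and the sample phrase are my own illustrations, and actually sending the requests to a cluster is left out.

```javascript
// Build one _analyze request per analyzer so their token output can be compared.
// The helper name and the sample text are illustrative, not part of the original setup.
function buildAnalyzeRequests(text, analyzers) {
  return analyzers.map((analyzer) => ({
    method: 'POST',
    path: '/_analyze',
    body: { analyzer, text }
  }));
}

// "తెలుగు భాష" means "Telugu language".
const requests = buildAnalyzeRequests('తెలుగు భాష', ['standard', 'simple']);
console.log(JSON.stringify(requests, null, 2));
```

Each response lists the tokens with their offsets, which makes it easy to see where an analyzer breaks words in a script it was not tuned for.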

Now we have a schema defined. Next, create an index with the schema and populate it with related documents. I am not sharing the actual documents I used for testing, but text resources to populate an index are easy to find online. From Node.js, one can use the ES client for Node.js or, more simply, the ES REST API.
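
As a rough sketch of the REST route, the two payloads involved look like this. The index name `products`, the sample documents and both helper names are assumptions made for illustration, and the mapping body follows the typeless format of ES 7 and later.

```javascript
// Sketch: request payloads for creating an index and bulk-loading documents
// through the ES REST API. Names and sample data here are hypothetical.

// PUT /<index> takes the field definitions under "mappings.properties"
// (ES 7+ style, without a mapping type name).
function buildCreateIndexBody(properties) {
  return { mappings: { properties } };
}

// POST /_bulk expects newline-delimited JSON: an action line followed by the
// document source, and the payload must end with a newline.
function buildBulkBody(indexName, docs) {
  const lines = [];
  for (const doc of docs) {
    lines.push(JSON.stringify({ index: { _index: indexName } }));
    lines.push(JSON.stringify(doc));
  }
  return lines.join('\n') + '\n';
}

const sampleDocs = [
  { descriptors: { english: { name: 'Mango', description: 'A sweet summer fruit' } } },
  { descriptors: { hindi: { name: 'आम', description: 'गर्मियों का मीठा फल' } } }
];

const bulkBody = buildBulkBody('products', sampleDocs);
console.log(bulkBody);
```

The bulk payload is sent with the `Content-Type: application/x-ndjson` header; the create-index body is plain JSON.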

A whole variety of searches can be performed on documents with the above schema, with each field using its own analyzer. [ Full-Text Queries in ES ]
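
For instance, a single `multi_match` query can search the name and description fields of all three languages at once. This is a minimal sketch under my own assumptions: the helper name `buildLocalizedQuery` and the `^2` boost on names are arbitrary choices, not something tuned during testing.

```javascript
// Sketch: a multi_match query across the per-language fields of the schema.
// At search time each text field uses the analyzer set in its mapping,
// unless a separate search analyzer is configured.
const LANGUAGES = ['english', 'hindi', 'telugu'];

function buildLocalizedQuery(text, languages = LANGUAGES) {
  const fields = [];
  for (const lang of languages) {
    // "^2" boosts a match in the name over a match in the description (arbitrary value).
    fields.push(`descriptors.${lang}.name^2`);
    fields.push(`descriptors.${lang}.description`);
  }
  return { query: { multi_match: { query: text, fields } } };
}

// Body for POST /<index>/_search
const searchBody = buildLocalizedQuery('आम');
console.log(JSON.stringify(searchBody, null, 2));
```

Restricting `languages` to the user's locale narrows the search to the matching fields only.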

I was able to get great search results for English and Hindi, and the results for Telugu were not far behind. The ease with which one can create a near real-time search engine is remarkable. I have not gone into the technical details of analyzers and how they work by combining the appropriate character filters, tokenizer, and token filters. The standard analyzer can be expected to give merely acceptable results; it is only a starting point. To get more accurate results for a regional language, an Elasticsearch user should implement a full-fledged custom analyzer. Moreover, ES provides a few add-ons for Asian languages such as Korean and Chinese.
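
To give an idea of what a custom analyzer looks like, here is a minimal sketch that combines built-in character filters, a tokenizer and token filters. The names `telugu_custom` and `telugu_stop` are hypothetical, and the three stop words are only placeholders; a usable list would have to be curated for the language.

```javascript
// Sketch: index settings that define a custom analyzer for Telugu from
// built-in parts. Analyzer/filter names and the stop-word list are placeholders.
const teluguAnalysis = {
  analysis: {
    filter: {
      telugu_stop: {
        type: 'stop',
        stopwords: ['మరియు', 'ఈ', 'ఆ'] // placeholder list, not a curated one
      }
    },
    analyzer: {
      telugu_custom: {
        type: 'custom',
        char_filter: ['html_strip'],              // character filter: strips HTML markup
        tokenizer: 'standard',                    // tokenizer: UAX #29 word boundaries
        filter: ['decimal_digit', 'telugu_stop']  // token filters, applied in order
      }
    }
  }
};

// A field type in the style of standardTypes that uses the custom analyzer.
const TELUGU_TEXT = { type: 'text', analyzer: 'telugu_custom' };

// Body for PUT /<index>: analysis settings are usually supplied at index creation.
const createIndexBody = {
  settings: teluguAnalysis,
  mappings: { properties: { name: TELUGU_TEXT } }
};
console.log(JSON.stringify(createIndexBody, null, 2));
```

A stemmer or normalization step for the language would be the next thing to add, but that is beyond what I tested here.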

So, we can conclude that Elasticsearch is indeed a great way to boost your product’s localization and accessibility, with a small time investment and a high return.


GSoC 2017: Charmap Integration

These three awesome months of summer, spent developing for LibreOffice under Google Summer of Code, have filled me with great zeal and zest. A plethora of important additions were made to the software under the project titled “Usability of Special Characters”, and these new features will be available in version 6.0 of LibreOffice (Release Notes for 6.0). Here is a glimpse of what users will receive in the new update.

Note: Please zoom in on the web page or open the GIFs in a new tab if the character grid is not clearly visible.

Screenshot from 2017-08-22 21-21-46.png

Special Characters in LibreOffice Master

 

‣ Search functionality via generic code point name

search2.gif

Glyph name properties have been introduced to LibreOffice using the API provided by International Components for Unicode (ICU). The program identifies glyphs by the names ICU provides and then displays the search results. A dedicated display label shows the glyph’s Unicode name.
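
The idea behind the name search can be sketched in a few lines. This is not the LibreOffice code, which is C++ and takes its names from ICU; the tiny hand-written table below stands in for that lookup, and the case-insensitive substring match is my assumption about a reasonable matching rule.

```javascript
// Sketch only: a small sample table stands in for the ICU name lookup.
const SAMPLE_NAMES = new Map([
  [0x00A9, 'COPYRIGHT SIGN'],
  [0x00AE, 'REGISTERED SIGN'],
  [0x20AC, 'EURO SIGN'],
  [0x2122, 'TRADE MARK SIGN']
]);

// Return every character whose Unicode name contains the search term.
function searchByName(term, names = SAMPLE_NAMES) {
  const needle = term.trim().toUpperCase();
  const hits = [];
  for (const [codePoint, name] of names) {
    if (needle !== '' && name.includes(needle)) {
      hits.push({ char: String.fromCodePoint(codePoint), codePoint, name });
    }
  }
  return hits;
}

console.log(searchByName('sign').map((hit) => hit.char).join(' '));
console.log(searchByName('euro'));
```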

‣ Inter-font dynamic glyph search

inter-font search.gif

This is as simple as it could be made: a user can now type the name of a glyph and scroll between fonts until the desired results are shown.

‣ Recently Used Characters and Favorite Characters

recent_special.png

‣ Toolbar Dropdown control for Quick Access!

To provide quick access to the Recent and Favorite character lists described above, a toolbar dropdown control has been developed. It is meant to replace the toolbar button that opens the special character dialog in the current LibreOffice 5.3 release.

ToolbarDropdown.gif

The GIF below shows how easily a user can find the desired symbols and pin them for quick access in the future.

favorites.gif

‣ Context-menu and Mouse click controls for easier interaction

recent.gif

Links to the major patch submissions:

‣ Glyph View and Recent Characters Control in Special Characters dialog https://cgit.freedesktop.org/libreoffice/core/commit/?id=710a39414569995bd5a8631a948c939dc73bcef9

‣ Favourites feature in Special characters https://cgit.freedesktop.org/libreoffice/core/commit/?id=f9efee1f87262b0088c249b2c306fb53ca729b53

‣ Special Characters Toolbar Dropdown Control https://cgit.freedesktop.org/libreoffice/core/commit/?id=800ac37021e3f8859a52c5eebca261a5d3bc5a11

‣ Unicode Character Names Integration using ICU https://cgit.freedesktop.org/libreoffice/core/commit/?id=43d65d1ab81a278e1352f64def9ca63b9e7dfab9

‣ Search feature for Special Characters https://cgit.freedesktop.org/libreoffice/core/commit/?id=e74be9ad773c7769c5d8765bb2ac234967e420ec

I was mentored by Samuel Mehrbrodt, Heiko Tietze, and Thorsten Behrens in GSoC 2017. I would like to thank the LibreOffice community, which helped me through the deadlocks I faced during the project. It has been an awesome two-year journey with LibreOffice. I hope it stays that way in the future, and that open-source technologies flourish and reach their full potential.

Usability of Special Characters: GSoC 2017

Woah, Google Summer of Code with LibreOffice ( x2 ). This time, I’ll be improving and reworking the Special Characters feature in LibreOffice and adding some enhancements to it. I will be mentored by Samuel Mehrbrodt, Thorsten Behrens, and Heiko Tietze. I’ll summarize all the proposed changes for the project in this blog.

The Idea

  • Create a way to quickly re-use recently picked special characters, so the user does not have to search the whole character map, which has no filter to narrow down results.
  • Allow users to create their own ‘Special Characters’ subset (Individualization)
  • Sorting by last in, first out; items from the list of recently used characters are sorted to the beginning if selected.
  • Create a toolbar dropdown control to easily access recent symbols and the user-defined custom subset.
  • Have a preview along with the Unicode name.
  • Better UI for search (within font subsets) using Unicode name, hex and decimal code.
  • Different subsets within a font need a separation in the special character SvxShowCharSet custom widget.
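
The last in, first out ordering from the list above can be sketched as a small move-to-front list. This is an illustration in JavaScript, not the LibreOffice implementation; the class name, the capacity of 16 and the choice to key entries on character plus font are all my assumptions.

```javascript
// Sketch: a recently-used list where the newest pick goes to the front and
// re-selecting an existing entry moves it back to the front.
// The default capacity of 16 is an assumed value.
class RecentCharacters {
  constructor(capacity = 16) {
    this.capacity = capacity;
    this.items = []; // entries are { char, font }, most recent first
  }

  select(char, font) {
    // Drop any existing entry for the same character and font...
    this.items = this.items.filter((e) => !(e.char === char && e.font === font));
    // ...then put it at the front and trim the tail.
    this.items.unshift({ char, font });
    if (this.items.length > this.capacity) {
      this.items.length = this.capacity;
    }
    return this.items;
  }
}

const recent = new RecentCharacters(3);
recent.select('€', 'Liberation Serif');
recent.select('©', 'Liberation Serif');
recent.select('€', 'Liberation Serif'); // moves € back to the front
console.log(recent.items.map((e) => e.char).join(''));
```

The same structure could back both the dialog and the toolbar dropdown, since both only need the list in most-recent-first order.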
spclchar

Finalized enhancements for the dialog

Proposal for the toolbar dropdown for quick access to favorites and recently used characters.

spclchar2

Design for the toolbar dropdown.

A lot of challenges need to be addressed while working on this project. It’s about time to play with Unicode data and custom-widgets.

For other queries and discussions, please comment or ping me (IRC nick: Akki) on the libreoffice-dev / libreoffice-design channels on Freenode.