Multilingual Search API with Entity Translation

Lost in translation

While working recently on a client's project, I found that there was no good way of indexing Entity Translation-based multilingual content in Solr using Search API in combination with Search API Solr search.

The Situation

All that was available was the initial Search API Entity Translation module, offering

a minimalist approach of making multilingual content managed via entity translations searchable via Search API, (...) offering a new search API field named "Multilingual full text" which concatenates all the entity translations of a specific entity, thus making it possible to search in any content translation/language.

As you might imagine, this could work in some very basic, undemanding situations. As soon as you want to introduce some level of complexity, though (and language-aware search already counts as a fairly high level of complexity here), this solution does not provide the expected results.

There were also a few sandboxes flirting with similar concepts out there in the wild:

  • GaëlG's Entity translation search API - an alternative way of getting language-aware search with entity translation by having one index per language (which, combined with the Search API multi-index searches module, could be a working solution),
  • drunken monkey's Search API Entity Translation v2 - a fork of the Search API Entity Translation module above, more advanced although still pretty minimalist, providing separate new entity properties/Search API fields for each translatable field on each translatable entity,

with the latter being the most promising, as it introduced a new datasource controller and thus allowed for a completely new approach to storing multilingual data in Search API indexes.

Unfortunately, it did not work, being more a proof of concept than a working solution, and it still required quite a lot of work to make it do what it was supposed to do (which was, nota bene, quite clearly stated by drunken monkey in one of his comments on the Decide on strategy for language aware search issue).

Speaking of issues - as you might imagine, being able to index multilingual content translated using Entity Translation was a feat which many a developer was looking forward to. These example issues were asking for this exact functionality:

of which the last one still seems to be the most active.

Obviously this situation needed to change.

Search API Entity Field Translation

The first, temporary step towards a new and better tomorrow was the Search API Entity Field Translation sandbox, which forked the first version of the Search API Entity Translation module.

The difference between those two modules was that the older Search API Entity Translation simply offered one new Search API field concatenating all the entity translations of a specific entity (which means that translations/languages of content cannot be properly distinguished in the search), while the new Search API Entity Field Translation provided new entity properties - and thus new Search API fields - for each translatable field of a specific entity.

These new Search API fields (for example, for a translatable title field on a site with ja and hk languages enabled, new ja_title and hk_title Search API fields would be provided) could then be used as Solr dynamic fields with field types specific to a given language (which, additionally, opens up the possibility of having different field options and/or analyzers for each of those languages).
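To make the naming scheme concrete, here is a minimal sketch of how such per-language field names could be derived. It is written in Python purely for illustration and is not the module's actual (PHP) code; the function name translated_field_names is made up, and the only thing taken from the description above is the <language>_<field> pattern.

```python
def translated_field_names(fields, languages):
    """Map each translatable field to its per-language field names.

    Hypothetical helper: follows the "<language>_<field>" pattern
    described in the text (e.g. ja_title, hk_title).
    """
    return {
        field: [f"{lang}_{field}" for lang in languages]
        for field in fields
    }


names = translated_field_names(["title"], ["ja", "hk"])
print(names)  # {'title': ['ja_title', 'hk_title']}
```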

As already mentioned, though, this module was only a temporary step that did not really solve the existing problems, and as such it will most probably be deleted completely soon.

Search API Entity Translation v2b

What was really needed was taking drunken monkey's sandboxed initial draft of Search API Entity Translation v2 and moving on with its development, adding all missing features and simply making sure that it works.

Initially I just created issues in his issue queue for each newly added feature, but I pretty quickly decided that it would be better to fork his sandbox and commit all the changes there instead, which would keep everything in one place and make testing easier for anyone interested. This is how Search API Entity Translation v2b was born.

New features added there (using drunken monkey's comment in Decide on strategy for language aware search issue as a starting point) included:

  • indexing in the correct language,
  • Entity Translation CRUD hook implementations,
  • hook_disable() and hook_uninstall() implementations (these were eventually removed once the Search API module issue Fix reaction to disabled modules was closed),
  • hook_features_export_alter() implementation,
  • missing module dependency: entity_translation,
  • admin UI for the index settings,
  • re-queueing multilingual indexes whose "Languages to be included in the index" setting is set to "completed entity languages" on each translation add/update/delete, translatable field add/delete, and language add/delete/enable/disable (more on this below).

Re-queueing "completed" indexes

When an index's "Languages to be included in the index" setting is set to "completed entity languages" (better value names for this drop-down, anyone?), then:

  • each field translation insert/delete (so, essentially, each translation update),
  • each field instance add/delete,
  • each language enable/disable

triggers a check of whether items should be added to or deleted from such an index and, if that's the case, the index is automatically re-queued and marked for re-indexing.

Why?

Let's illustrate it with an example: imagine there are 2 languages enabled on the site, the translatable Article node type contains 2 translatable fields, and all fields are translated into all languages (so index items were created for those entities).

Now, if we add and enable a new language, or add a new translatable field to such an entity, from this moment on all previously fully translated Article nodes are not fully translated anymore, as they are missing translations for the new field or for the new language. They therefore should not be in the index anymore, hence the need for re-queueing and re-indexing afterwards.
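The reasoning above can be sketched in a few lines. This is a simplified Python illustration and not the module's real implementation: it assumes an entity counts as "completed" only when every translatable field has a non-empty value in every enabled language, and the function names and the shape of the data are invented for the example.

```python
def is_fully_translated(translations, fields, languages):
    """translations has the (assumed) shape {field: {language: value}}."""
    return all(
        bool(translations.get(field, {}).get(lang))
        for field in fields
        for lang in languages
    )


def completed_items(entities, fields, languages):
    """IDs of entities that qualify for a "completed" index."""
    return {
        entity_id
        for entity_id, translations in entities.items()
        if is_fully_translated(translations, fields, languages)
    }


def needs_requeue(entities, old_config, new_config):
    """True when a field/language change alters the set of indexable items."""
    return completed_items(entities, *old_config) != completed_items(entities, *new_config)


# One Article node, 2 translatable fields, 2 languages, everything translated.
articles = {
    1: {
        "title": {"en": "Hello", "fr": "Bonjour"},
        "body": {"en": "Some text", "fr": "Du texte"},
    }
}
before = (["title", "body"], ["en", "fr"])
after = (["title", "body"], ["en", "fr", "de"])  # a new language gets enabled

print(completed_items(articles, *before))      # {1}
print(completed_items(articles, *after))       # set()
print(needs_requeue(articles, before, after))  # True
```

Comparing the two sets, rather than re-queueing unconditionally, mirrors the "if that's the case" part of the description: an index is only marked for re-indexing when the change actually affects which items belong in it.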

Search API Entity Translation Solr Search

In addition to the new version of the Search API Entity Translation module, a new extension module for Search API Solr search was also required to make Solr work properly with multilingual content.

This resulted in another sandbox - the Search API Entity Translation Solr search module, which changes the way in which the Solr search module stores multilingual content, making it use Solr dynamic fields for translatable entity fields.

For example, when indexing the body:value field, Solr would normally store its value in the tm_body:value field. With this module (assuming that the body field is translatable and comes from a translatable entity, and that the language of the content currently being processed is fr) it will be stored in tm_fr_body:value instead, with nothing saved to the default tm_body:value field.
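As a rough illustration of this renaming (a Python sketch, not the module's PHP code), the language code can be slotted in right after the Solr type prefix. The function name language_solr_field and its parameters are hypothetical; only the tm_body:value to tm_fr_body:value mapping comes from the example above.

```python
def language_solr_field(solr_field, language, translatable=True):
    """Insert the language code after the type prefix of a Solr field name.

    Non-translatable fields, or fields without a "<prefix>_" part,
    are returned unchanged.
    """
    if not translatable or not language:
        return solr_field
    prefix, separator, name = solr_field.partition("_")
    if not separator:
        return solr_field
    return f"{prefix}_{language}_{name}"


print(language_solr_field("tm_body:value", "fr"))                      # tm_fr_body:value
print(language_solr_field("tm_body:value", "fr", translatable=False))  # tm_body:value
```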

This makes it possible to use different data types for different language-based Solr fields, and thus to configure different tokenizers, stemmers, spell checkers, stop words, protected words, etc. for each language separately.

Obviously it also works in the other direction: when retrieving data from Solr, the module looks for language-based dynamic fields and, if they exist, assigns their values back to the relevant Search API field (which is not language-based anymore).
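The reverse mapping could look roughly like the sketch below. Again this is Python for illustration only, under the assumption that a language-based field is recognised by a known language code sitting right after the type prefix; restore_fields and the sample document are invented for the example, and a field whose own name happens to start with a language code would need extra care.

```python
def restore_fields(document, languages):
    """Fold language-based Solr fields back into their generic names."""
    restored = {}
    for key, value in document.items():
        prefix, sep1, rest = key.partition("_")
        lang, sep2, name = rest.partition("_")
        if sep1 and sep2 and lang in languages:
            # Language-based variant: its value wins over any default field.
            restored[f"{prefix}_{name}"] = value
        else:
            # Keep other fields, without overwriting a restored value.
            restored.setdefault(key, value)
    return restored


solr_doc = {"tm_fr_body:value": ["Bonjour"], "ss_type": "article"}
print(restore_fields(solr_doc, ["en", "fr"]))
# {'tm_body:value': ['Bonjour'], 'ss_type': 'article'}
```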

Additionally, the module modifies Solr search queries run on multilingual indexes, adding all possible multilingual field variants (or, if the query runs on specific languages only, the variants for those languages only) for every translatable field being searched. This means that when tm_title_field is searched on a site with en and fr languages enabled, the search query will automatically also include the tm_en_title_field and tm_fr_title_field fields in its qf parameter.
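A simplified sketch of that query-field expansion follows, in Python rather than the module's PHP. The function name expand_query_fields and its arguments are assumptions made for the example, and boosts (such as field^2.0) that a real qf parameter may carry are left out to keep it short.

```python
def expand_query_fields(query_fields, translatable, site_languages, query_languages=None):
    """Add language-based variants for every translatable field in qf.

    If the query is restricted to specific languages, only variants for
    those languages are added; otherwise all site languages are used.
    """
    languages = query_languages or site_languages
    expanded = []
    for field in query_fields:
        expanded.append(field)
        if field in translatable:
            prefix, _, name = field.partition("_")
            expanded.extend(f"{prefix}_{lang}_{name}" for lang in languages)
    return expanded


qf = expand_query_fields(["tm_title_field"], {"tm_title_field"}, ["en", "fr"])
print(" ".join(qf))  # tm_title_field tm_en_title_field tm_fr_title_field
```

Passing query_languages=["fr"] would add only tm_fr_title_field, which corresponds to the "specific languages only" case mentioned above.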

Where do we go from here?

At the moment it seems that both modules are in a more or less complete state (although there are still 2 open issues for Search API Entity Translation v2b which I have not been able to replicate locally so far, and both are waiting for more information).

Also, Search API Entity Translation Solr search still requires some manual schema configuration to add the relevant fieldTypes and dynamicFields for all enabled languages; for the moment, though, I consider that to be out of scope for this module. (However, it might be an idea for another contrib module that automatically generates config files based on enabled languages and indexed fields. Anyone up for the task?)

Once it is confirmed that both modules work fine and meet their initial expectations, Search API Entity Translation v2b will most probably be merged back into the original Search API Entity Translation as its version 2, while Search API Entity Translation Solr Search will either be added to Search API Entity Translation v2b as a sub-module or promoted to a stand-alone full project.

How can I help?

Before this happens, though, they need to be properly tested in as many scenarios as possible - and this is where you could help. Give them a try, see if they work in your environment, verify that you see the expected results... try to break them!

You might also have some other cool features on your mind which you believe these modules are missing? Don't hesitate to give me a shout then too! (And obviously create an issue.)

Links