Tag-style exact-matching multiple strings when searching with Examine in Umbraco

Searching for exact strings in Examine can be a little counter-intuitive. Couple that with the confusion of multiple values (potentially using the Repeatable text strings or tags data types) and the inconsistencies between the backoffice tooling and the fluent API, and you've got a right potential mess on your hands!

By Joe Glombek

The situation is this: I have some content nodes representing Meetup groups which have a repeatable text string field on them to list the organisers of the meetup. In my case, these are people's names but they could just as easily be blog categories. This content needs to be filterable using an Examine search query. In my case, I want to list out meetups which are run by "Joe Bloggs", but again, this could easily be listing out blog posts in each tag.

I'd have the same issue if these repeatable text strings used the tags data type instead, but a slightly different one if they were picked content (indexing the IDs or UDIs makes that a lot easier!).

How an Examine search usually works

A typical search function may look like this:

// Get the index
if (!_examineManager.TryGetIndex("ExternalIndex", out var index))
{
    return (Enumerable.Empty<MeetupDetailPage>(), 0);
}

// Create a query (some extension methods are from the Search Extensions package)
IBooleanOperation query = index
    .Searcher
    .CreatePublishedQuery()
    .And().NodeTypeAlias(MeetupDetailPage.ModelTypeAlias)
    .And().ParentId(parentId)
    ;

if (organisers?.Any() == true)
{
    query.And().GroupedAnd("organisers".AsEnumerableOfOne(), organisers);
}

var res = query.Execute();
// etc.

If I run the above code, the Lucene query generated will be something along the lines of:

+__Published:y -umbracoNaviHide:1 -templateID:0 +__NodeTypeAlias:meetupdetailpage +(parentID:[1000 TO 1000]) +organisers:Joe Bloggs

This will return groups organised by "Joe Smith" and "Jane Bloggs" as well as the desired "Joe Bloggs"-organised endeavours!
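To see why, here's a minimal sketch that doesn't touch Examine or Lucene at all. It assumes a much-simplified analyser (lower-case, split on whitespace) and treats an unquoted multi-word query as "match any of these terms". The meetup names and the Tokenise helper are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified stand-in for an analyser: lower-case and split on whitespace.
// (Assumption: the real Lucene analyser does more than this.)
static string[] Tokenise(string value) =>
    value.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);

// Hypothetical organiser values, one per meetup.
var meetups = new Dictionary<string, string>
{
    ["Meetup A"] = "Joe Bloggs",
    ["Meetup B"] = "Joe Smith",
    ["Meetup C"] = "Jane Bloggs",
    ["Meetup D"] = "Mary Jones",
};

// The query is split into terms too, and a meetup matches if it shares any term.
var queryTerms = Tokenise("Joe Bloggs");

var matches = meetups
    .Where(m => Tokenise(m.Value).Intersect(queryTerms).Any())
    .Select(m => m.Key)
    .OrderBy(k => k)
    .ToList();

Console.WriteLine(string.Join(", ", matches)); // Meetup A, Meetup B, Meetup C
```

Only "Mary Jones" is left out, because she shares neither "joe" nor "bloggs" with the query.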

Exact match

Examine gives me the option of specifying I want this to match exactly, by using the .Escape() extension method:

query.And().GroupedAnd("organisers".AsEnumerableOfOne(), organisers.Select(x => x.ToLower().Escape()).ToArray());

Since the "terms" within the values in the index will be lower-cased, we also have to lower-case the search query.

However, this also has some downsides. If I have a "Where American Politics meets Umbraco" meetup organised by "Elizabeth Warren" and "Warren Buckley", searching for "Warren Warren" (cruel parents, I know!) would return that meetup as well, since by default Examine strips out whitespace.

The value is stored as a series of tokens: elizabeth, warren, warren and buckley. The .Escape() extension ensures the tokens appear in the specified sequence, but not that they belong to the same name.

Examine will also strip out "stop words" - filler words in a common search phrase - such as "of", "and" and "the". This means "The Umbraco Rabbit of Codegarden Fame" gets tokenized as umbraco, rabbit, codegarden, fame, which an exact query will never find!
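Both problems can be reproduced with a small standalone sketch. It is a simplified model, not Lucene's actual analyser: the stop word list is an abbreviated assumption, and Tokenise and ContainsPhrase are hypothetical helpers written for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed, abbreviated stop word list for illustration only.
var stopWords = new HashSet<string> { "the", "of", "and", "a" };

// Simplified stand-in for an analyser: lower-case, split on whitespace, drop stop words.
List<string> Tokenise(string value) =>
    value.ToLowerInvariant()
        .Split(new[] { ' ', '\n' }, StringSplitOptions.RemoveEmptyEntries)
        .Where(t => !stopWords.Contains(t))
        .ToList();

// True when `phrase` appears as a contiguous run inside `tokens`.
static bool ContainsPhrase(IReadOnlyList<string> tokens, IReadOnlyList<string> phrase)
{
    for (var start = 0; start + phrase.Count <= tokens.Count; start++)
    {
        if (tokens.Skip(start).Take(phrase.Count).SequenceEqual(phrase))
        {
            return true;
        }
    }
    return false;
}

// Two organisers on separate lines end up as one flat run of tokens.
var organisers = Tokenise("Elizabeth Warren\nWarren Buckley");
Console.WriteLine(string.Join(", ", organisers)); // elizabeth, warren, warren, buckley

// The phrase straddles the boundary between the two names, so it still matches.
var falsePositive = ContainsPhrase(organisers, Tokenise("Warren Warren"));
Console.WriteLine(falsePositive); // True

// Stop words vanish from the stored tokens.
var rabbit = Tokenise("The Umbraco Rabbit of Codegarden Fame");
Console.WriteLine(string.Join(", ", rabbit)); // umbraco, rabbit, codegarden, fame
```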

None of that tokenisation nonsense here, thank you very much!

A cleaner way to store this value is to not allow Examine to split it into smaller tokens. Using the Raw value type, Examine will treat the name as an atomic Joe Bloggs.

To do this, we can make use of the IConfigureNamedOptions interface:

public class ConfigureExternalIndexOptions : IConfigureNamedOptions<LuceneDirectoryIndexOptions>
{
    private readonly ILoggerFactory _loggerFactory;

    public ConfigureExternalIndexOptions(ILoggerFactory loggerFactory)
    {
        _loggerFactory = loggerFactory;
    }
    public void Configure(string name, LuceneDirectoryIndexOptions options)
    {
        if (name is Constants.UmbracoIndexes.ExternalIndexName)
        {
            // Setting this value to raw saves each name as its own term, e.g. "The Umbraco Rabbit of Codegarden Fame", rather than "umbraco", "rabbit", "codegarden", "fame", which is the default.
            options.FieldDefinitions.AddOrUpdate(new FieldDefinition("organisers", FieldDefinitionTypes.Raw));
        }
    }

    public void Configure(LuceneDirectoryIndexOptions options)
    {
        // We don't need this bit
        throw new System.NotImplementedException();
    }
}

The downside here is that multiple values are now atomic too. That meetup organised by Elizabeth and Warren we mentioned earlier would be indexed as Elizabeth Warren Warren Buckley and would only be returned when searching for that exact phrase.

So we need to split these values into multiple atomic values.

Splitting the value into multiple values

Examine provides a very handy event called TransformingIndexValues that allows us to manipulate values entering the index.

To access this, we need to create a component, listen to the TransformingIndexValues event on the index provider, and split the value into its multiple parts.

public class MeetupIndexingComposer : ComponentComposer<MeetupIndexingComponent>
{
}

public class MeetupIndexingComponent : IComponent
{
    private readonly IExamineManager _examineManager;

    public MeetupIndexingComponent(IExamineManager examineManager)
    {
        _examineManager = examineManager;
    }
    public void Initialize()
    {
        if (!_examineManager.TryGetIndex(Umbraco.Cms.Core.Constants.UmbracoIndexes.ExternalIndexName,
                out var index))
        {
            return;
        }

        if (!(index is BaseIndexProvider indexProvider))
        {
            return;
        }

        indexProvider.TransformingIndexValues += IndexProviderOnTransformingIndexValues;
    }

    private void IndexProviderOnTransformingIndexValues(object sender, IndexingItemEventArgs e)
    {
        if (e.ValueSet.Category == IndexTypes.Content)
        {
            foreach (var value in e.ValueSet.Values)
            {
                if (value.Key == "organisers")
                {
                    var updatedValues = e.ValueSet.Values.ToDictionary(x => x.Key, x => x.Value.ToList());

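                    // Here we split by new line (either line-ending style), to split repeatable text strings. You might need to split by something else. Comma, maybe?
                    var rawValue = value.Value.FirstOrDefault()?.ToString() ?? string.Empty;
                    var values = rawValue.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);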
                    updatedValues["organisers"] = values.Cast<object>().ToList();

                    e.SetValues(updatedValues.ToDictionary(x => x.Key, x => (IEnumerable<object>)x.Value));
                }
            }
        }
    }

    public void Terminate()
    {
    }
}

(Thanks to Mac Love for pointing out how to save these values in Umbraco 10+!)

Our indexed values are now Joe Bloggs, or Elizabeth Warren and Warren Buckley as two separate values! Now, how do we search for them?

Searching against multiple values

This is now even simpler than before:

query.And().GroupedAnd("organisers".AsEnumerableOfOne(), organisers);

Which generates a query ending:

+organisers:Joe Bloggs

The Examine query parser will treat this nicely and only return meetups where the exact phrase "Joe Bloggs" is entered as the entire line of the repeatable text strings field.
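As a rough mental model (and only that, since it bypasses Examine entirely), matching against a raw field is whole-value equality against each stored line. The stored value and the IsOrganisedBy helper below are hypothetical, and the ordinal, case-sensitive comparison is my assumption, so check how casing behaves in your own index.

```csharp
using System;
using System.Linq;

// Hypothetical stored value of a repeatable text strings field: one name per line.
var stored = "Elizabeth Warren\nWarren Buckley";

// Each line becomes one atomic value, mirroring the split done during indexing.
var atomicValues = stored.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);

// Simplified model of matching a raw field: the whole value must be equal.
bool IsOrganisedBy(string name) => atomicValues.Contains(name, StringComparer.Ordinal);

Console.WriteLine(IsOrganisedBy("Warren Buckley")); // True
Console.WriteLine(IsOrganisedBy("Warren Warren"));  // False
Console.WriteLine(IsOrganisedBy("Warren"));         // False
```

The "Warren Warren" false positive from earlier is gone, and a partial name no longer matches either.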

Searching in the backoffice

It's worth noting that this query won't work in the handy backoffice dashboard (/umbraco/#/settings?dashboard=settingsExamine) to test Lucene queries.

This is because the backoffice doesn't use a query analyser.

You'll need to modify your generated queries to add double-quotes around each value like so:

+organisers:"Joe Bloggs"
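If you have a handful of values to test, a tiny helper can build the quoted clauses for pasting into the dashboard. This is just a sketch: QuotedClause is a hypothetical name, and it assumes backslash-escaping of embedded backslashes and double quotes, as in standard Lucene query syntax.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: wraps a value in double quotes so it is treated as one phrase,
// backslash-escaping any backslashes or double quotes inside it.
static string QuotedClause(string field, string value)
{
    var escaped = value.Replace("\\", "\\\\").Replace("\"", "\\\"");
    return $"+{field}:\"{escaped}\"";
}

var organisers = new[] { "Joe Bloggs", "Warren Buckley" };
var clauses = string.Join(" ", organisers.Select(o => QuotedClause("organisers", o)));

Console.WriteLine(clauses); // +organisers:"Joe Bloggs" +organisers:"Warren Buckley"
```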

Further reading

If you're doing something like the above example and need to generate a filter with all the possible values of organisers (or tags or whatever), it might be worth looking into faceting.

As it happens, Jesper was grappling with the same issue recently, but came up with a different solution, which he also blogged about.

The Umbraco documentation for searching is getting more comprehensive too.

Search Extensions has some useful extension methods for using with Examine searches, as well as some sensible defaults (filtering by ancestors using the path field, for one). (A little bird told me faceting may be coming to Search Extensions in a release near you.)

Luke is a useful tool for peeking inside your index. It's a Java tool that now comes bundled with the Java version of Lucene. You'll need to use the Luke version which matches the version of Lucene.Net used by Umbraco - 4.8.0 at the time of writing. You'll need the JDK to run it. Also, don't try to run this on a monitor with any scaling - 1080p at 100% works best!