Sitecore Content Tagging & Azure Text Analytics

To implicitly profile / understand site users, the more we know about our content the better. And by the more we know, what I really mean is taxonomy and meta data associated with our content that we can analyse to draw educated conclusions about the nature of the content our site users are interested in and the tasks they are trying to accomplish.

In the best case scenario, all of our site content would be nicely tagged with relevant and well designed meta data, controlled and uncontrolled vocabulary and organised within a carefully designed and effective information architecture. But as we know that is often not the case, it’s simply too much of a hands on and ongoing effort for most teams and in a lot of cases content is also produced by third party agencies or freelancers who may not be as dialled in to the nuances of the data model and entities that are important to the organisation.

As developers, this is an opportunity to put our thinking caps on to see where technology can help.

All three major cloud providers (Google, Amazon, Microsoft) provide machine learning based content tagging services and they all look pretty similar. The premise is this, give us your unstructured content, we’ll analyse it and send you back structured data in the form of sentiment analysis, recognised entities (people, places, organisations, objects), recognised topics, key phrases etc. The focus of this post is on Microsoft’s Text Analytics service, which is part of Cognitive Services.

The fastest way to get started and see what Text Analytics can do for you is to visit the home page and provide the demo with some of your content:


But I’m going to jump straight into how we can integrate this service with Sitecore. The good news is that Microsoft offers a free tier with Text Analytics, so assuming you have an Azure account (if not sign up for a free trial) go ahead and follow the documentation to create a Text Analytics resource in your Azure subscription, which will give you a Text Analytics service endpoint for your region and the API keys you need to work with the service.

I’m going to integrate Text Analytics by subscribing to the publish begin event so as editorial teams make changes to and publish new content, it’s automatically tagged up before it makes it to the content delivery roles. So first we need a handler:

First I add a handler to the publish begin event. In the handler, I extract the item being published and then call an Item extension method, which takes care of the rest:

public static void AnalyseTextAzure(this Item item)
    var api = new AzureApi();

    var keyPhraseAnalysis = api.SendKeyPhraseAnalysis(item);

    if (keyPhraseAnalysis != null)
        var keyPhrases = new List();

        foreach (var document in keyPhraseAnalysis.Documents)

        item["Azure Key Phrases"] = string.Join(",", keyPhrases);

    var linkedEntitiesAnalysis = api.SendLinkedEntitiesAnalysis(item);

    if (linkedEntitiesAnalysis != null)
        var linkedEntities = new List();

        foreach (var document in linkedEntitiesAnalysis.Documents)
            linkedEntities.AddRange(document.Entities.Select(x => x.Name));

        item["Azure Linked Entities"] = string.Join(",", linkedEntities);

To explain this. Azure Text Analytics has separate APIs for getting linked entities and key phrases in relation to a piece of text. So for the item being published we are calling both of these APIs and storing the responses in separate fields on the item.

A couple of things to note:

  1. Before sending text to Text Analysis we must convert it to plain text, so if you are sending rich text content you will need to convert, handily Sitecore has a utility method to do just this TextUtil.StripHtml
  2. We can send up to 1000 documents at a time for analysis. You may choose to bundle fields on the item and select fields on component data sources into a single document or send fields on the item and fields on the data sources as separate documents, it depends on how you structure your content. In either case 1000 documents should be way more than enough
  3. Each document you send to Text Analysis counts as a transaction, your free tier is based on up-to 5000 transactions every 30 days. So think about this when choosing how to bundle your content
  4. The maximum length of an individual document is 5000 characters
  5. Text Analysis works better with larger volumes of text, so take this into consideration when thinking about how you want to package up your content into documents
  6. Text Analysis works better with English currently, so expect better results with English. You can send non-english content for analysis but you will need to map Sitecore’s culture codes to the codes expected by Azure

To take a closer look at one of these APIs:

public KeyPhraseResponse SendKeyPhraseAnalysis(Item item)
    var request = BuildRequest("keyPhrases");

    var textAnalysis = new TextAnalysisRequest
        Documents = BuildDocuments(item)


    return Execute(request);

The first thing it does is build a web request using the RestSharp library available via Nuget, this pulls things like the Text Analysis base url, API keys and key phrase analysis endpoint from include file settings. It then goes on to build an object (TextAnalysisRequest) that can be serialized to Json in the format that Azure Text analysis expects and contains the content we want to analyse, which is a combination of item level field values and selected field values on select renderings added to the final layout of the item (components). In this example, each component is sent as a separate document:

protected virtual List BuildDocuments(Item item)
    var documents = new List

    documents.AddRange(item.ExtractRenderings(new List { ID.Parse("{A2D485F5-F114-402E-BA36-25216401CD6A}"), ID.Parse("{41104F41-7136-4A18-B673-6D755566FA73}") }).Select(BuildDocument));

    return documents;

protected virtual TextAnalysisDocument BuildDocument(Item item)
    return new TextAnalysisDocument
        Id = item.ID.ToString(),
        Language = "en",
        Text = ItemToText(item, new Dictionary
            { ID.Parse("{F9901CAC-A69D-47B2-85BA-C210C357C252}"), new List { ID.Parse("{AD3BB97D-5816-4A06-A0F5-EF899C6C6766}"), ID.Parse("{17B05F6B-2E3D-4F3D-B224-C8E7BFC7656F}") } },
            { ID.Parse("{3B02F9D9-8E2C-4136-BEE8-57F6295E956D}"), new List { ID.Parse("{FBAB79F7-2805-4AB9-B869-96CFC31F061D}") } },
            { ID.Parse("{E5B6503F-C4D5-4458-8066-17C11115C871}"), new List { ID.Parse("{93A5C271-9687-463E-95B2-FF5F2485FC5C}") } }

protected virtual string ItemToText(Item item, IDictionary templateToFieldMappings)
    var sb = new StringBuilder();

    if (templateToFieldMappings.ContainsKey(item.TemplateID))
        foreach (var fieldId in templateToFieldMappings[item.TemplateID])
            sb.Append(item[fieldId] + " ");

    var rawText = sb.ToString();

    if (string.IsNullOrWhiteSpace(rawText))
        return string.Empty;

    // azure text analysis currently only allows up to 5000 characters per analysed document

    return TextUtil.StripHtml(rawText.Length < 5000 ? rawText : rawText.Substring(0, 5000));

There's a bunch of hard coded GUIDs in here (NEVER DO THIS!!!), which is partly why this code's not on GitHub yet but I will sort that out soon and make the whole thing available to clone/download, it's Helix architecture based so should be OK to include it any solution. The concept is fairly straight forward, given the page item being published and the data source items associated with select renderings on the page item (using the ExtractRenderings extension method),  get the fields that map to the items template (the fields we want to send for analysis), concatenate them together, strip out the HTML and return the text up-to 5000 characters.

So now every page that gets published is being sent off to Cognitive Services and its machine learning based algorithms to analyse the content and automatically tag items. This data can then be used to drive personalisation, automatic profiling (see my Profile Mapping module) or in any way you see fit for your needs.

I haven’t spent a whole bunch of time looking at it but Sitecore 9.1 comes with an out of the box integration with Open Calais , which includes content tagging abilities. As soon as I get some time I will look deeper into that and report back with findings.


Adventures in Sitecore Partial Html Cache Clearing

Sitecore offers us the HtmlCacheClearer and the HtmlCacheClearAgent. The former will clear the entire HTML cache for sites defined within the handlers sites property at the end of a publish, the latter is disabled by default in the scheduler but if enabled will clear the entire HTML cache for every site hosted in the solution based on the interval you specify.

Neither of these are particularly desirable. Clearing the entire HTML cache at the end of every publish can massively reduce the performance boost you can gain from using HTML cache if publishing is a regular activity. It also clears caches for every site configured regardless if there were changes to the sites content or not.

Clearing the entire HTML cache for every site on a schedule has some advantages in that we can hold onto the cache for longer (based on the interval you set, which can be loosely tied to acceptable delays in seeing content changes go live) but again it’s indiscriminate in that it clears the cache whether the content has changed or not. Scheduled clearance can also create a “hill effect” in CPU usage on your content delivery servers.


The spikes you can see above are a direct consequence of setting the HTML cache to clear every 30 minutes for a site that makes heavy use of HTML cache. This can be compounded if you have multiple content delivery servers all clearing at the same time and it’s extremely difficult (perhaps impossible) to synchronise the interval based scheduler across delivery servers to try and offset them so this solution also does not scale very well at all.

And what about deployments, if we go down to a single delivery server while upgrading the other the spikes after cache clearance can get pretty uncomfortable.

Well the good news is we can get rid of this problem by extending Sitecore so that it only clears the HTML cache for the sections of the site we specify, within the sites we specify and when we specify.

First I’ll point out how to access the HTML cache for a given site:

// skipping null checks for brevity
var cache = Factory.GetSiteInfo("siteName").HtmlCache;

HtmlCache inherits from Sitecore.Caching.CustomCache and if you’ve ever created your own cache before by inheriting from CustomCache you will know that you need to generate a cache key to add an entry.

Sitecore generates this cache key for every rendering in the mvc.renderRendering pipeline and GenerateCacheKey processor and the cache key is a concatenation of (depending on the caching settings on the rendering item):

  • Controller name
  • MVC area
  • Data source path
  • Language
  • Querystring
  • Rendering parameters
  • Device

And so on. If you want to see all of the keys that are currently in the HTML cache to understand the format you can call:

var keys = cache.InnerCache.GetCacheKeys();

If you want to customize the way cache keys are generated or add additional context to the cache keys you can replace Sitecore’s GenerateCacheKey processor with your own. For example if you are working with dynamic pages (or virtual urls) and want to be able to cache the renderings on these pages even though they don’t have a traditional data source to vary by, you could do something like:

public class GenerateDynamicCacheKey : GenerateCacheKey
    protected override string GenerateKey(Rendering rendering, RenderRenderingArgs args)
        var key = base.GenerateKey(rendering, args);

        // get a dynamic identifier i.e. product id

        return AppendProductId(key);

Now we can set renderings to cacheable on dynamic pages which vary by the product being displayed.

This is where things get interesting. As well as being able to clear the entire HTML cache for a site by calling cache.Clear()CustomCache, which HtmlCache inherits from exposes two additional methods:

  1. public virtual void RemoveKeysContaining(string value);
  2. public virtual void RemovePrefix(string prefix);

The names of these methods should make it fairly self-explanatory what they do and if we think back to the example of appending the product id to the end of the cache key well now we’re in a position where we could add a handler to the publish:end:remote event that extracts the item just published, check to see if it’s a product and if so remove the HTML cache entries with keys containing the product id:

protected void OnPublishEnd(object sender, EventArgs args)
    // null checks skipped for brevity
    var item = (Event.ExtractParameter(args, 0) as Publisher).Options.RootItem;

    if (item.IsProduct()) // extension method
        Factory.GetSiteInfo("siteName").HtmlCache.RemoveKeysContaining(item["Product Id"]);

So now we can hold on to the precious HTML cache for all of our product pages and only clear the cache for individual products when they’re updated.

Conversely we may still want to clear all non-product related caches on a schedule. First let’s take a look at what RemoveKeysContaining actually does. Well it calls InnerCache.RemoveKeysContaining(value). InnerCache is a property that implements ICache, which turns out to be Sitecore.Caching.Cache, which in turns inherits from Cache. The RemoveKeysContaining method in Sitecore.Caching.Cache looks like this:

public void RemoveKeysContaining(string keyPart)
    base.Remove((string key) => key.IndexOf(keyPart, StringComparison.InvariantCultureIgnoreCase) > -1);

So essentially RemoveKeysContaining is really a helper method on the Cache class, which internally creates a predicate that gets passed into the Remove method of the generic cache class it inherits from.

Ok, great, so this means we can create our own predicate. Well how about when we generate our own cache keys as described earlier we add an identifier to all product related caches $”{productId}.product“.

This would enable us to do something like the following in our scheduled, non-product related cache clearer:

// agent registered in the scheduling section of the Sitecore config
// siteName(s) to clear could be defined in the agent config
public void Run
    var cache = Factory.GetSiteInfo("siteName").HtmlCache;
cache.InnerCache.Remove((string key) => !key.EndsWith(".product"));

That simple. So now we have individual product page caches clearing only when the product is modified and non-product related caches clearing based on an interval.

Hopefully it’s evident now that using a combination of:

  1. Customised HTML cache keys
  2. Custom predicates for removing specific entries based on their key

You can really begin to fine tune your HTML cache clearing strategy to to clear entries only when it’s absolutely necessary.

Happy optimization!

Sitecore Publishing Service Field Casing

There is a bug confirmed by Sitecore support for version 2.2.1 (revision 180807) of the publishing service, where changes to fields where the only change is the casing of the field value are not promoted to publishing targets. This is a bit of an edge case but it can manifest itself if for example you are storing class/assembly names and newing up objects using Sitecore’s ReflectionUtil:

var instance = ReflectionUtil.CreateObject(
                new object[] { }) as IMyInterface;

Now let’s say someone went against class naming conventions and created a class called JSONParser and then down the line that class got renamed to JsonParser, well the publishing service would not promote that change to your publishing targets and object instantiation on the delivery servers would begin to fail.

The issue is due to the collation of the database (Latin1_General_CI_AS).

They did supply a working patch that is pretty straight forward to apply, you only need to modify the publishing service itself (1 dll, 1 xml config). Support ticket number 290996.

5 Ways to Learn Sitecore Development

Even though Sitecore Symposium was 3 weeks ago I’m still processing the experience and the takeaways I got from the event, every year is always different in this regard. One of the things that struck me this year was the broad levels of experience within the development community. Many of us have been doing this for over 10 years, some of us 15+ but as the platform continues to grow and thrive there’s a whole wave of developers relatively new to the platform coming through. Which basically inspired this back to basics guide on what I find to be the best ways to grow your Sitecore development skills. If you’re getting started or looking to upgrade those skills, I hope this helps!

1. Learn from Sitecore

So basically Sitecore is built on Sitecore, which means chances are the customisation/extension that you are trying to build already has solid examples in terms of technical approach and best practices that Sitecore are using themselves. This is true of:

  • Pipelines
  • Event handlers
  • Scheduled tasks
  • DI patterns
  • Ribbon buttons
  • Workflow actions
  • Patch config files
  • Rules engine conditions and actions

This is certainly not an exhaustive list and of course this means that a decompiler is your friend. Planning on extending a pipeline with a custom processor? Take a look at an existing processor in the pipeline, what is it doing? What is it NOT doing? What kind of properties are available on the pipeline args?

2. Patch All The Things

Or to put it another way, know what you changed and have an easy way to turn it off if you need to. It can’t be stressed enough that any modifications you make to Sitecore should be applied to Sitecore using patch files. This has huge benefits in terms of troubleshooting, change control, environment specific transforms, transparency, upgrades and so on. There are plenty of resources out there that break down Sitecore’s patching syntax. Why is this about learning Sitecore development? Because starting off with best practices is a lot easier than “un-learning” bad practices and modifying Sitecore’s configuration files directly is up there with worst of bad practices.

3. Master Sitecore YouTube Channel

We all learn in different ways, some people are visual, some prefer to read and of course some prefer to digest video content. There are so many great ways to consume Sitecore training materials in video format including:

  • The Sitecore Virtual User Group Conference
  • Live streams of user groups from all over the globe
  • Sitecore’s official YouTube channel

I single out Master Sitecore because it seems increasingly that video content from other sources is being aggregated on this channel, for example live recordings from this years SUGCON in Berlin. I highly recommend you check out some of the material on here as it’s contributed from a wide range of sources in the community.

4. Sitecore Documentation

Ok, this might seem like an obvious thing to mention BUT Sitecore have come on leaps and bounds in the last few years in the quality of the documentation they provide and we all acknowledge the awesome work being done by Martina Welander to make documentation a first class citizen at Sitecore. What’s interesting is that as well as the general documentation that’s available on there are “documentation portals” for specialised subjects such as:

And the information provided in those is definately well worth a read. We have also recently had the promise from Sitecore that the move to two releases a year means that every release will be a “complete” release, which includes full documentation.

One other thing to pay some attention to from a documentation perspective is the release notes on for the core CMS and optional modules, this in combination with the known issues sections can provide some much needed insight to what you can expect from the platform.

5. The Sitecore Community!

Last but by no means least, if you haven’t noticed already Sitecore has an incredible network of community contributors all over the globe on various channels:

  • Individual blogs
  • Sitecore Stack Exchange
  • Twitter (#Sitecore, #SitecoreUG, #SitecoreSym, #SitecoreMVP…)
  • Open source projects
  • Slack
  • User groups
  • Conferences

The insights being shared by the community are invaluable, get involved! One way I found to get involved in the community initially was to think of parts of Sitecore that I wanted to learn more about and build a small module that does something useful and uses those parts of the platform… and then.. well, share it. It’s a good confidence booster, which can also launch you into things like speaking at user groups. The community is always looking to hear from new speakers (as well as the familiar faces).

Happy Sitecore… ing.

processItem Pipeline in Relation to FXM

I’ve talked a lot recently about the processItem pipeline and how it can be extended to automatically associate site visitors to Sitecore profiles without having to tag all of your page items in the content tree, in fact it was the theme of my Sitecore Symposium talk in Orlando.

What I have discovered fairly recently is that FXM (Federated Experience Manager) uses the exact same pipeline to register goal conversions, page events etc. on external sites. If you haven’t heard of FXM before, essentially FXM allows you to do “Sitecore like things” on sites that are not running on Sitecore i.e.

  • Track user activity
  • Inject Sitecore managed content into external sites
  • Personalise content injected from Sitecore on external sites

Not an exhaustive list but you get the point. It’s not a silver bullet however and it’s far from perfect but let’s say you want to use the FXM APIs in a non-website context, let’s say mobile apps, to track user activity and goal conversions or do the same kind of profile mapping that I talk about in my Symposium talk. Well, because FXM uses the processItem pipeline within a page tracking context you absolutely can, even though FXM has no back-end UIs that will allow users to manage tracking in a non-website context.

So how exactly? Well firstly when you create an external site in FXM you end up with a “Domain Matcher” item, which defines mappings between the external sites host name and the FXM site it relates to and language mapping rules. This allows you to tell Sitecore how to intepret FXM requests from external sites with the freedom to define how the external site provides that information, could be in the request headers for example.


As you begin to set “filters” for the external site using the FXM version of the experience editor (not available for mobile apps of course), for example where the external site address being tracked equals a specific page then register the goal conversion “subscribed to newsletter”, filter items are created as children of the domain matcher. The goals to register etc. are then stored against the tracking field of the child filter item.

The RegisterEventsForMatchedPagesProcessor within the tracking.matchpages pipeline defined in Sitecore.FXM.config then basically iterates over the child filter items that match the current FXM external site tracking request, creates an instance of ProcessItemArgs and passes it into the processItem pipeline.


So again, it would be a relatively trivial thing to extend either the FXM tracking.matchpages or processItem pipeline to read from a repository of rules based profile map items to associate the current visitor with Sitecore profiles even within a non-Sitecore and non-website context.

I am currently working on a project where we will quite likely take this approach, I will update this post with my findings.

Happy profiling!

Sitecore Symposium 2018 Slides


I had a great time presenting my talk “Taxonomy & Rules Based Profile Mapping” at this years Sitecore Symposium in Orlando, the best comment I had after the talk was a gentleman who came over after and said “thanks so much, that was one of the best developer talks of the conference so far”, it really validated all the hard work and consideration that goes into these things.

I’ve uploaded the slides from the talk for your reference as a pptx so you have the notes, get your download on:

Taxonomy & Rule Based Profile Mapping

The link to the GitHub repository containing the source code is in the slides but if you just want that you can grab it here. I still need to work on the documentation but it’s coming.

It’s been two years in a row speaking at Symposium (last year was mobile app development with the Sitecore Xamarin SDK), let’s see if I can make it three (wherever it may be)!

Sitecore Sunday Pt 4: Conditional (Rules Based) Field Validators

Sitecore’s OOTB field validators are used on pretty much every Sitecore implementation I’ve seen in the last 10 years, which makes it a pretty useful feature! However there are times when there is a little more context surrounding whether or not a particular field is valid.

Thankfully Sitecore makes it super simple to create custom field validators where we have reign to do pretty much whatever we want and there have been some interesting ones put out there for example this image width, height and aspect ratio validator by Brian Pedersen.

I will be calling this approach IFTTV (If This Then Valid).

If This Then Valid


To setup field validator items that use the rules engine we first of course need a template with the necessary validation fields. Sitecore actually already provides this template:

/sitecore/templates/System/Validation/Rules Validation Rule

However this template is intended for use with item level validation that reads from rules under the item:

/sitecore/system/Settings/Rules/Validation Rules/Rules

I feel this approach is disconnected from the management of field validation rules under:

/sitecore/system/Settings/Validation Rules/Field Rules

But we can still take advantage of the template. The first step is to create a template that inherits from the Rules Validation Rule template and specifies our field level rules implementation in the type field.

If This Then Valid Template

The implementation is fairly straight forward:

using Sitecore.Data.Validators;
using Sitecore.Rules;
using Sitecore.Rules.Validators;
using System.Runtime.Serialization;

public class IfThisThenValid : StandardValidator
    public override string Name => "If This Then Required";

    public IfThisThenValid()

    public IfThisThenValid(SerializationInfo info, StreamingContext context) : base(info, context)

    protected override ValidatorResult Evaluate()
        var item = GetItem();

        if (item == null)
            return ValidatorResult.Valid;

        var ruleContext = new ValidatorsRuleContext
            Item = item,
            Validator = this,
            Result = ValidatorResult.Valid,
            Text = string.Empty

        var validatorItem = item.Database.GetItem(ValidatorID);

        if (validatorItem == null)
            return ValidatorResult.Valid;

        var rules = RuleFactory.GetRules<ValidatorsRuleContext>(new[] { validatorItem }, "Rule");

        if (rules == null)
            return ValidatorResult.Valid;


        if (ruleContext.Result == ValidatorResult.Valid)
            return ValidatorResult.Valid;

        base.Text = ruleContext.Text;

        return base.GetFailedResult(ruleContext.Result);

    protected override ValidatorResult GetMaxValidatorResult()
        // Critical or above (Fatal) is important for experience editor support

        return base.GetFailedResult(ValidatorResult.CriticalError);

To summarise:

  • Inherit from Sitecore’s StandardValidator
  • Get the item being validated (rules are run against this as the context item)
  • Get the validator item (this is where our rules live)
  • Get the rules in the validator item and run them against the context item using a ValidatorsRuleContext (which returns a ValidationResult). This is where the difference lies from Sitecore’s standard implementation*, which evaluates the context item against the rules under: /sitecore/system/Settings/Rules/Validation Rules/Rules
  • Return the result


Then I will create a folder under /sitecore/system/Settings/Validation Rules/Field Rules and an insert options rule so we can create new rules based field validators and organise them in folders:

If This Then Valid Insert Options

As an example I added two fields to the Sample Item template, Sitecore Image and External Image, the validation scenario is at least one of them must be provided. Here’s our rule based validator:

If This Then Valid Rule

Which is associated with our fields on the sample item:

If This Then Valid Field

And here’s it in action:

If This Then Valid Usage

And that’s it, pretty straight forward but a ton of uses, especially considering the extensibility of the rules engine.

Happy validating!