We recently hit an issue where we were seeing the Request Units (RUs) in a test environment spiking even under a relatively low load of 3-4 users. As the environment was due to scale to many more users, this naturally caused some concern, and while we could have thrown more money at it by increasing the RU throughput in Cosmos for that database, such a light load did not feel like it should be having that sort of impact. This gave a good opportunity to show how to diagnose these issues in Azure.
For an overview of Azure Cosmos DB, see an earlier blog post on using Azure Cosmos DB and Azure Storage as a data repository for a serverless application.
Reviewing the Metrics
The first place this shows up as an issue is the Azure Cosmos DB Overview page in the Azure Portal. If your activity causes the number of Request Units per second to exceed your provisioned throughput, you will see a message like the one below on the page.
Clicking on the link takes you to a chart indicating how often and when you have breached the limits recently.
The number of spikes here would often lead to concerns about missing data, but the repository pattern used with Cosmos and C# mentioned in the earlier blog post caters for this by retrying when an HTTP 429 (Too Many Requests) response is returned. The impact is that all calls take longer, so the performance of the application is degraded, especially if this is happening a lot. The big question is: which query was causing the spikes?
Implementing Log Analytics on Azure Cosmos DB Diagnostic Logs
The great thing about using Azure, especially with the PaaS services, is that there are built-in ways to get this information. In the case of Azure Cosmos DB, much of it can be pushed to the Diagnostic Logs. In the Azure Portal, Cosmos DB has Diagnostic settings in the menu, where you can enable diagnostic logs and choose whether to send them to a Storage account, stream them to an Event Hub, send them to Log Analytics, or all three. There are three types of log:
- DataPlaneRequests – details about each request made, including the Request Unit charge
- MongoRequests – the equivalent for those using the MongoDB API
- QueryRuntimeStatistics – details of the queries being run
There is also the option to capture metrics on a regular basis, such as the total Request Unit charge and the number of requests. In our situation, we were only interested in DataPlaneRequests and QueryRuntimeStatistics. For more details on setting these up, there is a great page in the Microsoft Docs.
Once this is set up (or at least after about ten minutes, from experience), you will see two new containers added to the designated storage account – insights-logs-dataplanerequests and insights-logs-queryruntimestatistics:
If you do not want to set up Log Analytics then you could stop here and start reviewing the logs. However, if you want to query and manipulate the results (which is hugely helpful when diagnosing), then I would recommend also feeding them into Log Analytics.
Reviewing the logs with Log Analytics
Once you have Log Analytics set up, you should see it appear in your Resource Group. The place to get started is Logs under the General part of the menu.
The query language can be pretty daunting, especially as it feels like a mix of SQL and scripting languages – there are some details on getting started with the language on Microsoft Docs, but I have not found many other useful sites out there yet. To just see the latest details, you can type AzureDiagnostics into the query window and hit Run over the last hour to see the results returned.
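For example, a minimal starting query to narrow the results down to the Cosmos DB data plane requests from the last hour might look something like the sketch below (the Category value matches the log type enabled earlier; adjust it if your workspace shows something different):

AzureDiagnostics
// limit to the last hour and to the Cosmos DB data plane requests
| where TimeGenerated > ago(1h)
| where Category == "DataPlaneRequests"
| take 50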
To help solve the issue we were facing, I needed to see the messages with a high request charge, so I ran the query:

AzureDiagnostics
| where todouble(requestCharge_s) > 100
| sort by todouble(requestCharge_s) desc
This returns a lot of different columns, but the key ones are the requestCharge_s value itself and the activityId. The request charge shows how high the RUs reach for that single request, as shown below where some are peaking at almost 3000, which is not ideal for 1000 RU/s of provisioned throughput!
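To see when the spikes happen, rather than scanning individual rows, the charges can also be binned over time and charted. This is just a sketch using the same requestCharge_s column; the 5-minute bin size is an arbitrary choice:

AzureDiagnostics
| where Category == "DataPlaneRequests"
// highest single-request charge in each 5-minute window
| summarize maxRequestCharge = max(todouble(requestCharge_s)) by bin(TimeGenerated, 5m)
| render timechart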
The next question is which query is causing this?
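Before digging into the raw query text, it can also help to narrow things down by collection and operation type. Something along these lines should do it, although the collectionName_s and operationType_s column names are how they appear for me and may differ in your workspace:

AzureDiagnostics
| where Category == "DataPlaneRequests"
| where todouble(requestCharge_s) > 100
// which collection and operation the expensive requests belong to
| summarize expensiveRequests = count(), avgCharge = avg(todouble(requestCharge_s)) by collectionName_s, operationType_s
| sort by avgCharge desc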
Viewing the Cosmos Queries
At this point I have a little confession to make. I think that these results should also be showing in Log Analytics, but I have not been able to find them! Instead, you can view the queries from the storage output. Drilling down through the insights-logs-queryruntimestatistics container to the database shows a file every 5 minutes:
Opening up the JSON file that covers the period you identified in Log Analytics will allow you to find the query that was used:
Unfortunately, it does not appear to show the parameter values used, but it gives a good indication of what the problem might be. In this case, the query was returning everything that did not match an ID, bringing back over 76,000 fairly large JSON documents unnecessarily.
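Incidentally, if the query runtime statistics do turn up in your Log Analytics workspace (as I suspect they should), a filter on that category ought to surface them so they can be matched back to the expensive data plane requests by activity ID. I have not been able to verify this myself, so treat it as a starting point:

AzureDiagnostics
| where Category == "QueryRuntimeStatistics"
| sort by TimeGenerated desc
| take 50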
Summary
With Azure Cosmos DB, you can get detailed information on what is happening with no impact on the performance of your environment. This can really help when diagnosing Azure Cosmos DB issues, and with the ability to fine-tune queries to keep the cost down, you have the chance to make an immediate difference to the return on investment, compared with the days of having to split work between different servers to see any improvement.
By Kevin McDonnell, Senior Technical Architect at Ballard Chalmers
About the author
Kevin McDonnell is a respected Senior Technical Architect at Ballard Chalmers. With a Master of Engineering (MEng) degree in Engineering Science from the University of Oxford, he specialises in .NET & Azure development and has a broad understanding of the wider Microsoft stack. He listens to what clients are looking to achieve and helps identify the best platform and solution to deliver on that. Kevin regularly blogs on Digital Workplace topics and is a regular contributor to the monthly #CollabTalk discussions on Twitter.