Improving CosmosDB Test Automation Reliability with Retry Logic

You will find that the Cosmos DB Emulator fails randomly, for no apparent reason, while doing simple things like getting an instance of a container or creating the database. In the world of cloud, it’s important to handle transient faults: errors that are not repeatable or consistent in when they appear.

They might look like this:

Business.Tests.BuildingTests.CreateNewBuilding [3s 650ms]

Error Message:

Microsoft.Azure.Cosmos.CosmosException : Response status code does not indicate success: 500 Substatus: 0 Reason: (Microsoft.Azure.Documents.DocumentClientException: Unknown server error occurred when processing this request. ActivityId: b55227fb-15d1-4506-a685-c6b7751271c3, Microsoft.Azure.Documents.Common/2.9.2, {"RequestStartTimeUtc":"2020-03-22T04:41:43.6841102Z","RequestEndTimeUtc":"2020-03-22T04:41:44.0570433Z","RequestLatency":"00:00:00.3729331","IsCpuOverloaded":false,"NumberRegionsAttempted":1,"ResponseStatisticsList":[],"AddressResolutionStatistics":[{"StartTime":"2020-03-22T04:41:43.6842253Z","EndTime":"2020-03-22T04:41:44.0570433Z","TargetEndpoint":"https://192.168.231.161:8081/dbs/mydb/colls"}],"SupplementalResponseStatistics":[],"FailedReplicas":[],"RegionsContacted":[],"ContactedReplicas":[]}, Windows/10.0.14393 cosmos-netstandard-sdk/3.4.2 at Microsoft.Azure.Cosmos.GatewayStoreClient.ParseResponseAsync(HttpResponseMessage responseMessage, JsonSerializerSettings serializerSettings, DocumentServiceRequest request) at Microsoft.Azure.Cosmos.GatewayStoreClient.InvokeAsync(DocumentServiceRequest request, ResourceType resourceType, Uri physicalAddress, CancellationToken cancellationToken) at Microsoft.Azure.Cosmos.GatewayStoreModel.ProcessMessageAsync(DocumentServiceRequest request, CancellationToken cancellationToken) at Microsoft.Azure.Cosmos.Handlers.TransportHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)).

Stack Trace:

at Microsoft.Azure.Cosmos.ResponseMessage.EnsureSuccessStatusCode()

at Microsoft.Azure.Cosmos.CosmosResponseFactory.ProcessMessageAsync[T](Task`1 cosmosResponseTask, Func`2 createResponse)

at Microsoft.Azure.Cosmos.DatabaseCore.CreateContainerIfNotExistsAsync(ContainerProperties containerProperties, Nullable`1 throughput, RequestOptions requestOptions, CancellationToken cancellationToken)

at Common.DataAccess.BaseDataAccess.GetContainerAsync() in D:\a\1\s\Common\DataAccess\BaseDataAccess.cs:line 27

at DataAccess.BaseCrudDataAccess`1.CreateAsync(T entity) in D:\a\1\s\Common\DataAccess\BaseCrudDataAccess.cs:line 24

at Business.BuildingRepository.CreateAsync(BuildingDetail entity) in D:\a\1\s\Location.Business\BuildingRepository.cs:line 88

at Business.Tests.BuildingTests.CreateNewBuilding() in D:\a\1\s\Location.Business.Tests\BuildingTests.cs:line 26

I handle these faults with some simple retry logic from a great post on Stack Overflow, which I modified to include async operations (DoAsync, and DoAsync<T> where T is the return type of the async method).
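A minimal sketch of what a DoAsync variant for operations with no return value could look like. The class name Retry, the parameter names, the fixed delay between attempts, and the default of three attempts are assumptions for illustration, not necessarily the code from the Stack Overflow post or from this project:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper class; declared partial so other overloads can live elsewhere.
public static partial class Retry
{
    // Runs an async action that returns nothing, retrying when it throws.
    // retryInterval: fixed pause between attempts (assumed; no backoff).
    // maxAttemptCount: total number of attempts, including the first one.
    public static async Task DoAsync(
        Func<Task> action,
        TimeSpan retryInterval,
        int maxAttemptCount = 3)
    {
        if (action == null) throw new ArgumentNullException(nameof(action));
        if (maxAttemptCount < 1) throw new ArgumentOutOfRangeException(nameof(maxAttemptCount));

        var exceptions = new List<Exception>();

        for (int attempt = 0; attempt < maxAttemptCount; attempt++)
        {
            try
            {
                // Wait before every attempt except the first.
                if (attempt > 0)
                {
                    await Task.Delay(retryInterval);
                }

                await action();
                return; // Success: stop retrying.
            }
            catch (Exception ex)
            {
                // Remember the failure and move on to the next attempt.
                exceptions.Add(ex);
            }
        }

        // Every attempt failed: surface all collected failures together.
        throw new AggregateException(exceptions);
    }
}
```

This sketch retries on any exception, so a genuine bug is also retried before it surfaces. That is acceptable for flaky test setup, but production code would usually retry only on errors known to be transient.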


Then the one that has a return type:
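A minimal sketch of what a DoAsync<T> variant that returns a value could look like. As an illustration only, it assumes a static class named Retry, a fixed delay between attempts, and a default of three attempts; none of these details are confirmed by the original post:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper class; declared partial so other overloads can live elsewhere.
public static partial class Retry
{
    // Runs an async function and returns its result, retrying when it throws.
    // retryInterval: fixed pause between attempts (assumed; no backoff).
    // maxAttemptCount: total number of attempts, including the first one.
    public static async Task<T> DoAsync<T>(
        Func<Task<T>> action,
        TimeSpan retryInterval,
        int maxAttemptCount = 3)
    {
        if (action == null) throw new ArgumentNullException(nameof(action));
        if (maxAttemptCount < 1) throw new ArgumentOutOfRangeException(nameof(maxAttemptCount));

        var exceptions = new List<Exception>();

        for (int attempt = 0; attempt < maxAttemptCount; attempt++)
        {
            try
            {
                // Wait before every attempt except the first.
                if (attempt > 0)
                {
                    await Task.Delay(retryInterval);
                }

                // Success: hand the result straight back to the caller.
                return await action();
            }
            catch (Exception ex)
            {
                // Remember the failure and move on to the next attempt.
                exceptions.Add(ex);
            }
        }

        // Every attempt failed: surface all collected failures together.
        throw new AggregateException(exceptions);
    }
}
```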


Example usage in the Cosmos DB setting is this:
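A hedged sketch of how retry logic might wrap the container lookup that fails in the stack trace above. The class SampleDataAccess, its constructor parameters, and the RetryAsync helper are hypothetical names chosen for illustration. To keep the sample self-contained it carries its own compact stand-in for a DoAsync<T>-style helper, and the one-second delay and three attempts are assumed values. It requires the Microsoft.Azure.Cosmos NuGet package:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Hypothetical data access class; all names here are illustrative only.
public class SampleDataAccess
{
    private readonly CosmosClient _client;
    private readonly string _databaseId;
    private readonly string _containerId;
    private readonly string _partitionKeyPath;

    public SampleDataAccess(
        CosmosClient client,
        string databaseId,
        string containerId,
        string partitionKeyPath)
    {
        _client = client ?? throw new ArgumentNullException(nameof(client));
        _databaseId = databaseId;
        _containerId = containerId;
        _partitionKeyPath = partitionKeyPath;
    }

    // Creates the database and container if they do not exist, retrying the
    // whole sequence up to 3 times with a 1 second pause between attempts.
    public Task<Container> GetContainerAsync()
    {
        return RetryAsync(async () =>
        {
            DatabaseResponse database =
                await _client.CreateDatabaseIfNotExistsAsync(_databaseId);
            ContainerResponse container =
                await database.Database.CreateContainerIfNotExistsAsync(
                    _containerId, _partitionKeyPath);
            return container.Container;
        }, TimeSpan.FromSeconds(1), 3);
    }

    // Compact stand-in for a DoAsync<T>-style helper (assumed shape).
    // Rethrows the last failure once maxAttemptCount attempts are used up.
    public static async Task<T> RetryAsync<T>(
        Func<Task<T>> action,
        TimeSpan retryInterval,
        int maxAttemptCount)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await action();
            }
            catch (Exception) when (attempt < maxAttemptCount)
            {
                // Attempts remain: pause, then loop around and try again.
                await Task.Delay(retryInterval);
            }
        }
    }
}
```

Because both calls are of the "create if not exists" kind, repeating the whole sequence after a failure is safe: an attempt that half-succeeded simply finds the database already there on the next pass.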

Just adding some simple retry logic around my Cosmos DB data access code drastically improved the reliability of my automated tests.
