Write it down once NET analysis of a drug warehouse management system stuck

Posted by dub on Sun, 23 Jan 2022 05:27:10 +0100

1: Background

1. Tell a story

At the beginning of this month, a friend came to me on wx and said that after a period of time, there would be only requests but no responses. The screenshot is as follows:

From my friend's description, it seems that the program is stuck by something. It's relatively easy to solve this problem. Next, I'll analyze it with windbg.

2: Windbg analysis

1. What is the request doing?

Since my friend said that the api has a request but no response, how can I verify whether my friend's words are correct? We all know NET uses HttpContext to represent a request. The implication is that you can grasp the duration attribute under HttpContext. There is one in Netext! The whttp command can help us.

0:000> !whttp
HttpContext    Thread Time Out Running  Status Verb     Url
000000563bf803b0   42 00:01:50 00:01:24    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x-HN
000000563bf84660   -- 00:01:50 Finished    200 GET      http://xxx.com:30003/
000000563c4a0470   51 00:01:50 00:00:12    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2C
00000056bbf63590   30 00:01:50 00:02:41    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx-B2C
00000056bc82a038   -- 00:01:50 Finished    200 GET      http://localhost:30003/
00000056bc84a3e8   44 00:01:50 00:00:51    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x
00000056bc8671c8   46 00:01:50 00:00:45    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx-B2C
000000573bf44698   35 00:01:50 00:02:39    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x
000000573bf483c0   33 00:01:50 00:02:41    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x-HN
000000573bf97e80   40 00:01:50 00:02:32    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=ZJB2C
000000573c583b08   -- 00:01:50 Finished    200 GET      http://localhost:30003/
000000573c589ec8   -- 00:01:50 Finished    200 GET      http://xxx.com:30003/Wms/xxx/xxx/xxx
000000573c760e28   -- 00:01:50 Finished    200 POST     http://xxx.com:30003/Wms/xxx/xxx/xxx
000000573c95f990   48 00:01:50 00:00:31    200 POST     http://xxx.com:30003/Wms/Common/xxx?xxx=xxx&xxx=x-HN
00000057bbf4f8e8   31 00:01:50 00:02:12    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x
00000057bc080340   50 00:01:50 00:00:19    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x
000000583c4aee80   43 00:01:50 00:01:11    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2B
000000583c4d0c50   53 00:01:50 00:00:01    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2B
00000058bbf8f1a0   34 00:01:50 00:02:22    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2B
000000593bfe1758   41 00:01:50 00:01:22    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2C
000000593c892160   -- 00:01:50 Finished    200 GET      http://xxx.com:30003/Wms/xxx/xxx/xxxJob
000000593ca813b0   45 00:01:50 00:00:30    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx-HN
000000593caa45d8   -- 00:01:50 Finished    200 GET      http://xxx.com:30003/
00000059bc1ad808   32 00:01:50 00:01:45    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx-B2C
00000059bc1c3d70   36 00:01:50 00:01:29    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x

25 HttpContext object(s) found matching criteria

From the Running column, we can see that most requests have reached more than 1 minute, which also verifies the stuck problem mentioned by friends. According to experience, we can take the thread where the largest httpContext in the Running column is located, that is, threads 30 and 33 above to see what they are doing?

2. Explore the longest Running thread

Next, switch to threads 30 and 33 and look at their thread stacks.

0:000> ~30s
ntdll!NtWaitForSingleObject+0xa:
00007ffd`b81f024a c3              ret
0:030> !clrstack 
OS Thread Id: 0x29d0 (30)
        Child SP               IP Call Site
0000005acc3ac590 00007ffdb81f024a [PrestubMethodFrame: 0000005acc3ac590] xxx.xxx.RedisConnectionHelp.get_Instance()
0000005acc3ac850 00007ffd4dd78911 xxx.xxx.RedisCache..ctor(Int32, System.String)
0000005acc3ac8c0 00007ffd4dd78038 xxx.xxx.CacheByRedis.HashGet[[System.__Canon, mscorlib]](System.String, System.String, Int32)
0000005acc3ac968 00007ffdabef1f7c [StubHelperFrame: 0000005acc3ac968] 
0000005acc3ac9c0 00007ffd4dd77f18 xxx.xxx.Cache.xxx.GetCacheNotAreaDataEntity[[System.__Canon, mscorlib]](System.String, System.String, System.String)

...

0:030> ~33s
ntdll!NtWaitForMultipleObjects+0xa:
00007ffd`b81f07ba c3              ret
0:033> !clrstack 
OS Thread Id: 0x3ad4 (33)
        Child SP               IP Call Site
0000005accabae90 00007ffdb81f07ba [GCFrame: 0000005accabae90] 
0000005accabafb8 00007ffdb81f07ba [HelperMethodFrame_1OBJ: 0000005accabafb8] System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object)
0000005accabb0d0 00007ffdaac60d64 System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken)
0000005accabb160 00007ffdaac5b4bb System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken)
0000005accabb1d0 00007ffdab5a01d1 System.Threading.Tasks.Task.InternalWait(Int32, System.Threading.CancellationToken)
0000005accabb2a0 00007ffdab59cfa7 System.Threading.Tasks.Task`1[[System.__Canon, mscorlib]].GetResultxxx(Boolean)
0000005accabb2e0 00007ffd4d8d338f xxx.Config.xxx.Config`1[[System.__Canon, mscorlib]].GetConfig(xxx.Config.Model.ConfigListener, System.Func`2<xxx.Config.Request.GetConfigRequest,System.Threading.Tasks.Task`1<System.String>>)
0000005accabb340 00007ffd4d8d2f40 xxx.Config.xxx.Config`1[[System.__Canon, mscorlib]].get_Item(System.String, System.String)
0000005accabb3c0 00007ffd4dd78f7f xxx.Util.BaseConfig.get_GetRedisConn()
0000005accabb440 00007ffd4dd78e9c xxx.xxx.RedisConnectionHelp.GetConnectionString()
0000005accabb4a0 00007ffd4dd789cb xxx.xxx.RedisConnectionHelp..cctor()
0000005accabb940 00007ffdabef6953 [GCFrame: 0000005accabb940] 
0000005accabc5b0 00007ffdabef6953 [PrestubMethodFrame: 0000005accabc5b0] xxx.xxx.RedisConnectionHelp.get_Instance()
0000005accabc870 00007ffd4dd78911 xxx.xxx.RedisCache..ctor(Int32, System.String)
0000005accabc8e0 00007ffd4dd78038 xxx.xxx.CacheByRedis.HashGet[[System.__Canon, mscorlib]](System.String, System.String, Int32)
0000005accabc988 00007ffdabef1f7c [StubHelperFrame: 0000005accabc988] 
0000005accabc9e0 00007ffd4dd77f18 xxx.Core.Cache.xxx.GetCacheNotAreaDataEntity[[System.__Canon, mscorlib]](System.String, System.String, System.String)
...

It is not difficult to find that thread 30 is stuck in redisconnectionhelp get_ At instance (), line 33 has entered redisconnectionhelp get_ In the instance () method, finally wait for the Result at GetConfig(). According to experience, thread 30 seems to be waiting for the lock and thread 33 is waiting for the asynchronous Result. The next breakthrough is to explore redisconnectionhelp Code at instance.

3. Find the problem code

Next, use the decompile tool ILSpy to find the problem code.

public static class RedisConnectionHelp
{
    public static ConnectionMultiplexer Instance
    {
        get
        {
            if (_instance == null)
            {
                lock (Locker)
                {
                    if (_instance == null || !_instance.IsConnected)
                    {
                        _instance = GetManager();
                    }
                }
            }
            return _instance;
        }
    }
}

Thread 30 is stuck at the Locker. Next, dig deep into the GetManager() method executed by thread 33. The simplified code is as follows:

public T this[string dataId, string key = ""]
{
    get
    {
        try
        {
            string config = GetConfig(configListener, new NacosConfigClient(Base.NacosConfiguration).DoGetConfigAsync);

            return JsonConvert.DeserializeObject<T>(config);
        }
        catch (Exception ex)
        {
            return default(T);
        }
    }
}

private string GetConfig(ConfigListener listener, Func<GetConfigRequest, Task<string>> action)
{
    var text2 = action(new GetConfigRequest
    {
        DataId = listener.DataId,
        Group = listener.Group,
        Tenant = text
    }).Result;

    return text2;
}

internal async Task<string> DoGetConfigAsync(GetConfigRequest request)
{
    IRestResponse restResponse = await HttpUtil.Request(currentServerAddr, Method.GET, request.ParamValues(), xxx);

    return restResponse.Content;
}

You can see that the code is stuck on the Result and waits indefinitely. Here I think of the synchronization context. I think his program is NET 4.8 Net MVC, remember that the context should be AspNetSynchronizationContext. For specific deadlock reasons, see my article: A task Result is deadlock. How do you write this code? , there are about four solutions.

  1. use. ConfigureAwait(false)
  2. Change to fully asynchronous
  3. Use Task to wrap another layer.
  4. Change to full synchronization

3: Summary

In fact, this accident is mainly due to the asynchronous code in the synchronous code There are many articles attacking the deadlock caused by Result in ASP Net core has removed this pit of synchronization context. The suggestion to friends is to change to full synchronization, and the deadlock problem disappears.

Ha ha, I'm so happy for my friends!

Topics: .NET