An interesting memory leak bug.

Posted by carlos1234 on Sun, 20 Oct 2019 17:02:52 +0200

bug review

Bug fix timeline

  • At 10:00 p.m. on x month x, the memory of X service began to rise rapidly.
  • At 10:18 on X, users reported that x service memory rose rapidly.
  • mongodb was restarted due to insufficient memory at 10:20 on x
  • Restart x service at 10:22 on X
  • At 11:00 p.m. on x month x, the memory of X service rose rapidly again.
  • At 14:45, the x service is rolled back and updated again. The x service returns to normal.
  • At 19:48 on x, it was found that the Enum.random function had a performance problem, which was fixed and tested.
  • At 20:51 on x / XX, fixed the problem of fast rising of x service memory and updated it to online.

Brief introduction of bug causes

The process pool is used to eliminate the bottleneck.
The random function (Enum.random) used to allocate processes calls Enumerable.count to get the List length.
exrm, the Elixir publishing tool used, has a bug and does not consolidate the calls. As a result, when Enumerable.count is called every time, it needs to call code ﹣ server to confirm whether the List module exists. When the access jitters, if the allocation main process suddenly accumulates a large number of messages, the call code ﹣ server will become slower, resulting in slower processing speed, more and more messages, worse performance, and guidance. Cause memory to skyrocket.

Positioning process

Find the most memory consuming process.

Enum.map(:erlang.processes(), fn proc -> {:erlang.process\_info(proc, :memory), proc} end) |> Enum.sort(fn({{\_, a}, \_}, {{\_, b}, \_}) -> a > b end) |> List.first

It is found that the process has extremely high reductions, message queue len, memory.

iex(xxxx@xxxx)5> :erlang.process_info(pid)
[registered_name: xxx, current_function: {:code_server, :call, 1},
 initial_call: {:proc_lib, :init_p, 5}, status: :running,
 message_queue_len: 1426935,
 messages: [notify: {:log, :"xxx.xxxxx.xxxxxxx",
   <<107, 0, 0, 0, 3, 100, 97, 116, 97, 0, 82, 0, 0, 0, 2, 112, 108, 97, 121,
     101, 114, 95, 115, 101, 115, 115, 105, 111, 110, 0, 11, 0, 0, 0, 52, 74,
     85, 95, 89, 95, 55, ...>>},
     ............
  ...], links: [#PID<0.1749.0>],
 dictionary: [rand_seed: {%{bits: 58,
     jump: #Function<8.15449617/1 in :rand.mk_alg/1>,
     next: #Function<5.15449617/1 in :rand.mk_alg/1>, type: :exrop,
     uniform: #Function<6.15449617/1 in :rand.mk_alg/1>,
     uniform_n: #Function<7.15449617/2 in :rand.mk_alg/1>, weak_low_bits: 1},
   [92689310913950825 | 4906624003405292]},
  "$initial_call": {GenEvent, :init_it, 6},
  "$ancestors": [XXX.Route.Supervisor, XXX.Supervisor, #PID<0.1747.0>]],
 trap_exit: false, error_handler: :error_handler, priority: :normal,
 group_leader: #PID<0.1746.0>, total_heap_size: 28600107, heap_size: 18481566,
 stack_size: 45, reductions: 7024105801,
 garbage_collection: [max_heap_size: %{error_logger: true, kill: true, size: 0},
  min_bin_vheap_size: 46422, min_heap_size: 233, fullsweep_after: 65535,
  minor_gcs: 2], suspending: []]

Doubt Enum.random

This process is a GenEvent process, which has three handle ¢ u event ({: log, name, bin}, state) functions. Only the handle ¢ u event that writes the file has been moved. It is suspected that the performance of Enum.random is too poor. benchmark found that the average time-consuming is 0.7 μ s. it will not become a bottleneck.

Rollback code can fix problems

The pressure test environment could not be reproduced, so the code had to be rolled back.
After rollback, the online memory will not skyrocket.

Reproduce and solve problems

Online elixir version:

 Erlang/OTP 20 [erts-9.0] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:10] [kernel-poll:false]
Interactive Elixir (1.5.1) - press Ctrl+C to exit (type h() ENTER for help)

Pressure test environment elixir version:

Erlang/OTP 20 [erts-9.3] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:10] [hipe] [kernel-poll:false]
Interactive Elixir (1.6.4) - press Ctrl+C to exit (type h() ENTER for help)

It is suspected that the elixir version is different.
After adjusting the elixir version of the pressure test environment to be consistent with the online version, the memory inflation problem has been reproduced, but it can not be reproduced stably.
After finding ctrl + c and continuing, making jitter can stably reproduce the problem of memory inflation.
After removing Enum.random in def handle_event({:log, _name, _bin}), the problem no longer appears.

Cause of the problem

Because removing Enum.random can solve the problem, and only a specific version of elixir can reproduce the problem, so compare the implementation of elixir Enum.random in different versions.

Enum.random under elixir v1.5.1

Enum.random will eventually call, random_integer(0, count_of_enumerable)
enum.ex:2763

defp random_integer(lower_limit, upper_limit) do
    lower_limit + :rand.uniform(upper_limit - lower_limit + 1) - 1
  end

rand.erl:281

uniform/1
-spec uniform(N :: pos_integer()) -> X :: pos_integer().
uniform(N) ->
    {X, Seed} = uniform_s(N, seed_get()),
    _ = seed_put(Seed),
    X.

rand.erl:403

seed_get() ->
    case get(?SEED_DICT) of
        undefined -> seed(?DEFAULT_ALG_HANDLER);
        Old -> Old  % no type checking here
    end.

Save / get the process dictionary used for seed, no need to worry about the efficiency.
Because elixir version 1.6.0 can not reproduce the problem (v1.5.1 - > v1.5.3 can reproduce). Compared with elixir version 1.5.1 / v1.6.3, no suspicious points were found.

Why call code? Server?

Through: observer.start, you can see the call stack of MainChannel and trace process messages. Every Enum.random call, there will be a code_server:call. The call stack on the front line is also a call code_server, which must not be a coincidence.

Protocol consolidation

https://elixir-lang.org/getti...

When working with Elixir projects, using the Mix build tool, you may see the output as follows:

Consolidated String.Chars
Consolidated Collectable
Consolidated List.Chars
Consolidated IEx.Info
Consolidated Enumerable
Consolidated Inspect

Those are all protocols that ship with Elixir and they are being consolidated. Because a protocol can dispatch to any data type, the protocol must check on every call if an implementation for the given type exists. This may be expensive.
However, after our project is compiled using a tool like Mix, we know all modules that have been defined, including protocols and their implementations. This way, the protocol can be consolidated into a very simple and fast dispatch module.
From Elixir v1.2, protocol consolidation happens automatically for all projects. We will build our own project in the Mix and OTP guide.

exrm bug causes consolidation to fail

Decompile the Enumerable of v1.5.1, and it is found that the call "ensure u compiled".

dc = fn x ->
p = :code.which(x)
{:ok,{_,[{:abstract_code,{_, ac}}]}} = :beam_lib.chunks(p,  [:abstract_code])
IO.write(:erl_prettypr.format(:erl_syntax.form_list(ac)))
end
dc.(Enumerable)
count(_@1) -> ('impl_for!'(_@1)):count(_@1).

impl_for(_@1) when erlang:is_list(_@1) ->
    case 'impl_for?'('Elixir.Enumerable.List') of
      true -> 'Elixir.Enumerable.List':'__impl__'(target);
      false -> any_impl_for()
    end;

'impl_for?'(_@1) ->
    case 'Elixir.Code':'ensure_compiled?'(_@1) of
      true ->
          'Elixir.Kernel':'function_exported?'(_@1, '__impl__',
                                               1);
      false -> false;
      _@2 -> erlang:error({badbool, 'and', _@2})
    end.

Ensure? Called code? Server

code.ex:575

@spec ensure_compiled(module) ::
        {:module, module} | {:error, :embedded | :badfile | :nofile | :on_load_failure}
  def ensure_compiled(module) when is_atom(module) do
    case :code.ensure_loaded(module) do
      {:error, :nofile} = error ->
        if is_pid(:erlang.get(:elixir_compiler_pid)) and
           Kernel.ErrorHandler.ensure_compiled(module, :module) do
          {:module, module}
        else
          error
        end
      other -> other
    end
  end

code.erl:149

-spec ensure_loaded(Module) -> {module, Module} | {error, What} when
      Module :: module(),
      What :: embedded | badfile | nofile | on_load_failure.
ensure_loaded(Mod) when is_atom(Mod) -> 
    call({ensure_loaded,Mod}).

code.erl:627

call(Req) ->
    code_server:call(Req).

code_server.erl:136

-spec call(term()) -> term().
call(Req) ->
    Ref = erlang:monitor(process, ?MODULE),
    ?MODULE ! {code_call, self(), Req},
    receive 
  {?MODULE, Reply} ->
            erlang:demonitor(Ref,[flush]),
      Reply;
        {'DOWN',Ref,process,_,_} ->
            exit({'DOWN',code_server,Req})
    end.

exrm bug

~/ejoy/fuck/deps/exrm/lib/exrm/plugins » diff -u consolidation.ex ~/consolidation.ex                         
--- consolidation.ex  2018-07-26 20:00:38.000000000 +0800
+++ /home/enjolras/consolidation.ex 2018-07-26 19:59:53.324240446 +0800
@@ -27,11 +27,13 @@
       debug "Packaging consolidated protocols..."
 
       # Add overlay to relx.config which copies consolidated dir to release
-      consolidated_path = Path.join([Mix.Project.build_path, "consolidated"])
+      consolidated_path = Path.join([Mix.Project.build_path, "lib", config.name, "consolidated"])
       case File.ls(consolidated_path) do
         {:error, _} ->
+          IO.puts("consolidated notfound #{inspect consolidated_path}")
           config
         {:ok, filenames} ->
+          IO.puts("consolidated #{inspect filenames}")
           dest_path = "lib/#{config.name}-#{config.version}/consolidated"
           overlays = [overlay: Enum.map(filenames, fn name ->
             {:copy, '#{consolidated_path}/#{name}', '#{Path.join([dest_path, name])}'}

After fix, repackage. Decompile as above.
The impl for of list no longer calls' impl for? ', and the message stacking problem can no longer be reproduced.

impl_for(_@1) when erlang:is_list(_@1) ->
    case 'impl_for?'('Elixir.Enumerable.List') of
      true -> 'Elixir.Enumerable.List':'__impl__'(target);
      false -> any_impl_for()
    end;

It's a direct return.

impl_for(x) when erlang:is_list(x) ->
    'Elixir.Enumerable.List';

Why can't elixir v1.6.3 reproduce the problem?

elixir 1.5.1

enum.ex:1708

def random(enumerable) do
    case Enumerable.count(enumerable) do
      {:ok, 0} ->
        raise Enum.EmptyError
      {:ok, count} ->
        at(enumerable, random_integer(0, count - 1))
      {:error, _} ->
        case take_random(enumerable, 1) do
          []     -> raise Enum.EmptyError
          [elem] -> elem
        end
    end
  end

elixir 1.6.3

enum.ex:1790

def random(enumerable) do
    result =
      case backwards_compatible_slice(enumerable) do
        {:ok, 0, _} ->
          []
        {:ok, count, fun} when is_function(fun) ->
          fun.(random_integer(0, count - 1), 1)
        {:error, _} ->
          take_random(enumerable, 1)
      end
    case result do
      [] -> raise Enum.EmptyError
      [elem] -> elem
    end
  end
  # TODO: Remove me on Elixir v1.8
  defp backwards_compatible_slice(args) do
    try do
      Enumerable.slice(args)
    catch
      :error, :undef ->
        case System.stacktrace() do
          [{module, :slice, [^args], _} | _] -> {:error, module}
          stack -> :erlang.raise(:error, :undef, stack)
        end
    end
  end

In the same decompilation, Enumerable.slice will call impl_for and call code_server, which can't explain why v1.6.3 can't be reproduced.

slice(_@1) -> ('impl_for!'(_@1)):slice(_@1).

summary

  • The problem may appear in the most unlikely place. In many cases, binary rollback can solve the problem quickly.
  • It is easy to have performance problems outside a busy process call, because the returned messages need to traverse the entire mailbox.

Topics: Erlang supervisor MongoDB