Bug review
Bug fix timeline
- At 10:00 p.m. on X/X, memory usage of the X service began to rise rapidly.
- At 10:18 on X, users reported that the X service's memory was rising rapidly.
- At 10:20 on X, MongoDB was restarted due to insufficient memory.
- At 10:22 on X, the X service was restarted.
- At 11:00 p.m. on X/X, memory usage of the X service rose rapidly again.
- At 14:45, the X service was rolled back and redeployed; it returned to normal.
- At 19:48 on X, a performance problem in Enum.random was identified, fixed, and tested.
- At 20:51 on X/XX, the fix for the X service's rapidly rising memory was deployed to production.
Brief summary of the root cause
A process pool is used to avoid a single-process bottleneck.
The random function (Enum.random) used to pick a process from the pool calls Enumerable.count to get the length of the list.
exrm, the Elixir release tool we use, has a bug and does not package the consolidated protocols into the release. As a result, every call to Enumerable.count has to ask code_server whether the Enumerable.List module exists. When traffic jitters and the dispatching process suddenly accumulates a large backlog of messages, each code_server call becomes slower (the reply has to be picked out of the ever-growing mailbox), so messages pile up faster than they can be processed, performance keeps degrading, and memory skyrockets.
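As a rough illustration of the dispatch pattern described above, here is a minimal sketch; the module and function names are hypothetical, not the actual project code. The key point is that Enum.random/1 goes through the Enumerable protocol on every event:

```elixir
# Hypothetical sketch of the dispatch pattern; names are illustrative only.
defmodule PoolDispatch do
  # `workers` is a plain list of worker pids held by the dispatching process.
  def dispatch(workers, event) when is_list(workers) do
    # Enum.random/1 dispatches through the Enumerable protocol. Without
    # protocol consolidation, every call asks code_server whether the
    # Enumerable.List implementation exists before counting the list.
    worker = Enum.random(workers)
    send(worker, event)
  end
end
```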
Positioning process
Find the process consuming the most memory.
```elixir
Enum.map(:erlang.processes(), fn proc -> {:erlang.process_info(proc, :memory), proc} end)
|> Enum.sort(fn {{_, a}, _}, {{_, b}, _} -> a > b end)
|> List.first()
```
This process was found to have extremely high reductions, message_queue_len, and memory.
```elixir
iex(xxxx@xxxx)5> :erlang.process_info(pid)
[registered_name: xxx,
 current_function: {:code_server, :call, 1},
 initial_call: {:proc_lib, :init_p, 5},
 status: :running,
 message_queue_len: 1426935,
 messages: [notify: {:log, :"xxx.xxxxx.xxxxxxx",
     <<107, 0, 0, 0, 3, 100, 97, 116, 97, 0, 82, 0, 0, 0, 2, 112, 108, 97,
       121, 101, 114, 95, 115, 101, 115, 115, 105, 111, 110, 0, 11, 0, 0, 0,
       52, 74, 85, 95, 89, 95, 55, ...>>},
   ...],
 links: [#PID<0.1749.0>],
 dictionary: [rand_seed: {%{bits: 58, jump: #Function<8.15449617/1 in :rand.mk_alg/1>,
      next: #Function<5.15449617/1 in :rand.mk_alg/1>, type: :exrop,
      uniform: #Function<6.15449617/1 in :rand.mk_alg/1>,
      uniform_n: #Function<7.15449617/2 in :rand.mk_alg/1>,
      weak_low_bits: 1}, [92689310913950825 | 4906624003405292]},
   "$initial_call": {GenEvent, :init_it, 6},
   "$ancestors": [XXX.Route.Supervisor, XXX.Supervisor, #PID<0.1747.0>]],
 trap_exit: false,
 error_handler: :error_handler,
 priority: :normal,
 group_leader: #PID<0.1746.0>,
 total_heap_size: 28600107,
 heap_size: 18481566,
 stack_size: 45,
 reductions: 7024105801,
 garbage_collection: [max_heap_size: %{error_logger: true, kill: true, size: 0},
   min_bin_vheap_size: 46422, min_heap_size: 233, fullsweep_after: 65535,
   minor_gcs: 2],
 suspending: []]
```
Suspecting Enum.random
This process is a GenEvent process with three handle_event({:log, name, bin}, state) clauses, and only the clause that writes to a file had been changed. We suspected that Enum.random performed too poorly, but a benchmark showed an average cost of about 0.7 µs per call, which would not be a bottleneck.
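For reference, a micro-benchmark along these lines (a sketch, not the original benchmark code; the pool size and iteration count are assumptions) yields the sub-microsecond figure on a properly consolidated build:

```elixir
# Rough micro-benchmark sketch; pool size and iteration count are made up.
pool = Enum.to_list(1..16)
iterations = 1_000_000

{micros, :ok} =
  :timer.tc(fn ->
    Enum.each(1..iterations, fn _ -> Enum.random(pool) end)
  end)

IO.puts("Enum.random/1: #{micros / iterations} µs per call on average")
```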
Rolling back the code fixes the problem
The problem could not be reproduced in the load-test environment, so the code had to be rolled back.
After the rollback, memory in production no longer skyrocketed.
Reproducing and fixing the problem
Elixir version in production:
```
Erlang/OTP 20 [erts-9.0] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:10] [kernel-poll:false]

Interactive Elixir (1.5.1) - press Ctrl+C to exit (type h() ENTER for help)
```
Elixir version in the load-test environment:
```
Erlang/OTP 20 [erts-9.3] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:10] [hipe] [kernel-poll:false]

Interactive Elixir (1.6.4) - press Ctrl+C to exit (type h() ENTER for help)
```
We suspected the difference in Elixir versions.
After aligning the load-test environment's Elixir version with the production version, the memory growth was reproduced, but not reliably.
We then found that pressing Ctrl+C and continuing, to introduce jitter, reproduced the memory growth reliably.
After removing Enum.random from def handle_event({:log, _name, _bin}), the problem no longer appeared.
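The real handler is project code; purely as an illustration, taking Enum.random off the hot path could look like the sketch below, e.g. picking the worker round-robin from a tuple so no Enumerable dispatch happens per event (the state fields are hypothetical):

```elixir
# Hypothetical handler shape after removing Enum.random; field names are
# illustrative, not the actual project code.
def handle_event({:log, _name, bin}, %{workers: workers, next: next} = state) do
  # Round-robin over a tuple of worker pids: no protocol dispatch, hence no
  # code_server call on the hot path.
  worker = elem(workers, rem(next, tuple_size(workers)))
  send(worker, {:write, bin})
  {:ok, %{state | next: next + 1}}
end
```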
Cause of the problem
Since removing Enum.random fixes the problem, and only specific Elixir versions can reproduce it, we compared the implementation of Enum.random across Elixir versions.
Enum.random in Elixir v1.5.1
Enum.random eventually calls random_integer(0, count_of_enumerable).
enum.ex:2763
```elixir
defp random_integer(lower_limit, upper_limit) do
  lower_limit + :rand.uniform(upper_limit - lower_limit + 1) - 1
end
```
rand.erl:281
```erlang
%% uniform/1
-spec uniform(N :: pos_integer()) -> X :: pos_integer().
uniform(N) ->
    {X, Seed} = uniform_s(N, seed_get()),
    _ = seed_put(Seed),
    X.
```
rand.erl:403
```erlang
seed_get() ->
    case get(?SEED_DICT) of
        undefined -> seed(?DEFAULT_ALG_HANDLER);
        Old -> Old % no type checking here
    end.
```
The seed is stored in and read from the process dictionary, so there is no efficiency concern here.
Elixir v1.6.0 cannot reproduce the problem, while v1.5.1 through v1.5.3 can. Comparing the Enum.random implementations of Elixir v1.5.1 and v1.6.3, we found nothing suspicious.
Why is code_server called?
Through :observer.start we can see the call stack of MainChannel and trace its messages. Every Enum.random call produces a code_server:call. The call stack seen on the production node was also inside code_server:call, which can hardly be a coincidence.
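Besides :observer, the extra call can be confirmed with the standard :dbg tracer. This is a hedged sketch; `pid` stands for the suspect GenEvent process found earlier:

```elixir
# Trace calls to code_server:call/1 made by the suspect process.
:dbg.tracer()
:dbg.p(pid, [:c])                                  # :c = trace function calls
:dbg.tp(:code_server, :call, 1, [{:_, [], []}])    # match-all spec on call/1
# On the unconsolidated release, every Enum.random/1 in the handler now emits
# one code_server:call/1 trace message.
```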
Protocol consolidation
https://elixir-lang.org/getti...
When working with Elixir projects, using the Mix build tool, you may see the output as follows:
```
Consolidated String.Chars
Consolidated Collectable
Consolidated List.Chars
Consolidated IEx.Info
Consolidated Enumerable
Consolidated Inspect
```
Those are all protocols that ship with Elixir and they are being consolidated. Because a protocol can dispatch to any data type, the protocol must check on every call if an implementation for the given type exists. This may be expensive.
However, after our project is compiled using a tool like Mix, we know all modules that have been defined, including protocols and their implementations. This way, the protocol can be consolidated into a very simple and fast dispatch module.
From Elixir v1.2, protocol consolidation happens automatically for all projects. We will build our own project in the Mix and OTP guide.
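A quick way to check this on a running node (not part of the original writeup, just a convenient standard API) is Protocol.consolidated?/1:

```elixir
# true when Enumerable has been consolidated into a fast dispatch module;
# on the broken exrm release this is expected to be false.
Protocol.consolidated?(Enumerable)
```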
exrm bug causes consolidation to fail
Decompiling Enumerable from the v1.5.1 release shows that it calls ensure_compiled?.
```elixir
dc = fn x ->
  p = :code.which(x)
  {:ok, {_, [{:abstract_code, {_, ac}}]}} = :beam_lib.chunks(p, [:abstract_code])
  IO.write(:erl_prettypr.format(:erl_syntax.form_list(ac)))
end
```
```erlang
%% dc.(Enumerable)
count(_@1) -> ('impl_for!'(_@1)):count(_@1).

impl_for(_@1) when erlang:is_list(_@1) ->
    case 'impl_for?'('Elixir.Enumerable.List') of
        true -> 'Elixir.Enumerable.List':'__impl__'(target);
        false -> any_impl_for()
    end;

'impl_for?'(_@1) ->
    case 'Elixir.Code':'ensure_compiled?'(_@1) of
        true -> 'Elixir.Kernel':'function_exported?'(_@1, '__impl__', 1);
        false -> false;
        _@2 -> erlang:error({badbool, 'and', _@2})
    end.
```
ensure_compiled? calls code_server
code.ex:575
```elixir
@spec ensure_compiled(module) ::
        {:module, module} | {:error, :embedded | :badfile | :nofile | :on_load_failure}
def ensure_compiled(module) when is_atom(module) do
  case :code.ensure_loaded(module) do
    {:error, :nofile} = error ->
      if is_pid(:erlang.get(:elixir_compiler_pid)) and
           Kernel.ErrorHandler.ensure_compiled(module, :module) do
        {:module, module}
      else
        error
      end

    other ->
      other
  end
end
```
code.erl:149
```erlang
-spec ensure_loaded(Module) -> {module, Module} | {error, What} when
      Module :: module(),
      What :: embedded | badfile | nofile | on_load_failure.
ensure_loaded(Mod) when is_atom(Mod) ->
    call({ensure_loaded, Mod}).
```
code.erl:627
```erlang
call(Req) -> code_server:call(Req).
```
code_server.erl:136
```erlang
-spec call(term()) -> term().
call(Req) ->
    Ref = erlang:monitor(process, ?MODULE),
    ?MODULE ! {code_call, self(), Req},
    receive
        {?MODULE, Reply} ->
            erlang:demonitor(Ref, [flush]),
            Reply;
        {'DOWN', Ref, process, _, _} ->
            exit({'DOWN', code_server, Req})
    end.
```
exrm bug
```diff
~/ejoy/fuck/deps/exrm/lib/exrm/plugins » diff -u consolidation.ex ~/consolidation.ex
--- consolidation.ex    2018-07-26 20:00:38.000000000 +0800
+++ /home/enjolras/consolidation.ex     2018-07-26 19:59:53.324240446 +0800
@@ -27,11 +27,13 @@
     debug "Packaging consolidated protocols..."
     # Add overlay to relx.config which copies consolidated dir to release
-    consolidated_path = Path.join([Mix.Project.build_path, "consolidated"])
+    consolidated_path = Path.join([Mix.Project.build_path, "lib", config.name, "consolidated"])
     case File.ls(consolidated_path) do
       {:error, _} ->
+        IO.puts("consolidated notfound #{inspect consolidated_path}")
         config
       {:ok, filenames} ->
+        IO.puts("consolidated #{inspect filenames}")
         dest_path = "lib/#{config.name}-#{config.version}/consolidated"
         overlays = [overlay: Enum.map(filenames, fn name ->
           {:copy, '#{consolidated_path}/#{name}', '#{Path.join([dest_path, name])}'}
```
After the fix, repackage the release and decompile it as above.
impl_for for lists no longer calls 'impl_for?', and the message-backlog problem can no longer be reproduced. Before the fix it was:
```erlang
impl_for(_@1) when erlang:is_list(_@1) ->
    case 'impl_for?'('Elixir.Enumerable.List') of
        true -> 'Elixir.Enumerable.List':'__impl__'(target);
        false -> any_impl_for()
    end;
```
After the fix it is a direct return:
```erlang
impl_for(x) when erlang:is_list(x) -> 'Elixir.Enumerable.List';
```
Why can't Elixir v1.6.3 reproduce the problem?
Elixir 1.5.1
enum.ex:1708
```elixir
def random(enumerable) do
  case Enumerable.count(enumerable) do
    {:ok, 0} ->
      raise Enum.EmptyError

    {:ok, count} ->
      at(enumerable, random_integer(0, count - 1))

    {:error, _} ->
      case take_random(enumerable, 1) do
        [] -> raise Enum.EmptyError
        [elem] -> elem
      end
  end
end
```
Elixir 1.6.3
enum.ex:1790
```elixir
def random(enumerable) do
  result =
    case backwards_compatible_slice(enumerable) do
      {:ok, 0, _} ->
        []

      {:ok, count, fun} when is_function(fun) ->
        fun.(random_integer(0, count - 1), 1)

      {:error, _} ->
        take_random(enumerable, 1)
    end

  case result do
    [] -> raise Enum.EmptyError
    [elem] -> elem
  end
end

# TODO: Remove me on Elixir v1.8
defp backwards_compatible_slice(args) do
  try do
    Enumerable.slice(args)
  catch
    :error, :undef ->
      case System.stacktrace() do
        [{module, :slice, [^args], _} | _] -> {:error, module}
        stack -> :erlang.raise(:error, :undef, stack)
      end
  end
end
```
In the same decompiled output, Enumerable.slice also calls impl_for and hence code_server, so this does not explain why v1.6.3 cannot reproduce the problem.
```erlang
slice(_@1) -> ('impl_for!'(_@1)):slice(_@1).
```
Summary
- The problem may appear in the most unlikely place; in many cases, rolling back the binary resolves it quickly.
- Making a synchronous call out of a busy process easily causes performance problems, because the reply message has to be found by scanning the entire mailbox (see the sketch below).
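A minimal, self-contained sketch of that last point (message counts and names are made up) shows how a selective receive slows down once the mailbox is long, which is exactly what happens to the receive inside code_server:call when the caller's mailbox is already huge:

```elixir
# Demo: a receive that has to skip a large backlog of unrelated messages.
defmodule MailboxScan do
  def run(noise \\ 1_000_000) do
    me = self()

    # Flood our own mailbox with messages the receive below will not match,
    # imitating the backlog that built up in the GenEvent process.
    Enum.each(1..noise, fn i -> send(me, {:noise, i}) end)
    send(me, :wanted)

    {micros, :ok} =
      :timer.tc(fn ->
        receive do
          :wanted -> :ok
        end
      end)

    IO.puts("selective receive scanned #{noise} messages in #{micros} µs")
  end
end
```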