Intro to Processing Log File with Elixir

Introduction

I recently interviewed for a Senior Platform Engineer role at a company that uses Elixir. The code assessment was to write a script that processes a log file and does the following:
  1. Parse access.log file hosted as gist
  2. Get tcp_hit percentage per video id
  3. Sort by video id (video id is an integer)
  4. print to console or write to file
  5. add tests if there is still time
Here are the exact instructions verbatim:
Write a script that does the following:

1. Parse access.log file hosted as gist
2. Get tcp_hit percentage per video id
3. Sort by video id (video id is an integer)
4. print to console or write to file
5. add tests if there is still time

there are two different url formats to handle:

http://example.com/04C0BF/v2/sources/content-owners/cinedigm-tubi/384055/v201708302148-2273k.mp4+4023936.ts

http://example.com/04C0BF/ads/transcodes/006817/2791522/v0402000243-854x480-HD-1401k.mp4+22355.ts

example line:
1523756544 3 86.45.165.83 1845784 152.195.141.240 80 TCP_HIT/200 1846031 GET http://example.com/04C0BF/v2/sources/content-owners/sgl-entertainment/275211/v0401185814-1389k.mp4+740005.ts - 0 486 "-" "ItubExoPlayer/2.12.9 (Linux;Android 6.0) ExoPlayerLib/2.4.2" 49343 "-"

lines

cache hit/miss:
TCP_HIT

video id:
275211


URL
https://gist.github.com/bigbassroller/4980b59d205a33158b27bc10c6a13ed5

Steps to process a log file

So from this information, we break down the task into these steps:
  1. Fetch data from URL
  2. Split each new line into a list item
  3. Split each line on spaces into list items
  4. Filter items to only contain the URL and TCP_HIT/MISS
  5. Find the six-digit video ID in the URL. It should be the first integer in the path after “example.com/04C0BF/v2/sources/content-owners/” or “example.com/04C0BF/ads/transcodes/”
  6. Group by video ID
  7. Count the cache hits and misses for each video
  8. Calculate the cache hit percentage
  9. Sort by video id
  10. Print to file
  11. Get a job (profit)
We start by fetching the data from a GitHub Gist, splitting the text on newlines into a list of lines, and then splitting each line on spaces into a list of fields. Each line ends up looking something like this:
["1523756544", "3", "86.45.165.83", "1845784", "152.195.141.240", "80",
"TCP_HIT/200", "1846031", "GET",
"http://example.com/04C0BF/v2/sources/content-owners/sgl-entertainment/275211/v0401185814-1389k.mp4+740005.ts",
"-", "0", "486", "\"-\"", "\"ItubExoPlayer/2.12.9", "(Linux;Android",
"6.0)", "ExoPlayerLib/2.4.2\"", "49343", "\"-\"", ""]
Paying attention to the details is important here. The requirements state that the URLs come in two formats:
http://example.com/04C0BF/v2/sources/content-owners/cinedigm-tubi/......
and
http://example.com/04C0BF/ads/transcodes/006817/....
This is one of the most important details. There are 5000+ lines, some of them looking something like this:
http://example.com/80C0BF/subtitles/422e3734-382b-4bb3-a753-e3f003d9cdd6.m3u8
We don’t want those URLs. It’s also important to note that the video IDs must be treated as integers for the sort by ID to work. We then group by video ID, count the cache hits and misses, and calculate the hit percentage: the number of hits divided by the total number of requests (hits plus misses), times 100. Finally, we sort by video ID and choose how to present the data.
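The steps described above can be sketched end to end in a few functions. This is only a rough sketch under stated assumptions, not the solution the later posts build: the module name HitRateSketch and its function names are made up, the video ID is taken to be the first all-digit path segment after one of the two prefixes (as step 5 describes), and any status that does not start with TCP_HIT is counted as a miss.

```elixir
defmodule HitRateSketch do
  # The two URL prefixes named in the requirements; anything else is skipped.
  @prefixes [
    "http://example.com/04C0BF/v2/sources/content-owners/",
    "http://example.com/04C0BF/ads/transcodes/"
  ]

  # Raw log text in, [{video_id, hit_percentage}] sorted by video ID out.
  def report(body) do
    body
    |> String.split("\n", trim: true)
    |> Enum.map(&String.split(&1, " "))
    |> Enum.flat_map(&parse_fields/1)
    |> Enum.group_by(fn {id, _hit?} -> id end, fn {_id, hit?} -> hit? end)
    |> Enum.map(fn {id, flags} ->
      hits = Enum.count(flags, & &1)
      {id, hit_percentage(hits, length(flags) - hits)}
    end)
    |> Enum.sort_by(fn {id, _pct} -> id end)
  end

  # Returns {:ok, id} with the ID as an integer, or :skip for other URLs.
  def video_id(url) do
    with prefix when is_binary(prefix) <-
           Enum.find(@prefixes, &String.starts_with?(url, &1)),
         rest = String.replace_prefix(url, prefix, ""),
         digits when is_binary(digits) <-
           rest |> String.split("/") |> Enum.find(&String.match?(&1, ~r/\A\d+\z/)) do
      {:ok, String.to_integer(digits)}
    else
      nil -> :skip
    end
  end

  # hits / (hits + misses) * 100, rounded to one decimal place.
  def hit_percentage(hits, misses) when hits + misses > 0 do
    Float.round(hits / (hits + misses) * 100, 1)
  end

  # Field 6 is the cache status and field 9 the URL (zero-based), as in the
  # split line shown earlier. Lines without a wanted URL produce nothing.
  defp parse_fields(fields) do
    with status when is_binary(status) <- Enum.at(fields, 6),
         url when is_binary(url) <- Enum.at(fields, 9),
         {:ok, id} <- video_id(url) do
      [{id, String.starts_with?(status, "TCP_HIT")}]
    else
      _ -> []
    end
  end
end

# Made-up sample lines in the same field order as the real log.
line = fn status, url ->
  "1523756544 3 86.45.165.83 1845784 152.195.141.240 80 #{status} 1846031 GET #{url} - 0 486"
end

owner = "http://example.com/04C0BF/v2/sources/content-owners/sgl-entertainment/275211/v0401185814-1389k.mp4+740005.ts"
ad = "http://example.com/04C0BF/ads/transcodes/006817/2791522/v0402000243-854x480-HD-1401k.mp4+22355.ts"
subtitle = "http://example.com/80C0BF/subtitles/422e3734-382b-4bb3-a753-e3f003d9cdd6.m3u8"

body =
  Enum.join(
    [
      line.("TCP_HIT/200", owner),
      line.("TCP_MISS/200", owner),
      line.("TCP_HIT/200", ad),
      line.("TCP_HIT/200", subtitle)
    ],
    "\n"
  )

IO.inspect(HitRateSketch.report(body))
```

Note that converting the ID with String.to_integer/1 drops leading zeros, so 006817 sorts (and prints) as 6817. If the zero-padded form matters for the output, it would need to be padded again when printing.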

Getting started

First, let's create our app:
mix new access_log_app

Install dependencies

Next, we need the HTTPoison library. Add it to the deps list in mix.exs:
# mix.exs
defmodule AccessLogApp.MixProject do
  use Mix.Project
  ...

  # Run "mix help deps" to learn about dependencies.
  defp deps do
    [
      # {:dep_from_hexpm, "~> 0.3.0"},
      # {:dep_from_git, git: "https://github.com/elixir-lang/my_dep.git", tag: "0.1.0"}
      {:httpoison, "~> 1.8"}
    ]
  end

  ...
end
Then fetch the dependencies with mix deps.get:
mix deps.get
Next, create a directory for the access_log_app namespace:
mkdir lib/access_log_app
Now create a file named cli.ex in the access_log_app directory we just created.
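# lib/access_log_app/cli.ex
defmodule AccessLogApp.CLI do
  # Fetches the raw access log and returns {:ok, body} or {:error, reason}.
  def fetch() do
    HTTPoison.get("https://gist.githubusercontent.com/clanchun/2b5e07cda53718ccbf64f62fb31900c8/raw/64be7f018973717dd5faa7be2bfb817f50ed05bb/access.log")
    |> handle_response()
  end

  # A completed request: tag the body according to the status code.
  def handle_response({:ok, %{status_code: status_code, body: body}}) do
    {check_for_error(status_code), body}
  end

  # A transport failure (timeout, DNS, ...) has no status code or body.
  def handle_response({:error, %HTTPoison.Error{reason: reason}}) do
    {:error, reason}
  end

  def check_for_error(200), do: :ok
  def check_for_error(_), do: :error
end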
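One way to check this kind of response handling without touching the network is to feed it hand-built tuples shaped like the ones HTTPoison returns. The sketch below re-declares the response functions in a stand-alone module (ResponseSketch, a made-up name) and matches on plain maps instead of HTTPoison structs so that it runs without the dependency. The sample bodies and the :timeout reason are invented for illustration. It includes a clause for a transport-level failure, which arrives as an {:error, ...} tuple and carries no status code.

```elixir
defmodule ResponseSketch do
  # A completed request: tag the body according to the status code.
  def handle_response({:ok, %{status_code: status_code, body: body}}) do
    {check_for_error(status_code), body}
  end

  # A transport failure (timeout, DNS, refused connection) has no status code.
  def handle_response({:error, %{reason: reason}}) do
    {:error, reason}
  end

  def check_for_error(200), do: :ok
  def check_for_error(_), do: :error
end

# Hand-built stand-ins for what HTTPoison.get/1 would return.
IO.inspect(ResponseSketch.handle_response({:ok, %{status_code: 200, body: "line 1\n"}}))
IO.inspect(ResponseSketch.handle_response({:ok, %{status_code: 404, body: "Not Found"}}))
IO.inspect(ResponseSketch.handle_response({:error, %{reason: :timeout}}))
```

Because a struct is also a map, the `%{status_code: ..., body: ...}` and `%{reason: ...}` patterns here would match real HTTPoison response and error structs as well.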
Start an IEx session with iex -S mix, then recompile and call fetch. We get back a heap of data:
iex -S mix
recompile && AccessLogApp.CLI.fetch
iex(1)> recompile && AccessLogApp.CLI.fetch
Compiling 1 file (.ex)
{:ok,
 "#Fields: timestamp time-taken c-ip filesize s-ip s-port sc-status sc-bytes cs-method cs-uri-stem - rs-duration rs-bytes c-referrer c-user-agent customer-id x-ec_custom-1\n1523756544 3 86.45.165.83 1845784 152.195.141.240 80 TCP_HIT/200 1846031 GET http://example.com/04C0BF/v2/sources/content-owners/sgl-entertainment/275211/v0401185814-1389k.mp4+740005.ts - 0 486 \"-\" \"ItubExoPlayer/2.12.9 (Linux;Android 6.0) ExoPlayerLib/2.4.2\" 49343 \"-\" \n1523756611 58 86.165.81.111 3364824 152.195.141.240 80 TCP_HIT/200 3365071 GET http://example.com/04C0BF/v2.... <> ...}

Creating Environment Variables with Elixir

Update: It turns out the following is an anti-pattern. Instead, you should pass the URL in as a function parameter.
The URL is long, and we might want to change it later for other projects, so let's put it in the application environment (Elixir's per-application configuration). Create a file named config.exs inside a directory named config. Inside the file, import the Config module and set the github_url key for our access_log_app.
# config/config.exs
import Config
config :access_log_app, github_url: "https://gist.githubusercontent.com/clanchun/2b5e07cda53718ccbf64f62fb31900c8/raw/64be7f018973717dd5faa7be2bfb817f50ed05bb/access.log"
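We can then use it in our cli.ex file:
# lib/access_log_app/cli.ex
defmodule AccessLogApp.CLI do
  # Read once, at compile time, from config/config.exs.
  @github_url Application.get_env(:access_log_app, :github_url)

  def fetch() do
    HTTPoison.get(@github_url)
    |> handle_response()
  end

  # A completed request: tag the body according to the status code.
  def handle_response({:ok, %{status_code: status_code, body: body}}) do
    {check_for_error(status_code), body}
  end

  # A transport failure (timeout, DNS, ...) has no status code or body.
  def handle_response({:error, %HTTPoison.Error{reason: reason}}) do
    {:error, reason}
  end

  def check_for_error(200), do: :ok
  def check_for_error(_), do: :error
end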
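For comparison, here is a minimal sketch of the alternative the update note recommends: passing the URL in as a function parameter. The module name FetchSketch and the second parameter http_get are hypothetical. http_get stands for any one-argument function that returns the same tuples as HTTPoison.get/1, which lets the sketch run with a fake instead of a real network call.

```elixir
defmodule FetchSketch do
  # The URL comes in as an argument, so nothing is baked in at compile time.
  # `http_get` is a hypothetical parameter: any one-argument function that
  # returns the same tuples as HTTPoison.get/1.
  def fetch(url, http_get) do
    case http_get.(url) do
      {:ok, %{status_code: 200, body: body}} -> {:ok, body}
      {:ok, %{body: body}} -> {:error, body}
      {:error, %{reason: reason}} -> {:error, reason}
    end
  end
end

# Stand-in for HTTPoison.get/1 so the sketch runs without a network call.
fake_get = fn "https://example.com/access.log" ->
  {:ok, %{status_code: 200, body: "1523756544 3 TCP_HIT/200\n"}}
end

IO.inspect(FetchSketch.fetch("https://example.com/access.log", fake_get))
```

In the real app, the call would look something like FetchSketch.fetch(url, &HTTPoison.get/1), and the caller could still keep the URL in config by reading it at runtime, for example with Application.fetch_env!(:access_log_app, :github_url), before passing it in.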

Next steps

In this post, we covered the requirements for processing the log file, outlined the steps, created our Elixir project, and fetched the data from an external URL. We also moved the URL into the application configuration, so our code is easier to reuse. In the next post, we will take the line-separated text and split each new line into a list item. Subscribe to receive updates!