Elixir – Group by Matching ID

Introduction

This is part six of the nine-post series on Processing a Log File with Elixir. If you find this article helpful, please subscribe and share 🚀

Looking at our list of things to do, the next step is to group by video_id:
  1. Fetch data from URL
  2. Split each new line into a list item
  3. Split each line into list items
  4. Filter items to only contain the URL and TCP_HIT/MISS
  5. Find the six-digit video id from the URL; it should be the first integer in HTTP paths of:
     - "example.com/04C0BF/v2/sources/content-owners/"
     - "example.com/04C0BF/ads/transcodes/"
  6. Group by Video ID
  7. Get Cache Hit and Misses for each Video
  8. Calculate the Cache Hit Misses
  9. Sort by video id
  10. Print to file
Our data is now looking something like this:

[
  [video_id: 275211, tcp: "TCP_HIT/200"],
  [video_id: 326260, tcp: "TCP_HIT/200"],
  [video_id: 398629, tcp: "TCP_HIT/200"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 351421, tcp: "TCP_HIT/200"],
  [video_id: 12410, tcp: "TCP_HIT/200"],
  [video_id: 339342, tcp: "TCP_HIT/200"],
  [video_id: 414098, tcp: "TCP_HIT/200"],
  [video_id: 160842, tcp: "TCP_HIT/206"],
  [video_id: 160842, tcp: "TCP_HIT/206"],
  [video_id: 160842, tcp: "TCP_HIT/206"],
  [video_id: 160842, tcp: "TCP_HIT/206"],
  [video_id: 160842, tcp: "TCP_HIT/206"],
  [video_id: 160842, tcp: "TCP_HIT/206"],
  [video_id: 160842, tcp: "TCP_HIT/206"],
  [video_id: 160842, tcp: "TCP_HIT/206"],
  [video_id: 367665, tcp: "TCP_HIT/200"],
  [video_id: 367706, tcp: "TCP_HIT/200"],
  [video_id: 414098, tcp: "TCP_HIT/200"],
  [video_id: 312985, tcp: "TCP_MISS/200"],
  [video_id: 414098, tcp: "TCP_HIT/200"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 23261, tcp: "TCP_HIT/200"],
  [video_id: 414098, tcp: "TCP_HIT/200"],
  [video_id: 12410, tcp: "TCP_HIT/200"],
  [video_id: 291986, tcp: "TCP_HIT/200"],
  [video_id: 360634, tcp: "TCP_HIT/200"],
  [video_id: 186001, tcp: "TCP_HIT/206"],
  [video_id: 186001, tcp: "TCP_HIT/206"],
  [video_id: 186001, tcp: "TCP_HIT/206"],
  [video_id: 186001, tcp: "TCP_HIT/206"],
  [video_id: 186001, tcp: "TCP_HIT/206"],
  [video_id: 186001, tcp: "TCP_HIT/206"],
  [video_id: 186001, tcp: "TCP_HIT/206"],
  [video_id: 186001, tcp: "TCP_HIT/206"],
  [video_id: 186001, tcp: "TCP_HIT/206"],
  [video_id: 186001, tcp: "TCP_HIT/206"],
  [video_id: 186001, tcp: "TCP_HIT/206"],
  [video_id: 186001, ...],
  [...],
  ...
]
Looking at the Enum.group_by/3 documentation, we see that it returns a map in which each key is a value returned by the key function and each value is the list of items that produced that key. We can write our test to get started:
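defmodule AccessLogAppTest do
  use ExUnit.Case
  # Assumes group_by_id/1 is defined in AccessLogApp.CLI, as in the function further down
  import AccessLogApp.CLI

  test "Groups list by video_id" do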
    list = [
     [video_id: 1, tcp: "TCP_HIT/200"],
     [video_id: 1, tcp: "TCP_HIT/200"],
     [video_id: 1, tcp: "TCP_HIT/206"],
     [video_id: 1, tcp: "TCP_HIT/304"],
     [video_id: 2, tcp: "TCP_HIT/200"],
     [video_id: 2, tcp: "TCP_HIT/200"],
     [video_id: 2, tcp: "TCP_HIT/206"],
     [video_id: 2, tcp: "TCP_HIT/304"],
     [video_id: 3, tcp: "TCP_HIT/200"],
     [video_id: 3, tcp: "TCP_HIT/200"],
     [video_id: 3, tcp: "TCP_HIT/206"],
     [video_id: 3, tcp: "TCP_HIT/304"],
     [video_id: 4, tcp: "TCP_HIT/200"],
     [video_id: 4, tcp: "TCP_HIT/200"],
     [video_id: 4, tcp: "TCP_HIT/206"],
     [video_id: 4, tcp: "TCP_HIT/304"]
    ]
    result = group_by_id(list)
    assert result == %{
      [video_id: 1] => [
       [video_id: 1, tcp: "TCP_HIT/200"],
       [video_id: 1, tcp: "TCP_HIT/200"],
       [video_id: 1, tcp: "TCP_HIT/206"],
       [video_id: 1, tcp: "TCP_HIT/304"]
      ],
      [video_id: 2] => [
       [video_id: 2, tcp: "TCP_HIT/200"],
       [video_id: 2, tcp: "TCP_HIT/200"],
       [video_id: 2, tcp: "TCP_HIT/206"],
       [video_id: 2, tcp: "TCP_HIT/304"]
      ],
      [video_id: 3] => [
       [video_id: 3, tcp: "TCP_HIT/200"],
       [video_id: 3, tcp: "TCP_HIT/200"],
       [video_id: 3, tcp: "TCP_HIT/206"],
       [video_id: 3, tcp: "TCP_HIT/304"]
      ],
      [video_id: 4] => [
       [video_id: 4, tcp: "TCP_HIT/200"],
       [video_id: 4, tcp: "TCP_HIT/200"],
       [video_id: 4, tcp: "TCP_HIT/206"],
       [video_id: 4, tcp: "TCP_HIT/304"]
      ]
    }
  end
end
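Before moving on to the implementation, here is a small sketch of how Enum.group_by behaves on data shaped like ours. The sample items are made up for illustration, and the keys here are plain integers pulled out with Keyword.fetch!/2, which is a simplification and not the `[video_id: id]` key the test above expects. The second call shows the optional value function, which controls what gets stored in each group.

```elixir
# Hypothetical sample data in the same shape as the parsed log items.
items = [
  [video_id: 1, tcp: "TCP_HIT/200"],
  [video_id: 2, tcp: "TCP_MISS/200"],
  [video_id: 1, tcp: "TCP_HIT/206"]
]

# key_fun decides which group each item lands in.
by_id = Enum.group_by(items, fn item -> Keyword.fetch!(item, :video_id) end)

# The optional value_fun (third argument) decides what is kept in each group.
tcp_by_id =
  Enum.group_by(
    items,
    fn item -> Keyword.fetch!(item, :video_id) end,
    fn item -> Keyword.fetch!(item, :tcp) end
  )

IO.inspect(by_id)
IO.inspect(tcp_by_id)
# tcp_by_id is %{1 => ["TCP_HIT/200", "TCP_HIT/206"], 2 => ["TCP_MISS/200"]}
```

Items within a group keep the order they had in the original list, which is why the expected map in the test can list them in input order.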
Our function looks like this:
defmodule AccessLogApp.CLI do
  ...
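  def group_by_id(list) do
    # Each item is a two-element keyword list, so `video_id` binds to the
    # whole {:video_id, id} tuple and the group key becomes [video_id: id].
    Enum.group_by(list, fn [video_id, _] ->
      [video_id]
    end)
  end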
  ...
end 
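One thing worth noting about the pattern `[video_id, _]`: it relies on every item having exactly two entries, with video_id first. As a rough sketch of an alternative, the variant below looks the entry up by name with Keyword.take/2 instead. The module name `GroupSketch` and the sample items are hypothetical and not part of the series' code.

```elixir
defmodule GroupSketch do
  # Hypothetical variant of group_by_id/1: Keyword.take/2 keeps only the
  # :video_id entry, so the group key is still [video_id: id] no matter how
  # many entries an item has or where :video_id appears in it.
  def group_by_id(list) do
    Enum.group_by(list, fn item -> Keyword.take(item, [:video_id]) end)
  end
end

# Made-up items; the second one has its entries in a different order.
items = [
  [video_id: 1, tcp: "TCP_HIT/200"],
  [tcp: "TCP_MISS/200", video_id: 2],
  [video_id: 1, tcp: "TCP_HIT/206"]
]

grouped = GroupSketch.group_by_id(items)
IO.inspect(grouped)
```

For the data in this series the positional match is fine, since every item has exactly the two entries shown earlier.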
Our function simply enumerates over the list and groups the items by the video_id entry returned from the key_fun. Easy peas!

That is it for today! Tomorrow we will be wrapping this up with step 7 "Calculating by Percentage". If you like, please share and subscribe!