Elixir - Group by Matching ID

Introduction

This is part six of the nine-post series on Processing a Log File with Elixir. If you find this article helpful, please subscribe and share 🚀

Looking at our list of things to do, the next step is to group by video_id.

  1. Fetch data from URL
  2. Split each new line into a list item
  3. Split each line into list items
  4. Filter items to only contain the URL and TCP_HIT/MISS
  5. Find the six-digit video id from the URL; it should be the first integer in HTTP paths of:
    "example.com/04C0BF/v2/sources/content-owners/" and
    "example.com/04C0BF/ads/transcodes/"

  6. Group by Video ID
  7. Get Cache Hit and Misses for each Video
  8. Calculate the Cache Hit Misses
  9. Sort by video id
  10. Print to file

Our data is now looking something like this:


[
  [video_id: 275211, tcp: "TCP_HIT/200"],
  [video_id: 326260, tcp: "TCP_HIT/200"],
  [video_id: 398629, tcp: "TCP_HIT/200"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 351421, tcp: "TCP_HIT/200"],
  [video_id: 12410, tcp: "TCP_HIT/200"],
  [video_id: 339342, tcp: "TCP_HIT/200"],
  [video_id: 414098, tcp: "TCP_HIT/200"],
  [video_id: 160842, tcp: "TCP_HIT/206"],
  [video_id: 160842, tcp: "TCP_HIT/206"],
  [video_id: 160842, tcp: "TCP_HIT/206"],
  [video_id: 160842, tcp: "TCP_HIT/206"],
  [video_id: 160842, tcp: "TCP_HIT/206"],
  [video_id: 160842, tcp: "TCP_HIT/206"],
  [video_id: 160842, tcp: "TCP_HIT/206"],
  [video_id: 160842, tcp: "TCP_HIT/206"],
  [video_id: 367665, tcp: "TCP_HIT/200"],
  [video_id: 367706, tcp: "TCP_HIT/200"],
  [video_id: 414098, tcp: "TCP_HIT/200"],
  [video_id: 312985, tcp: "TCP_MISS/200"],
  [video_id: 414098, tcp: "TCP_HIT/200"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 398629, tcp: "TCP_HIT/206"],
  [video_id: 23261, tcp: "TCP_HIT/200"],
  [video_id: 414098, tcp: "TCP_HIT/200"],
  [video_id: 12410, tcp: "TCP_HIT/200"],
  [video_id: 291986, tcp: "TCP_HIT/200"],
  [video_id: 360634, tcp: "TCP_HIT/200"],
  [video_id: 186001, tcp: "TCP_HIT/206"],
  [video_id: 186001, tcp: "TCP_HIT/206"],
  [video_id: 186001, tcp: "TCP_HIT/206"],
  [video_id: 186001, tcp: "TCP_HIT/206"],
  [video_id: 186001, tcp: "TCP_HIT/206"],
  [video_id: 186001, tcp: "TCP_HIT/206"],
  [video_id: 186001, tcp: "TCP_HIT/206"],
  [video_id: 186001, tcp: "TCP_HIT/206"],
  [video_id: 186001, tcp: "TCP_HIT/206"],
  [video_id: 186001, tcp: "TCP_HIT/206"],
  [video_id: 186001, tcp: "TCP_HIT/206"],
  [video_id: 186001, ...],
  [...],
  ...
]

Looking at the Enum.group_by/3 documentation, we see that it returns a map. Each key is a value returned by the key function, and each value is the list of items that produced that key.
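
To make that return shape concrete, here is a minimal sketch with made-up data (the words list is hypothetical and has nothing to do with our log file). It also shows the optional third argument, value_fun, which transforms each item before it is stored in its group:

```elixir
# Hypothetical sample data, used only to show the shape of the result.
words = ["ant", "bee", "cat", "wasp", "moth"]

# key_fun alone: each key maps to the list of items that produced it.
by_length = Enum.group_by(words, &String.length/1)
IO.inspect(by_length)
# => %{3 => ["ant", "bee", "cat"], 4 => ["wasp", "moth"]}

# key_fun plus value_fun: the stored values are transformed as well.
initials = Enum.group_by(words, &String.length/1, &String.first/1)
IO.inspect(initials)
# => %{3 => ["a", "b", "c"], 4 => ["w", "m"]}
```

Items keep their original order inside each group, which is why the test below can compare against plain lists.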

We can write our test to get started:

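defmodule AccessLogAppTest do
  use ExUnit.Case
  # Assumes group_by_id/1 is defined in AccessLogApp.CLI, as shown further down.
  import AccessLogApp.CLI, only: [group_by_id: 1]
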
  test "Groups list by video_id" do
    list = [
     [video_id: 1, tcp: "TCP_HIT/200"],
     [video_id: 1, tcp: "TCP_HIT/200"],
     [video_id: 1, tcp: "TCP_HIT/206"],
     [video_id: 1, tcp: "TCP_HIT/304"],
     [video_id: 2, tcp: "TCP_HIT/200"],
     [video_id: 2, tcp: "TCP_HIT/200"],
     [video_id: 2, tcp: "TCP_HIT/206"],
     [video_id: 2, tcp: "TCP_HIT/304"],
     [video_id: 3, tcp: "TCP_HIT/200"],
     [video_id: 3, tcp: "TCP_HIT/200"],
     [video_id: 3, tcp: "TCP_HIT/206"],
     [video_id: 3, tcp: "TCP_HIT/304"],
     [video_id: 4, tcp: "TCP_HIT/200"],
     [video_id: 4, tcp: "TCP_HIT/200"],
     [video_id: 4, tcp: "TCP_HIT/206"],
     [video_id: 4, tcp: "TCP_HIT/304"]
    ]
    result = group_by_id(list)
    assert result == %{
      [video_id: 1] => [
       [video_id: 1, tcp: "TCP_HIT/200"],
       [video_id: 1, tcp: "TCP_HIT/200"],
       [video_id: 1, tcp: "TCP_HIT/206"],
       [video_id: 1, tcp: "TCP_HIT/304"]
      ],
      [video_id: 2] => [
       [video_id: 2, tcp: "TCP_HIT/200"],
       [video_id: 2, tcp: "TCP_HIT/200"],
       [video_id: 2, tcp: "TCP_HIT/206"],
       [video_id: 2, tcp: "TCP_HIT/304"]
      ],
      [video_id: 3] => [
       [video_id: 3, tcp: "TCP_HIT/200"],
       [video_id: 3, tcp: "TCP_HIT/200"],
       [video_id: 3, tcp: "TCP_HIT/206"],
       [video_id: 3, tcp: "TCP_HIT/304"]
      ],
      [video_id: 4] => [
       [video_id: 4, tcp: "TCP_HIT/200"],
       [video_id: 4, tcp: "TCP_HIT/200"],
       [video_id: 4, tcp: "TCP_HIT/206"],
       [video_id: 4, tcp: "TCP_HIT/304"]
      ]
    }
  end
end

Our function looks like this:

defmodule AccessLogApp.CLI do
  ...
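  def group_by_id(list) do
    # Each item is a two-element keyword list. The `video_id` variable binds
    # to the whole {:video_id, id} tuple, so the group key is [video_id: id].
    Enum.group_by(list, fn [video_id, _] ->
      [video_id]
    end)
  end
  ...
end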

Our function simply enumerates over the list and groups the items by the video_id entry that key_fun returns for each one. Easy peasy!
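
If the [video_id: 1] style keys turn out to be awkward in later steps, one possible alternative is to group on the integer id and keep only the tcp strings. This is just a sketch, not what the rest of the series uses: GroupSketch and group_tcp_by_id are hypothetical names, and it assumes every item has both a :video_id and a :tcp key.

```elixir
defmodule GroupSketch do
  # Hypothetical helper: keys are plain integers, values are tcp strings.
  def group_tcp_by_id(list) do
    Enum.group_by(
      list,
      # key_fun: pull the integer id out of the keyword list
      fn item -> Keyword.fetch!(item, :video_id) end,
      # value_fun: keep only the tcp string for each item
      fn item -> Keyword.fetch!(item, :tcp) end
    )
  end
end

sample = [
  [video_id: 1, tcp: "TCP_HIT/200"],
  [video_id: 2, tcp: "TCP_MISS/200"],
  [video_id: 1, tcp: "TCP_HIT/206"]
]

grouped = GroupSketch.group_tcp_by_id(sample)
IO.inspect(grouped)
# => %{1 => ["TCP_HIT/200", "TCP_HIT/206"], 2 => ["TCP_MISS/200"]}
```

Keyword.fetch!/2 raises a KeyError when a key is missing, so a malformed row would fail loudly here instead of ending up in a nil group.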

That is it for today!

Tomorrow we will be wrapping this up with step 7 “Calculating by Percentage”.

If you like, please share and subscribe!

Published
Categorized as Elixir

By mchavez

Michael Chavez is a web and software developer from San Francisco, California. His experience spans almost a decade, working with San Francisco Bay Area design and development agencies, and high-profile Silicon Valley start-ups and enterprises. After studying Multimedia at City College of San Francisco, Michael taught himself JavaScript, Node.js, and PHP, and founded the web development consultancy Space-Rocket. Michael is currently working with the Elixir programming language.
