Get Video ID from URL using Regex Match and Elixir

Introduction

This is part five of the nine-post series on Processing a Log File with Elixir. If you find this article helpful, please subscribe and share 🚀 In the last post, Filtering a List Using Regex Match and Elixir, we trimmed our list to contain only the paths "example.com/04C0BF/v2/sources/content-owners/" and "example.com/04C0BF/ads/transcodes/". Looking at our remaining steps, we see that we still need to reduce each URL to just its 4-, 5-, or 6-digit video_id.

  1. Fetch data from URL
  2. Split each new line into a list item
  3. Split each line into list items
  4. Filter items to only contain the URL and TCP_HIT/MISS
  5. Find the Video ID (4 to 6 digits) in the URL; it should be the first integer in HTTP paths of:
    • "example.com/04C0BF/v2/sources/content-owners/"
    • "example.com/04C0BF/ads/transcodes/"
  6. Group by Video ID
  7. Get Cache Hit and Misses for each Video
  8. Calculate the Cache Hit Misses
  9. Sort by Video ID
  10. Print to file

Our data is still looking something like this:

terminal

[
  ....
  [
    http: "http://example.com/04C0BF/v2/sources/content-owners/sgl-entertainment/275211/v0401185814-1389k.mp4+740005.ts",
    tcp: "TCP_HIT/200"
  ],
  [
    http: "http://example.com/04C0BF/v2/sources/content-owners/sgl-entertainment/326260/v20169101326-1256x544-3063k.mp4+3713710.ts",
    tcp: "TCP_HIT/200"
  ],
  [
    http: "http://example.com/04C0BF/v2/sources/content-owners/cinedigm-itub/398629/v201711170053-2061k.mp4+4582327.ts",
    tcp: "TCP_HIT/200"
  ],
  [
    http: "http://example.com/04C0BF/v2/sources/content-owners/cinedigm-itub/398629/v201711170053-2061k.mp4+4582327.ts",
    tcp: "TCP_HIT/206"
  ],
  ...
]

Write a Test

Let's start by writing a simple test:

test/access_log_app_test.exs

defmodule AccessLogAppTest do
  use ExUnit.Case

  test "Gets first integer in URL path" do
    data = [
      [
        tcp_hit: "TCP_HIT/200",
        http: "http://example.com/ABCD/a/b/c/123456/somefile.mp4.ts"
      ],
      [
        tcp_hit: "TCP_HIT/206",
        http: "http://example.com/ABCD/e/f/789012/someotherfile.mp4.ts"
      ]
    ]

    # Call the function through its module so the test does not rely on an import.
    result = AccessLogApp.CLI.get_id_from_url_path(data)

    assert result == [
      [
        tcp_hit: "TCP_HIT/200",
        video_id: 123456
      ],
      [
        tcp_hit: "TCP_HIT/206",
        video_id: 789012
      ]
    ]
  end
end

From the above test we see that we need a function that takes the list of entries and replaces each :http URL with a video_id, which is the first integer in the URL path. To do this, we split each URL by "/" and use a regex to keep only the first chunk that starts with a 4-, 5-, or 6-digit number. Every other key/value pair in the entry is passed through unchanged.
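
Before looking at the full solution, here is a minimal sketch of the split-and-match idea applied to a single URL. The module and function names (VideoIdSketch, first_id_segment) are made up for illustration and are not part of AccessLogApp, and the pattern ~r/^\d{4,6}\b/ is just one compact way to say "starts with 4 to 6 digits".

```elixir
defmodule VideoIdSketch do
  # Hypothetical helper, not part of AccessLogApp: returns the first path
  # segment that starts with a 4-, 5-, or 6-digit number, or nil if none does.
  def first_id_segment(url) do
    url
    |> String.split("/")
    |> Enum.find(&Regex.match?(~r/^\d{4,6}\b/, &1))
  end
end

url = "http://example.com/ABCD/a/b/c/123456/somefile.mp4.ts"

# Splitting on "/" yields:
# ["http:", "", "example.com", "ABCD", "a", "b", "c", "123456", "somefile.mp4.ts"]
IO.inspect(String.split(url, "/"))

# Only "123456" starts with a run of 4 to 6 digits, so this prints "123456".
IO.inspect(VideoIdSketch.first_id_segment(url))
```

Note that the segment comes back as a string; turning it into an integer is a separate step.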

Solution

lib/access_log_app/CLI.ex

defmodule AccessLogApp.CLI do
  ...

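  def get_id_from_url_path(list) do
    Enum.map(list, fn entry ->
      Enum.map(entry, fn item ->
        case item do
          # Swap the {:http, url} pair for a {:video_id, integer} pair.
          {:http, url} ->
            [video_id | _] =
              url
              |> String.split("/")
              |> Enum.map(fn segment ->
                # Keep segments that start with a 4-, 5-, or 6-digit number;
                # blank out everything else.
                case Regex.match?(~r/^\d{4,6}\b/, segment) do
                  true -> segment
                  _ -> ""
                end
              end)
              |> Enum.filter(&(!is_blank(&1)))

            # Integer.parse/1 returns {integer, rest}; keep only the integer.
            {:video_id, elem(Integer.parse(video_id), 0)}

          # Leave every other key/value pair untouched.
          {k, v} ->
            {k, v}

          _ ->
            ""
        end
      end)
    end)
  end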

  ...

  def is_blank(nil), do: true
  def is_blank(val) when val == %{}, do: true
  def is_blank(val) when val == [], do: true
  def is_blank(val) when is_binary(val), do: String.trim(val) == ""
  def is_blank(_val), do: false

  ...

end

terminal

iex -S mix

terminal

iex(1)> AccessLogApp.CLI.fetch
Compiling 1 file (.ex)
{:ok,
 [
   [video_id: 275211, tcp: "TCP_HIT/200"],
   [video_id: 326260, tcp: "TCP_HIT/200"],
   [video_id: 398629, tcp: "TCP_HIT/200"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 351421, tcp: "TCP_HIT/200"],
   [video_id: 12410, tcp: "TCP_HIT/200"],
   [video_id: 339342, tcp: "TCP_HIT/200"],
   [video_id: 414098, tcp: "TCP_HIT/200"],
   [video_id: 160842, tcp: "TCP_HIT/206"],
   [video_id: 160842, tcp: "TCP_HIT/206"],
   [video_id: 160842, tcp: "TCP_HIT/206"],
   [video_id: 160842, tcp: "TCP_HIT/206"],
   [video_id: 160842, tcp: "TCP_HIT/206"],
   [video_id: 160842, tcp: "TCP_HIT/206"],
   [video_id: 160842, tcp: "TCP_HIT/206"],
   [video_id: 160842, tcp: "TCP_HIT/206"],
   [video_id: 367665, tcp: "TCP_HIT/200"],
   [video_id: 367706, tcp: "TCP_HIT/200"],
   [video_id: 414098, tcp: "TCP_HIT/200"],
   [video_id: 312985, tcp: "TCP_MISS/200"],
   [video_id: 414098, tcp: "TCP_HIT/200"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 23261, tcp: "TCP_HIT/200"],
   [video_id: 414098, tcp: "TCP_HIT/200"],
   [video_id: 12410, tcp: "TCP_HIT/200"],
   [video_id: 291986, tcp: "TCP_HIT/200"],
   [video_id: 360634, tcp: "TCP_HIT/200"],
   [video_id: 186001, tcp: "TCP_HIT/206"],
   [video_id: 186001, tcp: "TCP_HIT/206"],
   [video_id: 186001, tcp: "TCP_HIT/206"],
   [video_id: 186001, tcp: "TCP_HIT/206"],
   [video_id: 186001, tcp: "TCP_HIT/206"],
   [video_id: 186001, tcp: "TCP_HIT/206"],
   [video_id: 186001, tcp: "TCP_HIT/206"],
   [video_id: 186001, tcp: "TCP_HIT/206"],
   [video_id: 186001, tcp: "TCP_HIT/206"],
   [video_id: 186001, ...],
   [...],
   ...
 ]}
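
Every URL in this sample contains an ID segment, so the `[video_id | _]` pattern match in the solution always succeeds. If a URL had no qualifying segment, the filtered list would be empty and that match would raise a MatchError. As a hedged sketch only, here is one way a single URL could be handled without raising; SafeVideoIdSketch and parse_video_id are hypothetical names, not functions from the series.

```elixir
defmodule SafeVideoIdSketch do
  # Hypothetical variant, not the article's function: returns {:ok, id} for
  # the first segment that starts with 4 to 6 digits, or :error if none does.
  def parse_video_id(url) do
    url
    |> String.split("/")
    |> Enum.find(&Regex.match?(~r/^\d{4,6}\b/, &1))
    |> case do
      nil ->
        :error

      segment ->
        # Integer.parse/1 returns {integer, rest}; keep only the integer.
        {id, _rest} = Integer.parse(segment)
        {:ok, id}
    end
  end
end

# A URL with an ID segment gives {:ok, 789012}.
IO.inspect(SafeVideoIdSketch.parse_video_id("http://example.com/ABCD/e/f/789012/someotherfile.mp4.ts"))

# A URL without one gives :error instead of raising.
IO.inspect(SafeVideoIdSketch.parse_video_id("http://example.com/ABCD/no-id-here.ts"))
```

Whether to raise or return a tagged tuple depends on how much you trust the earlier filtering step; since the previous post already narrowed the list to the two known paths, the stricter match used in the solution is a reasonable choice here.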

Conclusion

In this post, we finally distilled a URL value to an integer value, all while keeping the remaining key/value pairs in the list intact. In the next post, we group the TCP_HIT/MISS values by their video_id. If you like this post, please share and subscribe!
