Get Video ID from URL using Regex Match and Elixir

Introduction

This is part five of a nine-post series on Processing a Log File with Elixir. If you find this article helpful, please subscribe and share 🚀

In the last post, Filtering a List Using Regex Match and Elixir, we trimmed our list so that it only contains URLs with the paths "example.com/04C0BF/v2/sources/content-owners/" and "example.com/04C0BF/ads/transcodes/". Looking at our steps, the next one is to reduce each URL to just its 4-, 5-, or 6-digit video_id:
    1. Fetch data from URL
    2. Split each new line into a list item
    3. Split each line into list items
    4. Filter items to only contain the URL and TCP_HIT/MISS
    5. Find the 4- to 6-digit video id from the URL; it should be the first integer in HTTP paths of "example.com/04C0BF/v2/sources/content-owners/" and "example.com/04C0BF/ads/transcodes/"
    6. Group by Video ID
    7. Get Cache Hit and Misses for each Video
    8. Calculate the Cache Hit Misses
    9. Sort by video id
    10. Print to file
Our data still looks something like this:
[
  ...
  [
    http: "http://example.com/04C0BF/v2/sources/content-owners/sgl-entertainment/275211/v0401185814-1389k.mp4+740005.ts",
    tcp: "TCP_HIT/200"
  ],
  [
    http: "http://example.com/04C0BF/v2/sources/content-owners/sgl-entertainment/326260/v20169101326-1256x544-3063k.mp4+3713710.ts",
    tcp: "TCP_HIT/200"
  ],
  [
    http: "http://example.com/04C0BF/v2/sources/content-owners/cinedigm-itub/398629/v201711170053-2061k.mp4+4582327.ts",
    tcp: "TCP_HIT/200"
  ],
  [
    http: "http://example.com/04C0BF/v2/sources/content-owners/cinedigm-itub/398629/v201711170053-2061k.mp4+4582327.ts",
    tcp: "TCP_HIT/206"
  ],
  ...
]
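
Each entry above is a keyword list, which is Elixir shorthand for a list of two-element tuples whose first element is an atom. The short sketch below uses shortened, made-up values in place of real log lines to show that equivalence. It is the reason the code later in this post can match on tuples such as {:http, url}.

```elixir
# Shortened, made-up values; only the shape matches one entry of our data.
entry = [http: "http://example.com/a/123456/file.ts", tcp: "TCP_HIT/200"]

# A keyword list is the same thing as a list of {atom, value} tuples.
same_entry = [{:http, "http://example.com/a/123456/file.ts"}, {:tcp, "TCP_HIT/200"}]
IO.inspect(entry == same_entry)
# true

# Enumerating an entry yields one {key, value} tuple at a time.
keys = Enum.map(entry, fn {key, _value} -> key end)
IO.inspect(keys)
# [:http, :tcp]

# A single value can be read with Keyword.get/2.
IO.inspect(Keyword.get(entry, :tcp))
# "TCP_HIT/200"
```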

Write a Test

Let's start by writing a simple test:
# test/access_log_app_test.exs
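defmodule AccessLogAppTest do
  # ExUnit.Case provides the test/2 and assert/1 macros.
  use ExUnit.Case
  # Lets the test call get_id_from_url_path/1 without the module prefix.
  import AccessLogApp.CLI
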
  test "Gets first integer in URL path" do
    data = [
      [
        tcp_hit: "TCP_HIT/200",
        http: "http://example.com/ABCD/a/b/c/123456/somefile.mp4.ts"
      ],
      [
        tcp_hit: "TCP_HIT/206",
        http: "http://example.com/ABCD/e/f/789012/someotherfile.mp4.ts"
      ]
    ]
    result = get_id_from_url_path(data)
    assert result == [
      [
        tcp_hit: "TCP_HIT/200",
        video_id: 123456
      ],
      [
        tcp_hit: "TCP_HIT/206",
        video_id: 789012
      ]
    ]
  end
end

From the test above, we see that we need a function that takes our list of entries and replaces each URL with its video_id, which is the first integer in the URL path. To do this, we can split each URL string on "/" and use a regex to keep only the first chunk that is an integer.
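
Before handling the whole list, here is a minimal sketch of that idea for a single URL. The module and function names (VideoIdSketch, first_id) are hypothetical and not part of AccessLogApp. The sketch also assumes the id is a whole path segment of 4 to 6 digits, and it returns nil when no segment qualifies.

```elixir
defmodule VideoIdSketch do
  # Hypothetical helper: returns the first path segment made up of
  # exactly 4 to 6 digits as an integer, or nil if there is none.
  def first_id(url) do
    url
    |> String.split("/")
    |> Enum.find(&Regex.match?(~r/\A\d{4,6}\z/, &1))
    |> case do
      nil -> nil
      segment -> String.to_integer(segment)
    end
  end
end

IO.inspect(VideoIdSketch.first_id("http://example.com/ABCD/a/b/c/123456/somefile.mp4.ts"))
# 123456
IO.inspect(VideoIdSketch.first_id("http://example.com/ABCD/no-id-here.ts"))
# nil
```

Enum.find/2 stops at the first matching segment, so later numeric segments in the path are ignored.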

Solution

# lib/access_log_app/CLI.ex
defmodule AccessLogApp.CLI do
  ...

  def get_id_from_url_path(list) do
    Enum.map(list, fn entry ->
      Enum.map(entry, fn items ->
        case items do
          {:http, url} ->
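            # Split the URL into path segments and blank out every segment
            # that does not start with a 4-, 5-, or 6-digit number.
            [video_id | _] =
              url
              |> String.split("/")
              |> Enum.map(fn keep_if_int ->
                case Regex.match?(~r/^\d{4,6}\b/, keep_if_int) do
                  true -> keep_if_int
                  _ -> ""
                end
              end)
              # Drop the blanks; the first remaining segment is the video id.
              |> Enum.filter(&(!is_blank(&1)))

            # Integer.parse/1 returns {integer, rest}; keep only the integer.
            {:video_id, elem(Integer.parse(video_id), 0)}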
          {k, v} -> {k, v}
          _ -> ""
        end
      end)
    end)
  end

  ...

  def is_blank(nil), do: true
  def is_blank(val) when val == %{}, do: true
  def is_blank(val) when val == [], do: true
  def is_blank(val) when is_binary(val), do: String.trim(val) == ""
  def is_blank(_val), do: false

  ...

end

How it works

The function above uses Enum.map to visit each entry in the list. To keep the function reusable, in case an entry holds more data than the URL and the TCP hit/miss, we Enum.map again over the elements of each entry rather than accessing specific keys directly. A case statement then pattern matches on tuples whose key is :http. For those, we String.split/2 the URL on "/", keep the chunks that start with a 4-, 5-, or 6-digit number, and turn every other chunk into an empty string. A custom function, is_blank, then removes the blanks. Finally, we take the first remaining chunk and return it as a parsed integer in a new {:video_id, id} tuple. All other items are returned as their original key-value tuple.

Run the tests and they pass!
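
To make those steps concrete, the sketch below runs them one at a time on the first URL from the test. For brevity it folds the "map to empty string, then remove blanks" steps into a single Enum.filter/2 and uses a condensed form of the same 4-to-6-digit pattern, so treat it as an illustration of the intermediate values, not as the function itself.

```elixir
# The first URL from the test, processed one step at a time.
url = "http://example.com/ABCD/a/b/c/123456/somefile.mp4.ts"

# Step 1: split the URL on "/".
segments = String.split(url, "/")
IO.inspect(segments)
# ["http:", "", "example.com", "ABCD", "a", "b", "c", "123456", "somefile.mp4.ts"]

# Step 2: keep the segments that start with 4, 5, or 6 digits
# followed by a word boundary.
candidates = Enum.filter(segments, &Regex.match?(~r/^\d{4,6}\b/, &1))
IO.inspect(candidates)
# ["123456"]

# Step 3: take the first candidate and parse it. Integer.parse/1 returns
# {integer, rest_of_string}, so elem(..., 0) extracts the integer.
[video_id | _] = candidates
IO.inspect(Integer.parse(video_id))
# {123456, ""}
IO.inspect(elem(Integer.parse(video_id), 0))
# 123456
```

Note that the match in step 3 raises a MatchError when candidates is empty, so this approach relies on every remaining URL containing an id.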

iex -S mix

iex(1)> AccessLogApp.CLI.fetch
Compiling 1 file (.ex)
{:ok,
 [
   [video_id: 275211, tcp: "TCP_HIT/200"],
   [video_id: 326260, tcp: "TCP_HIT/200"],
   [video_id: 398629, tcp: "TCP_HIT/200"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 351421, tcp: "TCP_HIT/200"],
   [video_id: 12410, tcp: "TCP_HIT/200"],
   [video_id: 339342, tcp: "TCP_HIT/200"],
   [video_id: 414098, tcp: "TCP_HIT/200"],
   [video_id: 160842, tcp: "TCP_HIT/206"],
   [video_id: 160842, tcp: "TCP_HIT/206"],
   [video_id: 160842, tcp: "TCP_HIT/206"],
   [video_id: 160842, tcp: "TCP_HIT/206"],
   [video_id: 160842, tcp: "TCP_HIT/206"],
   [video_id: 160842, tcp: "TCP_HIT/206"],
   [video_id: 160842, tcp: "TCP_HIT/206"],
   [video_id: 160842, tcp: "TCP_HIT/206"],
   [video_id: 367665, tcp: "TCP_HIT/200"],
   [video_id: 367706, tcp: "TCP_HIT/200"],
   [video_id: 414098, tcp: "TCP_HIT/200"],
   [video_id: 312985, tcp: "TCP_MISS/200"],
   [video_id: 414098, tcp: "TCP_HIT/200"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 398629, tcp: "TCP_HIT/206"],
   [video_id: 23261, tcp: "TCP_HIT/200"],
   [video_id: 414098, tcp: "TCP_HIT/200"],
   [video_id: 12410, tcp: "TCP_HIT/200"],
   [video_id: 291986, tcp: "TCP_HIT/200"],
   [video_id: 360634, tcp: "TCP_HIT/200"],
   [video_id: 186001, tcp: "TCP_HIT/206"],
   [video_id: 186001, tcp: "TCP_HIT/206"],
   [video_id: 186001, tcp: "TCP_HIT/206"],
   [video_id: 186001, tcp: "TCP_HIT/206"],
   [video_id: 186001, tcp: "TCP_HIT/206"],
   [video_id: 186001, tcp: "TCP_HIT/206"],
   [video_id: 186001, tcp: "TCP_HIT/206"],
   [video_id: 186001, tcp: "TCP_HIT/206"],
   [video_id: 186001, tcp: "TCP_HIT/206"],
   [video_id: 186001, ...],
   [...],
   ...
 ]}

Conclusion

In this post, we finally distilled each URL value down to an integer video_id while keeping the remaining key-value pairs in the list intact. In the next post, we group the TCP_HIT/MISS values by their video_id. If you like this post, please share and subscribe!