Introduction
This is part five of the nine-post series on Processing a Log File with Elixir. If you find this article helpful, please subscribe and share 🚀
In the last post, Filtering a List Using Regex Match and Elixir, we trimmed our list to contain only the paths "example.com/04C0BF/v2/sources/content-owners/" and "example.com/04C0BF/ads/transcodes/".
Looking at our remaining steps, we see that we still need to trim the URL down to just the 4-, 5-, or 6-digit video_id.
- Fetch data from URL
- Split each new line into a list item
- Split each line into list items
- Filter items to only contain the URL and TCP_HIT/MISS
- Find the video id (four to six digits) in the URL; it should be the first integer in HTTP paths of "example.com/04C0BF/v2/sources/content-owners/" and "example.com/04C0BF/ads/transcodes/"
- Group by Video ID
- Get Cache Hit and Misses for each Video
- Calculate the Cache Hit Misses
- Sort by video id
- Print to file
Our data still looks something like this:
[
  ...
  [
    http: "http://example.com/04C0BF/v2/sources/content-owners/sgl-entertainment/275211/v0401185814-1389k.mp4+740005.ts",
    tcp: "TCP_HIT/200"
  ],
  [
    http: "http://example.com/04C0BF/v2/sources/content-owners/sgl-entertainment/326260/v20169101326-1256x544-3063k.mp4+3713710.ts",
    tcp: "TCP_HIT/200"
  ],
  [
    http: "http://example.com/04C0BF/v2/sources/content-owners/cinedigm-itub/398629/v201711170053-2061k.mp4+4582327.ts",
    tcp: "TCP_HIT/200"
  ],
  [
    http: "http://example.com/04C0BF/v2/sources/content-owners/cinedigm-itub/398629/v201711170053-2061k.mp4+4582327.ts",
    tcp: "TCP_HIT/206"
  ],
  ...
]
Write a Test
Let's start by writing a simple test:
# test/access_log_app_test.exs
defmodule AccessLogAppTest do
  use ExUnit.Case

  test "Gets first integer in URL path" do
    data = [
      [
        tcp_hit: "TCP_HIT/200",
        http: "http://example.com/ABCD/a/b/c/123456/somefile.mp4.ts"
      ],
      [
        tcp_hit: "TCP_HIT/206",
        http: "http://example.com/ABCD/e/f/789012/someotherfile.mp4.ts"
      ]
    ]

    # The function lives in AccessLogApp.CLI, so call it by its full name
    result = AccessLogApp.CLI.get_id_from_url_path(data)

    assert result == [
             [
               tcp_hit: "TCP_HIT/200",
               video_id: 123456
             ],
             [
               tcp_hit: "TCP_HIT/206",
               video_id: 789012
             ]
           ]
  end
end
From the above test we see that we need a function that takes a URL and finds the video_id, which is the first integer in the URL path.
To do this, we can split the URL string on "/" and use a regex to keep only the first chunk that starts with a 4-, 5-, or 6-digit integer.
Solution
# lib/access_log_app/CLI.ex
defmodule AccessLogApp.CLI do
  ...
  def get_id_from_url_path(list) do
    Enum.map(list, fn entry ->
      Enum.map(entry, fn items ->
        case items do
          {:http, url} ->
            # Blank out every path chunk that does not start with a
            # 4-, 5-, or 6-digit number, then drop the blanks
            [video_id | _] =
              url
              |> String.split("/")
              |> Enum.map(fn keep_if_int ->
                case Regex.match?(~r(\b^\d{6}\b|\b^\d{5}\b|\b^\d{4}\b), keep_if_int) do
                  true -> keep_if_int
                  _ -> ""
                end
              end)
              |> Enum.filter(&(!is_blank(&1)))

            # Integer.parse/1 returns {integer, rest}; keep the integer
            {:video_id, elem(Integer.parse(video_id), 0)}

          # Any other key-value pair is returned unchanged
          {k, v} ->
            {k, v}

          _ ->
            ""
        end
      end)
    end)
  end
  ...
  def is_blank(nil), do: true
  def is_blank(val) when val == %{}, do: true
  def is_blank(val) when val == [], do: true
  def is_blank(val) when is_binary(val), do: String.trim(val) == ""
  def is_blank(_val), do: false
  ...
end
How it works
The above function uses Enum.map over the list to access each entry. To make the function more reusable, in case an entry holds more data than the tcp_hit/miss value, we call Enum.map again to visit each element in each entry, as opposed to using map.key notation. We then use a case statement to pattern match on items that are tuples with a left-hand value of :http. From there, we call String.split/2 on "/" and keep the chunks that match a 4-, 5-, or 6-digit integer, replacing every other chunk with an empty string. We then use a custom function, is_blank, to remove the blanks. Finally, we return the video_id as a parsed integer inside a new tuple. All other items are returned as their original key-value pair in tuple form.
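The "rewrite one pair, pass the rest through" behavior can also be expressed with function clauses instead of a case statement. The following is a condensed variant for illustration only, not the code used in this series: EntrySketch, put_video_ids/1, and convert/1 are hypothetical names, and it swaps the split-and-filter steps for a single Regex.run/2 capture that assumes the id sits between two slashes.

```elixir
defmodule EntrySketch do
  # Hypothetical condensed variant; names are illustrative, not from the app.
  def put_video_ids(entries) do
    Enum.map(entries, fn entry -> Enum.map(entry, &convert/1) end)
  end

  # Rewrite only the {:http, url} pair. Assumes the URL contains a 4 to 6
  # digit segment wrapped in slashes; otherwise the match below raises.
  defp convert({:http, url}) do
    [_full, digits] = Regex.run(~r|/(\d{4,6})/|, url)
    {:video_id, String.to_integer(digits)}
  end

  # Every other key-value pair passes through unchanged.
  defp convert(pair), do: pair
end

data = [
  [tcp: "TCP_HIT/200", http: "http://example.com/ABCD/a/b/c/123456/somefile.mp4.ts"],
  [tcp: "TCP_MISS/200", http: "http://example.com/ABCD/e/f/789012/someotherfile.mp4.ts"]
]

IO.inspect(EntrySketch.put_video_ids(data))
# prints [[tcp: "TCP_HIT/200", video_id: 123456], [tcp: "TCP_MISS/200", video_id: 789012]]
```

Because the catch-all clause returns each pair untouched, extra keys added to an entry later would survive this step without any code changes.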
Run the tests and they pass!
iex -S mix
iex(1)> AccessLogApp.CLI.fetch
Compiling 1 file (.ex)
{:ok,
[
[video_id: 275211, tcp: "TCP_HIT/200"],
[video_id: 326260, tcp: "TCP_HIT/200"],
[video_id: 398629, tcp: "TCP_HIT/200"],
[video_id: 398629, tcp: "TCP_HIT/206"],
[video_id: 398629, tcp: "TCP_HIT/206"],
[video_id: 398629, tcp: "TCP_HIT/206"],
[video_id: 398629, tcp: "TCP_HIT/206"],
[video_id: 398629, tcp: "TCP_HIT/206"],
[video_id: 398629, tcp: "TCP_HIT/206"],
[video_id: 351421, tcp: "TCP_HIT/200"],
[video_id: 12410, tcp: "TCP_HIT/200"],
[video_id: 339342, tcp: "TCP_HIT/200"],
[video_id: 414098, tcp: "TCP_HIT/200"],
[video_id: 160842, tcp: "TCP_HIT/206"],
[video_id: 160842, tcp: "TCP_HIT/206"],
[video_id: 160842, tcp: "TCP_HIT/206"],
[video_id: 160842, tcp: "TCP_HIT/206"],
[video_id: 160842, tcp: "TCP_HIT/206"],
[video_id: 160842, tcp: "TCP_HIT/206"],
[video_id: 160842, tcp: "TCP_HIT/206"],
[video_id: 160842, tcp: "TCP_HIT/206"],
[video_id: 367665, tcp: "TCP_HIT/200"],
[video_id: 367706, tcp: "TCP_HIT/200"],
[video_id: 414098, tcp: "TCP_HIT/200"],
[video_id: 312985, tcp: "TCP_MISS/200"],
[video_id: 414098, tcp: "TCP_HIT/200"],
[video_id: 398629, tcp: "TCP_HIT/206"],
[video_id: 398629, tcp: "TCP_HIT/206"],
[video_id: 398629, tcp: "TCP_HIT/206"],
[video_id: 398629, tcp: "TCP_HIT/206"],
[video_id: 398629, tcp: "TCP_HIT/206"],
[video_id: 398629, tcp: "TCP_HIT/206"],
[video_id: 23261, tcp: "TCP_HIT/200"],
[video_id: 414098, tcp: "TCP_HIT/200"],
[video_id: 12410, tcp: "TCP_HIT/200"],
[video_id: 291986, tcp: "TCP_HIT/200"],
[video_id: 360634, tcp: "TCP_HIT/200"],
[video_id: 186001, tcp: "TCP_HIT/206"],
[video_id: 186001, tcp: "TCP_HIT/206"],
[video_id: 186001, tcp: "TCP_HIT/206"],
[video_id: 186001, tcp: "TCP_HIT/206"],
[video_id: 186001, tcp: "TCP_HIT/206"],
[video_id: 186001, tcp: "TCP_HIT/206"],
[video_id: 186001, tcp: "TCP_HIT/206"],
[video_id: 186001, tcp: "TCP_HIT/206"],
[video_id: 186001, tcp: "TCP_HIT/206"],
[video_id: 186001, ...],
[...],
...
]}
Conclusion
In this post, we distilled a URL value down to an integer value while keeping the remaining key-value pairs in the list intact. In the next post, we group the TCP_HIT/MISS entries by their video_id.
If you like this post, please share and subscribe!