Filtering a List Using Regex Match and Elixir

This is part four of the nine-post series on Processing a Log File with Elixir. If you find this article helpful, please subscribe and share 🚀 In the last post, Getting Items from List with Elixir, we trimmed our list so that it contains only the items we want: the TCP_HIT/MISS value and the URL. Our data now looks like this:

[
  %{
    http: "http://example.com/04C0BF/v2/sources/content-owners/sgl-entertainment/275211/v0401185814-1389k.mp4+740005.ts",
    tcp: "TCP_HIT/200"
  },
  %{
    http: "http://example.com/04C0BF/v2/sources/content-owners/sgl-entertainment/326260/v20169101326-1256x544-3063k.mp4+3713710.ts",
    tcp: "TCP_HIT/200"
  },
  %{
    http: "http://example.com/04C0BF/v2/sources/content-owners/cinedigm-itub/398629/v201711170053-2061k.mp4+4582327.ts",
    tcp: "TCP_HIT/200"
  },
  %{
    http: "http://example.com/04C0BF/v2/sources/content-owners/cinedigm-itub/398629/v201711170053-2061k.mp4+4582327.ts",
    tcp: "TCP_HIT/206"
  },
...
]

~~Fetch data from URL~~
~~Split each new line into a list item~~
~~Split each line into list items~~
~~Filter items to only contain the URL and TCP_HIT/MISS~~
Find the six-digit video ID in the URL; it should be the first integer in HTTP paths of:
- "example.com/04C0BF/v2/sources/content-owners/"
- "example.com/04C0BF/ads/transcodes/"
Group by Video ID
Get Cache Hit and Misses for each Video
Calculate the Cache Hit Misses
Sort by video id
Print to file

We now want to get the Video ID from URLs that follow these formats, where 384055 and 006817, respectively, are the Video IDs:

http://example.com/04C0BF/v2/sources/content-owners/cinedigm-tubi/384055/v201708302148-2273k.mp4+4023936.ts
http://example.com/04C0BF/ads/transcodes/006817/2791522/v0402000243-854x480-HD-1401k.mp4+22355.ts

But we don't want URLs that don't contain those paths. For example, if you dig deep into the 5k lines, you'll see entries like: [ http: "http://example.com/80C0BF/subtitles/422e3734-382b-4bb3-a753-e3f003d9cdd6.m3u8", tcp: "TCP_HIT/200" ]
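To make the distinction concrete, here is a minimal sketch that checks the three URLs above against the two wanted path prefixes. The module name UrlFormats and the function wanted?/1 are hypothetical names used only for illustration; they are not part of the series code.

```elixir
defmodule UrlFormats do
  # Hypothetical helper for illustration; not part of the original series.
  # Returns true when the URL starts with one of the two wanted path prefixes.
  def wanted?(url) do
    Regex.match?(
      ~r{^http://example\.com/04C0BF/(v2/sources/content-owners|ads/transcodes)/},
      url
    )
  end
end

urls = [
  "http://example.com/04C0BF/v2/sources/content-owners/cinedigm-tubi/384055/v201708302148-2273k.mp4+4023936.ts",
  "http://example.com/04C0BF/ads/transcodes/006817/2791522/v0402000243-854x480-HD-1401k.mp4+22355.ts",
  "http://example.com/80C0BF/subtitles/422e3734-382b-4bb3-a753-e3f003d9cdd6.m3u8"
]

# Prints a {boolean, url} tuple per URL: the first two are wanted, the subtitles URL is not.
Enum.each(urls, fn url -> IO.inspect({UrlFormats.wanted?(url), url}) end)
```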

Let's start with a test:

filtering-a-list-using-regex-match.test.ex


test "filters list by strings" do
  list = [
    %{
      http: "http://example1.ts/yep/a/b/c/some-string/123456/01234-56789.1011.ts",
      tcp: "TCP_HIT/206"
    },
    %{
      http: "http://example1.ts/nope/a/b/c/some-string/123456/01234-56789.1011.ts",
      tcp: "TCP_HIT/200"
    },
    %{
      http: "http://example2.ts/yep/d/e/f/some-string/some-string/123456/01234-56789.1011.ts",
      tcp: "TCP_HIT/200"
    },
    %{
      http: "http://example2.ts/nope/d/e/f/some-string/some-string/123456/01234-56789.1011.ts",
      tcp: "TCP_HIT/200"
    }
  ]
  result = filter_list_by_strings(list, ["example1.ts/yep/a/b/c", "example2.ts/yep/d/e/f"])
  assert result == [
    %{
      http: "http://example1.ts/yep/a/b/c/some-string/123456/01234-56789.1011.ts",
      tcp: "TCP_HIT/206"
    },
    %{
      http: "http://example2.ts/yep/d/e/f/some-string/some-string/123456/01234-56789.1011.ts",
      tcp: "TCP_HIT/200"
    }
  ]
end

Our solution is to create a function that takes our list and the paths we want to match. We run the list through Enum.filter and grab the HTTP value from each map using the access syntax, item[:http]. We then call Regex.match? with a pattern that wraps the paths in a group (...). The group limits the "or" operator | to the paths, so the http:// prefix and the trailing slash apply to every alternative.
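A short sketch of why the group matters, using the two paths from the test. The variable names and the ftp:// URL are made up for this illustration.

```elixir
a = "example1.ts/yep/a/b/c"
b = "example2.ts/yep/d/e/f"

# Without a group, "|" splits the whole pattern in two:
#   "http://example1.ts/yep/a/b/c"  OR  "example2.ts/yep/d/e/f/"
ungrouped = ~r/http:\/\/#{a}|#{b}\//

# With a group, only the paths alternate; the prefix and trailing slash always apply.
grouped = ~r/http:\/\/(#{a}|#{b})\//

url = "ftp://example2.ts/yep/d/e/f/123456/file.ts"

IO.inspect(Regex.match?(ungrouped, url)) # true: the right-hand alternative has no "http://"
IO.inspect(Regex.match?(grouped, url))   # false: "http://" is required for both paths
```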

filtering-a-list-using-regex-match.ex


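def filter_list_by_strings(list, paths) do
  # Escape each path so characters like "." match literally, then join them with "|".
  alternatives = Enum.map_join(paths, "|", &Regex.escape/1)
  regex = ~r/http:\/\/(#{alternatives})\//

  Enum.filter(list, fn item ->
    # item[:http] is nil when the key is missing; to_string(nil) returns "".
    Regex.match?(regex, to_string(item[:http]))
  end)
end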
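One thing to watch when interpolating strings into a pattern: characters such as "." keep their regex meaning. The sketch below, with a made-up path and made-up URLs, shows the difference that Regex.escape/1 from the standard library makes.

```elixir
path = "example1.ts/yep"

# Interpolated as-is, "." is a wildcard, so an unrelated host slips through.
raw = ~r/http:\/\/#{path}\//
IO.inspect(Regex.match?(raw, "http://example1xts/yep/video.ts")) # true

# Regex.escape/1 backslash-escapes special characters, so "." only matches a literal dot.
escaped = ~r/http:\/\/#{Regex.escape(path)}\//
IO.inspect(Regex.match?(escaped, "http://example1xts/yep/video.ts")) # false
IO.inspect(Regex.match?(escaped, "http://example1.ts/yep/video.ts")) # true
```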

In this post, we saw how easy it is to filter a list of maps: grab a value from each map, then use a simple regex to keep only the URLs whose paths follow the formats that contain our Video IDs. In the next post, we will use some more regex to get the Video ID from that URL. If you like this post, please share and subscribe!
