Filtering a List Using Regex Match and Elixir

This is part four of the nine-post series on Processing a Log File with Elixir. If you find this article helpful, please subscribe and share 🚀

In the last post, Getting Items from List with Elixir, we trimmed our list to contain only the items we want: the TCP_HIT/MISS value and the URL. Our data now looks like this:

[
  %{
    http: "http://example.com/04C0BF/v2/sources/content-owners/sgl-entertainment/275211/v0401185814-1389k.mp4+740005.ts",
    tcp: "TCP_HIT/200"
  },
  %{
    http: "http://example.com/04C0BF/v2/sources/content-owners/sgl-entertainment/326260/v20169101326-1256x544-3063k.mp4+3713710.ts",
    tcp: "TCP_HIT/200"
  },
  %{
    http: "http://example.com/04C0BF/v2/sources/content-owners/cinedigm-itub/398629/v201711170053-2061k.mp4+4582327.ts",
    tcp: "TCP_HIT/200"
  },
  %{
    http: "http://example.com/04C0BF/v2/sources/content-owners/cinedigm-itub/398629/v201711170053-2061k.mp4+4582327.ts", 
    tcp: "TCP_HIT/206"
  },
...
]

 

Let’s take a look at where we are on our steps:

  1. Fetch data from URL
  2. Split each new line into a list item
  3. Split each line into list items
  4. Filter items to only contain the URL and TCP_HIT/MISS
  5. Find the six-digit video id from the URL; it should be the first integer in HTTP paths of:
    "example.com/04C0BF/v2/sources/content-owners/" and
    "example.com/04C0BF/ads/transcodes/"
  6. Group by Video ID
  7. Get Cache Hit and Misses for each Video
  8. Calculate the Cache Hit Misses
  9. Sort by video id
  10. Print to file

 

We now want to get the Video ID from URLs in the following formats, where 384055 and 006817, respectively, are the Video IDs:

http://example.com/04C0BF/v2/sources/content-owners/cinedigm-tubi/384055/v201708302148-2273k.mp4+4023936.ts

http://example.com/04C0BF/ads/transcodes/006817/2791522/v0402000243-854x480-HD-1401k.mp4+22355.ts

But we don’t want URLs that don’t contain those paths. For example, if you were to dig deep into the 5k lines, you’d see entries like:

...
%{
  http: "http://example.com/80C0BF/subtitles/422e3734-382b-4bb3-a753-e3f003d9cdd6.m3u8",
  tcp: "TCP_HIT/200"
},
...

 

Let’s start with a test:

 

# access_log_app/test/access_log_app_test.exs
  test "filters list by strings" do
    list = [
      %{
        http: "http://example1.ts/yep/a/b/c/some-string/123456/01234-56789.1011.ts",
        tcp: "TCP_HIT/206"
      },
      %{
        http: "http://example1.ts/nope/a/b/c/some-string/123456/01234-56789.1011.ts",
        tcp: "TCP_HIT/200"
      },
      %{
        http: "http://example2.ts/yep/d/e/f/some-string/some-string/123456/01234-56789.1011.ts",
        tcp: "TCP_HIT/200"
      },
      %{
        http: "http://example2.ts/nope/d/e/f/some-string/some-string/123456/01234-56789.1011.ts",
        tcp: "TCP_HIT/200"
      }
    ]
    result = filter_list_by_strings(list, ["example1.ts/yep/a/b/c", "example2.ts/yep/d/e/f"])
    assert result == [
      %{
        http: "http://example1.ts/yep/a/b/c/some-string/123456/01234-56789.1011.ts",
        tcp: "TCP_HIT/206"
      },
      %{
        http: "http://example2.ts/yep/d/e/f/some-string/some-string/123456/01234-56789.1011.ts",
        tcp: "TCP_HIT/200"
      }
    ]
  end

 

Our solution is a function that takes our list and the paths we want to match. We run the list through Enum.filter and grab each item's HTTP value with the access syntax, item[:http]. We then call Regex.match? with a pattern that wraps the paths in a parenthesized group “()”, joined by the “or” operator “|”. Without the group, the “|” would split the whole pattern in two rather than choosing between the paths.
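To see why the group matters, here is a minimal sketch comparing a grouped and an ungrouped pattern. The hosts a.test and b.test and the sample strings are made up for illustration; they are not part of the log data.

```elixir
# Grouped: "http://" must be followed by one of the two hosts, then "/".
grouped = ~r/http:\/\/(a\.test|b\.test)\//

# Ungrouped: "|" splits the whole pattern into
# "http://a.test" OR "b.test/", which is probably not what we want.
ungrouped = ~r/http:\/\/a\.test|b\.test\//

IO.inspect(Regex.match?(grouped, "http://a.test/file"))
# → true
IO.inspect(Regex.match?(grouped, "ftp://b.test/file"))
# → false
IO.inspect(Regex.match?(ungrouped, "ftp://b.test/file"))
# → true
```

The last call matches even though the string does not start with "http://", because "b.test/" on its own satisfies the second alternative.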

 

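# access_log_app/lib/access_log_app/CLI.ex
  def filter_list_by_strings(list, paths) do
    # Escape each path so characters like "." match literally, then join them with "|"
    alternatives = Enum.map_join(paths, "|", &Regex.escape/1)
    regex = ~r/http:\/\/(#{alternatives})\//

    Enum.filter(list, fn item ->
      # Interpolation turns a missing (nil) :http value into "", so match? never raises
      Regex.match?(regex, "#{item[:http]}")
    end)
  end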
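As a rough illustration of how this could be applied to our real data, the sketch below runs the same idea against the two paths from step 5 and three URLs from earlier in this post. The module name PathFilterSketch is hypothetical, the paths are passed without their trailing slash (the pattern adds it), and the tcp values on the first two items are placeholders rather than real log entries.

```elixir
defmodule PathFilterSketch do
  # Hypothetical module, used only for this illustration.

  # Keeps the items whose :http value contains "http://<path>/"
  # for any of the given paths.
  def filter_list_by_strings(list, paths) do
    # Regex.escape/1 makes "." and similar characters match literally.
    alternatives = Enum.map_join(paths, "|", &Regex.escape/1)
    regex = Regex.compile!("http://(#{alternatives})/")

    # to_string/1 turns a missing (nil) :http value into "".
    Enum.filter(list, fn item -> Regex.match?(regex, to_string(item[:http])) end)
  end
end

paths = [
  "example.com/04C0BF/v2/sources/content-owners",
  "example.com/04C0BF/ads/transcodes"
]

list = [
  # The tcp values on the first two items are placeholders.
  %{
    http: "http://example.com/04C0BF/v2/sources/content-owners/cinedigm-tubi/384055/v201708302148-2273k.mp4+4023936.ts",
    tcp: "TCP_HIT/200"
  },
  %{
    http: "http://example.com/04C0BF/ads/transcodes/006817/2791522/v0402000243-854x480-HD-1401k.mp4+22355.ts",
    tcp: "TCP_HIT/200"
  },
  %{
    http: "http://example.com/80C0BF/subtitles/422e3734-382b-4bb3-a753-e3f003d9cdd6.m3u8",
    tcp: "TCP_HIT/200"
  }
]

kept = PathFilterSketch.filter_list_by_strings(list, paths)
IO.inspect(length(kept))
# → 2
```

The subtitles entry is dropped because its path matches neither alternative, while the two video URLs are kept.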

 

In this post, we saw how easy it is to filter a list of maps: grab a value from each map, then run a simple regex to keep only the URLs whose paths contain our Video IDs. In the next post, we will use some more regex to get the Video ID from that URL.

 

If you like this post, please share and subscribe!

Published
Categorized as Elixir

By mchavez

Michael Chavez is a web and software developer from San Francisco, California. His experience spans almost a decade, working with San Francisco Bay Area design and development agencies, and high-profile Silicon Valley start-ups and enterprises. After studying Multimedia at City College of San Francisco, Michael taught himself programming languages such as JavaScript, Node.js, and PHP, and founded the web development consultancy Space-Rocket. Michael is currently working with the Elixir programming language.
