alexa-monitor

0.1.0-SNAPSHOT


A simple web crawler to guide new programmers into the Clojure world.

dependencies

org.clojure/clojure 1.11.0-alpha1
hato 0.8.1
hickory 0.7.1
org.clojure/java.jdbc 0.7.12
org.xerial/sqlite-jdbc 3.34.0
overtone/at-at 1.2.0
org.clojure/test.check 1.1.0



 

Obviously, this is the entry point of our application. Here we require the collector and database namespaces and call some functions within them.

(ns alexa-monitor.core
  (:require [alexa-monitor.collector :as collector]
            [alexa-monitor.database :as database]
            [overtone.at-at :as at])
  (:gen-class))

Create a thread pool for at-at scheduling.

(def thread-pool (at/mk-pool))

Crawl the page, grab some useful data via alexa-monitor.collector, and store that data in the database using alexa-monitor.database.

This function calls database/domain-list and gets the list of current domains to watch. Each domain is a hash-map, something like {:domain_id 1, :domain "pouyacode.net"}, and gets sent to collector/main, which adds more information to that hash-map.

Then we (dissoc :domain) from it to make it ready for our rank table, and send the result to database/new-entry to be inserted into the database.

(defn update-rank
  [& args]
  (let [domains (database/domain-list)]
    (doall (map #(-> %
                     collector/main
                     (dissoc :domain)
                     database/new-entry)
                domains))))
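
To make that data flow concrete, here is a hypothetical REPL walk-through for a single domain; the rank and backlink numbers are invented.

(collector/main {:domain_id 1, :domain "pouyacode.net"})
;; => {:domain_id 1, :domain "pouyacode.net", :rank 439841N, :backlink 182N}

(dissoc {:domain_id 1, :domain "pouyacode.net", :rank 439841N, :backlink 182N}
        :domain)
;; => {:domain_id 1, :rank 439841N, :backlink 182N}
;; This map now matches the columns of the rank table, so it can be
;; handed to database/new-entry as-is.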

Create a simple schedule that calls update-rank every 5 minutes (300000 milliseconds).

(defn -main
  [& args]
  (at/every 300000 update-rank thread-pool :desc "updating ranks"))
 

Here we handle everything related to crawling the webpage and extracting the parts we need. Let's choose cool function names; if it's about getting some HTML page, why not call it web-grab?!

We need two different things from the webpage, so we have one function for each! They look a lot alike, but I decided not to mix them into one function, so the result would be easier to read and maintain. Maybe later we can find a way to remove the duplication without making it look messy.

(ns alexa-monitor.collector
  (:require [clojure.string]            ; Used by `digitize` below.
            [hato.client :as web]
            [hickory.core :as hick]
            [hickory.select :as s])
  (:gen-class))

Trim the string and return its digit part.

If it finds any digits in the provided string, it returns a clojure.lang.BigInt. If it's an empty string, we return 0 manually, because the bigint function throws an exception if we call it with an empty string (i.e. (bigint "")).

(defn digitize
  [string]
  (->> string
       (re-seq #"[\d]+")                ; Get all digits in the string,
       (#(if (nil? %)                   ; Empty string returns `nil`.
           0                            ; If it was an empty string, return `0`.
           (bigint (clojure.string/join %))))))
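
A couple of example calls show the behavior (the inputs are hypothetical):

(digitize "1,234,567")   ; => 1234567N  (digit runs are joined, separators dropped)
(digitize "")            ; => 0         (no digits found, so we return 0 manually)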

Retrieve the page and return its HTML output.

Makes a simple HTTP GET request, ignores the status code, and returns only the HTML body.

The URL is hard-coded here. The live https://www.alexa.com/minisiteinfo/ endpoint is the active one, for making actual requests to Alexa.

For development, you can switch the #_ reader-ignore to the localhost form instead, and run a simple webserver to serve the contents of the resources directory on port 8000. I use python3 -m http.server 8000.

TODO: Error handling.

(defn web-grab
  [url]
  (println "Getting updates for: " url)
  (try (:body
        #_(web/get (str "http://localhost:8000/" url "/"))
        (web/get (str "https://www.alexa.com/minisiteinfo/" url)))
       (catch Exception e "")))
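
A hypothetical REPL call; it needs network access (or the local development server):

(web-grab "pouyacode.net")
;; prints: Getting updates for:  pouyacode.net
;; => "<!doctype html>..."   ; the raw HTML, or "" if the request failed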

Generate a hickory tree from the HTML input.

Let's have 'hickory' handle everything. (Despite the name hiccupize, as-hickory produces hickory data, which is the format hickory.select works on.)

(defn hiccupize
  [html]
  (-> html
      hick/parse
      hick/as-hickory))
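
For example (hypothetical input):

(hiccupize "<a href=\"#\">439,841</a>")
;; => {:type :document, :content [...]}   ; a nested hickory tree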

Process the hiccup and read the Alexa rank, using multimethods to create a simple layer of abstraction.

(defmulti scrape :what)
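
The dispatch function is the keyword :what, so the dispatch value of (scrape params hiccup) is the :what key of the first argument (here hiccup stands for a tree produced by hiccupize):

(scrape {:what :rank} hiccup)      ; dispatches to the :rank method
(scrape {:what :backlink} hiccup)  ; dispatches to the :backlink method
;; Note: a keyword dispatch fn is called with all the arguments, and
;; (:what params hiccup) is a map lookup with `hiccup` as the not-found
;; default -- harmless here, since params always contains :what.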

Extract the rank of our domain from the webpage and digitize it. Then return the result to be added to domain-map.

(defmethod scrape :rank
  [params hiccup]
  (-> (s/select (s/child (s/class "nopaddingbottom")
                         (s/tag :div)
                         (s/tag :a))
                hiccup)
      first
      :content
      second
      digitize))
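
The selector path (div.nopaddingbottom > div > a) comes straight from the code; the rest of the markup below is a hypothetical sketch of the Alexa widget, not the exact page source:

;; <div class="nopaddingbottom">
;;   <div>
;;     <a href="..."><img .../>439,841</a>
;;   </div>
;; </div>
;;
;; first takes the first matching <a>, :content its children, and second
;; the text node holding the rank, which digitize then cleans up.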

Extract the backlink count and digitize it, then return the result to be added to domain-map.

(defmethod scrape :backlink
  [domain-map hiccup]
  (-> (s/select (s/child (s/class "nopaddingbottom")
                         (s/tag :div)
                         (s/class "nomargin")
                         (s/tag :a))
                hiccup)
      first
      :content
      first
      digitize))

Entry point.

Creates a nice data flow through every function of this namespace: first request the URL and build a hickory tree from its HTML elements, then extract the rank and backlink from it, and add everything to the provided domain-map. We don't need to convert the bigint from digitize to an int.

(defn main
  [domain-map]
  (let [hiccup (-> :domain
                   domain-map
                   web-grab
                   hiccupize)]
    (-> domain-map
        (conj {:rank (scrape {:what :rank} hiccup)})
        (conj {:backlink (scrape {:what :backlink} hiccup)}))))
 

A small namespace with a few useful functions for working with the database.

(ns alexa-monitor.database
  (:require [clojure.java.jdbc :refer :all])
  (:gen-class))

We currently use SQLite. Since it's a small project, even a CSV file could do the trick, but let's keep things professional! MySQL or a similar database would eliminate some of the troubles we have with java.io.File and its weird behavior across the different stages of our project (repl, uberjar, native-image), but we'll try to make it work. If we can't, we'll switch to MySQL (or preferably PostgreSQL).

(def db
  {:classname "org.sqlite.JDBC"
   :subprotocol "sqlite"
   :subname (.getAbsolutePath (new java.io.File "db/database.db"))})

Create the database and its tables.

For now, we create the db directory and run this function manually (in the repl) whenever we need to create the database. Check out this (beginner-friendly) issue if you want to write error handling for this part, so the directory and database get created automatically when the software can't find them.

(defn create-db
  []
  (try (db-do-commands
        db
        (create-table-ddl :rank
                          [[:id :integer :primary :key :asc]
                           [:domain_id :integer]
                           [:rank :integer]
                           [:backlink :integer]
                           [:timestamp :datetime :default :current_timestamp]
                           [:foreign :key "(domain_id)" :references :domains "(domain_id)"]]))
       (catch Exception e
         (.println *err* (.getMessage e))))
  (try (db-do-commands
        db
        (create-table-ddl :domains
                          [[:domain_id :integer :primary :key :asc]
                           [:domain :text]]))
       (catch Exception e
         (.println *err* (.getMessage e)))))
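
The issue above asks for exactly that kind of automation. A minimal sketch, assuming we keep the db/database.db path (the ensure-db-dir! name is hypothetical):

(defn ensure-db-dir!
  "Hypothetical helper: create the `db` directory if it's missing,
  so create-db can run without any manual setup."
  []
  (.mkdirs (java.io.File. "db")))

;; Then, somewhere during startup:
;; (ensure-db-dir!)
;; (create-db)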

Get the list of tracked websites from the domains table. The result would be like: ({:domain_id 1, :domain "domain.com"} {:domain_id 2, :domain "domain.net"})

(defn domain-list
  []
  (-> db
      (query ["SELECT * from domains"])))

Get the last recorded rank for the given domain id. It returns the rank as a plain number, or nil when there's no record yet.

(defn last-rank
  [domain-id]
  (-> db
      ;; Parameterized query; avoids string concatenation and SQL injection.
      (query ["SELECT rank FROM rank WHERE domain_id = ? ORDER BY id DESC LIMIT 1"
              domain-id])
      first
      :rank))

Check whether the current rank is not= the last recorded rank; if it changed, insert the new data into the rank table.

(defn new-entry
  [domain-map]
  (if-not (nil? (:rank domain-map))
    (let [domain (:domain_id domain-map)
          last-record (last-rank domain)]
      (if (not= last-record (:rank domain-map))
        (do
          (insert! db "rank" domain-map)
          (println "New rank:" domain-map))
        (println "No change:" domain-map)))
    ;; PrintWriter.println takes a single argument, so build the
    ;; message with str instead of passing two arguments.
    (.println *err* (str "Connection error: " domain-map))))