Verify HTTP links with awesome_bot

When writing articles, blog posts, technical papers, READMEs, etc., it's common to reference other web pages. Your content may not move, but there is no guarantee other people's stuff won't: sites go down, links change and certificates expire. Few things are as frustrating as finding that six-year-old blog post with the answers to your problems, only to find it full of dead links. Thankfully you can set a good example and check your own writing for dead links with a little Ruby tool called awesome_bot.

The awesome_bot name comes from the "awesome pages" that are rather popular on GitHub. If you haven't come across them, awesome pages are usually collections of links to "awesome" tools, resources, libraries etc. for a specific topic or ecosystem. Examples are awesome-dotnet, awesome-react, awesome-ai-awesomeness or awesome-ci - the list is endless. Needless to say, these pages contain lots and lots of links, hence the need to verify that those links are still alive.

If you are into Ruby and gems, awesome_bot can be installed with gem:

gem install awesome_bot  

I prefer the container approach, for which the dkhamsing/awesome_bot image can be used. The examples below use the andmos/awesome-bot image.

An example execution of awesome_bot might look like this:

$ docker run --rm -v "$(pwd)":/mnt -t andmos/awesome-bot -f **/*.csv --allow-redirect --allow 429
> Checking links in misc/Roasteries.csv
> Will allow errors: 429
> Will allow redirects
Links to check: 14  
  01. https://www.kaffebrenneriet.no/
  02. https://www.timwendelboe.no/
  03. https://sh.no/
  04. https://www.kaffa.no/
  05. https://www.srw.no/
  06. https://www.fjellbrent.no/
  07. https://jacobsensvart.no/
  08. https://www.facebook.com/stormkaffe
  09. https://www.pala.no/
  10. https://www.langorakaffe.no/
  11. https://inderoy.coffee/
  12. https://bonneribyen.no/
  13. https://www.facebook.com/brentkaffe/
  14. https://senjaroasters.com/
Checking URLs: ✓✓✓✓✓✓✓→✓✓✓✓✓✓  
No issues :-)  

In this example we check links from CSV files and use two flags: --allow-redirect to, well, allow redirects (without it, redirected links are reported as errors) and --allow 429 to whitelist the "Too many requests" status code.
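To make the effect of the two flags concrete, here is a simplified model of the decision in shell. It is a sketch of the behaviour described above, not awesome_bot's actual code: the function name is_issue and its arguments are made up for illustration, and treating every 3xx status as a redirect is an assumption.

```shell
# Simplified model of the flag behaviour; NOT awesome_bot's implementation.
# Usage: is_issue STATUS ALLOW_REDIRECT ALLOWED_CODES
#   STATUS          numeric HTTP status code of the checked link
#   ALLOW_REDIRECT  "yes" or "no" (hypothetical stand-in for --allow-redirect)
#   ALLOWED_CODES   space-separated status codes (hypothetical stand-in for --allow)
# Returns 0 when the link would be reported as an issue, 1 otherwise.
is_issue() {
  local status=$1 allow_redirect=$2 allowed=$3
  local code

  # Explicitly allowed status codes are never reported.
  for code in $allowed; do
    if [ "$status" = "$code" ]; then
      return 1
    fi
  done

  # 2xx responses are fine.
  if [ "$status" -ge 200 ] && [ "$status" -lt 300 ]; then
    return 1
  fi

  # Assumption: 3xx counts as a redirect and only passes when redirects are allowed.
  if [ "$status" -ge 300 ] && [ "$status" -lt 400 ] && [ "$allow_redirect" = "yes" ]; then
    return 1
  fi

  # Everything else is an issue.
  return 0
}
```

With this model, `is_issue 301 no ""` reports an issue while `is_issue 301 yes ""` does not, and `is_issue 429 no "429"` passes because 429 is on the allow list.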

If something is off with a link, like a 404, awesome_bot exits with a non-zero exit code and lists the issues in the report:

$ docker run --rm -v "$(pwd)":/mnt andmos/awesome-bot -f *.md
> Checking links in README.md
Links to check: 1  
  1. https://www.an.no/some/dead/link
Checking URLs: x

Issues :-(  
> Links
  1. [L1] 404 https://www.an.no/some/dead/link
> Dupes
  None ✓

Wrote results to ab-results-README.md.json  
Wrote filtered results to ab-results-README.md-filtered.json  
Wrote markdown table results to ab-results-README.md-markdown-table.json  
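Because a failed check is signalled through the exit code, a script can react to it. The following is a minimal sketch of such a wrapper, assuming the same andmos/awesome-bot image as above. The function names run_link_check and verify_links, and the LINK_CHECKER override variable, are hypothetical helpers for this example and are not part of awesome_bot.

```shell
# Run the link check with whatever arguments the caller passes on.
# LINK_CHECKER is a hypothetical override hook (handy for trying the
# wrapper without Docker); when unset, the container image is used.
run_link_check() {
  if [ -n "${LINK_CHECKER:-}" ]; then
    "$LINK_CHECKER" "$@"
  else
    docker run --rm -v "$(pwd)":/mnt andmos/awesome-bot "$@"
  fi
}

# Report the outcome based on the exit code of the check.
verify_links() {
  local status
  if run_link_check "$@"; then
    echo "links ok"
  else
    status=$?
    echo "link check failed (exit code $status)" >&2
    return "$status"
  fi
}
```

A call could look like `verify_links -f README.md --allow-redirect --allow 429`. The arguments are handed to awesome_bot unchanged, so the wrapper itself makes no assumptions about which flags exist.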

awesome_bot can be automated to run on a schedule with your favorite CI system - here is a GitHub Actions example:

name: Verify Links  
on:  
  pull_request:
  workflow_dispatch:
  schedule:
    - cron:  '0 13 * * 1'
jobs:  
  Awesome-bot:
    name: Run Awesome-bot
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - name: Verify Links
        run: |
          docker run --rm -v "$(pwd)":/mnt andmos/awesome-bot -f *.md --allow-redirect --allow 429 --allow-ssl --white-list "nasdaq.com,researchgate.net"