How I handle spam

One frustrating aspect of running a public-facing contact form is dealing with spam, especially in the age of LLMs, when bots are more prevalent than ever. Over the years, my approach to fighting spam has become more sophisticated. At the same time, I don't want to miss important messages that might lead to legitimate business prospects, so my primary goal is to let humans through and block everything else.

These are the different layers I've implemented:

IP-based blocklist and User-agent analysis

In the early days, I'd see spam and block the corresponding IP address. This was not effective. Spammers typically work from whole blocks of IPs and rotate through them to bypass naïve IP-based blocklists, and there's no guarantee that a single visitor keeps the same IP anyway.
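Since the addresses tend to come in blocks, matching ranges instead of single addresses is a small improvement, even if it doesn't solve the rotation problem. Here's a minimal sketch using Ruby's standard IPAddr class. The IpBlocklist class name is hypothetical, and the ranges are reserved documentation addresses rather than real offenders.

```ruby
require "ipaddr"

# Hypothetical helper: checks a visitor IP against blocked CIDR ranges
# instead of a list of single addresses.
class IpBlocklist
  def initialize(cidrs)
    @ranges = cidrs.map { |cidr| IPAddr.new(cidr) }
  end

  # True when the address falls inside any blocked range.
  def blocked?(ip)
    address = IPAddr.new(ip)
    @ranges.any? { |range| range.include?(address) }
  end
end

blocklist = IpBlocklist.new(["203.0.113.0/24", "2001:db8::/32"])
blocklist.blocked?("203.0.113.57") # => true
blocklist.blocked?("198.51.100.7") # => false
```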

The next obvious step is to look at the request's User-Agent. If it doesn't come from a popular browser, we can easily categorize the message as spam. This doesn't work very well in practice, though, since most spammers are aware of the check and craft their requests specifically to avoid being detected as bots.

Hidden inputs and validations

An early stop-gap is to add an input that isn't displayed to users but that a bot will fill out automatically. If the field arrives with a value, we can categorize the message as spam. This keeps out lazy bots, but more sophisticated ones will be able to detect the trap and skip it.
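A minimal sketch of the server-side half of that check, in plain Ruby. The field name "website" is an assumption for illustration. A browser submits the hidden input with an empty value, so the check looks for a non-blank value rather than the mere presence of the key.

```ruby
# Hypothetical field name. The input is hidden from humans with CSS,
# so any non-blank value suggests an automated submission.
HONEYPOT_FIELD = "website"

def honeypot_triggered?(params)
  !params.fetch(HONEYPOT_FIELD, "").to_s.strip.empty?
end

honeypot_triggered?("website" => "http://spam.example") # => true
honeypot_triggered?("website" => "")                    # => false
```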

Additionally, if the message is really short or really long, we can reject the input. This is simple to do with ActiveRecord::Validations:

ruby

class Message < ApplicationRecord
  validates :email,
            presence: true,
            format: { with: URI::MailTo::EMAIL_REGEXP },
            length: { minimum: 6, maximum: 254 }

  validates :name,
            presence: true,
            length: { maximum: 100 },
            format: { without: %r{[\d\#!/:]}i, message: "cannot contain numbers or punctuation" }

  validates :text, presence: true, length: { minimum: 25, maximum: 5000 }
  validate :email_is_not_noreply
  validate :email_is_not_blacklisted, unless: :captcha_validated?
  validate :ip_is_not_blacklisted, unless: :captcha_validated?
  validate :name_is_less_than_3_words
  validate :is_not_spam, unless: :captcha_validated?
  validate :does_not_contain_html
  validate :language_is_allowed, unless: :captcha_validated?
  validate :text_must_be_more_than_4_words, unless: :captcha_validated?
  validate :crc32_is_unique
  validate :email_has_mx_record
  validate :country_is_allowed

  # ...
end
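The custom validators themselves are elided above. As one guess at what a check like crc32_is_unique could do, here's a plain-Ruby sketch that rejects a message whose normalized body has been seen before. The DuplicateDetector class and its in-memory Set are stand-ins I've assumed in place of a database lookup.

```ruby
require "set"
require "zlib"

# Hypothetical stand-in for a database-backed uniqueness check:
# remembers a CRC32 checksum of each normalized message body.
class DuplicateDetector
  def initialize
    @seen = Set.new
  end

  # Returns true the first time a body is seen, false for repeats.
  def unique?(text)
    normalized = text.downcase.gsub(/\s+/, " ").strip
    # Set#add? returns nil when the checksum was already present.
    !@seen.add?(Zlib.crc32(normalized)).nil?
  end
end

detector = DuplicateDetector.new
detector.unique?("Hello,  I'd like a quote.")  # => true
detector.unique?("hello, i'd like a quote. ")  # => false
```

CRC32 is cheap but not collision-resistant, which is an acceptable trade-off when the only goal is catching the same message sent over and over.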

Regex based matchers with a scoring system

The next step was to analyze the content of the messages. Spammers mentioned things like SEO optimization and cryptocurrency, among other things unrelated to my website. I developed a system for matching these keywords and phrases via regular expressions and stored them in the database along with a category and score. The system uses a custom column type built on ActiveModel::Type::Value to serialize and cast Regexp objects, similar to my post on Native encrypted attributes for Rails ActiveModel.

ruby

# app/lib/regex_type.rb
class RegexType < ActiveModel::Type::Value
  def cast(value)
    value = value.source if value.is_a?(Regexp)
    Regexp.new(value, Regexp::IGNORECASE) if value
  end

  def serialize(value)
    value.respond_to?(:source) ? value.source : value
  end
end

ruby

# app/models/spam_matcher.rb
class SpamMatcher < ApplicationRecord
  # regex                         : string
  # score                         : decimal
  # spam_category                 : belongs_to
  # hidden                        : boolean
  # match_count                   : integer

  belongs_to :spam_category

  attribute :regex, RegexType.new
end

The matchers can be used along with Ruby's StringScanner class to compile a final score. Anything over 1.0 is considered spam.
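A minimal sketch of how that scoring could work, assuming each occurrence of a pattern adds the matcher's score. The Matcher struct, the example patterns, and their weights are made up for illustration and stand in for SpamMatcher records loaded from the database.

```ruby
require "strscan"

# Hypothetical in-memory matchers, standing in for database rows.
Matcher = Struct.new(:regex, :score)

MATCHERS = [
  Matcher.new(/\bseo\b/i, 0.6),
  Matcher.new(/crypto(currency)?/i, 0.5),
  Matcher.new(/backlinks?/i, 0.4)
].freeze

SPAM_THRESHOLD = 1.0

# Adds a matcher's score once for every occurrence found in the text.
def spam_score(text, matchers = MATCHERS)
  matchers.sum(0.0) do |matcher|
    scanner = StringScanner.new(text)
    hits = 0
    hits += 1 while scanner.scan_until(matcher.regex)
    hits * matcher.score
  end
end

def spam?(text)
  spam_score(text) > SPAM_THRESHOLD
end

spam?("We offer SEO and crypto backlinks")             # => true
spam?("Hi, I'd like to hire you for a Rails project.") # => false
```

Counting every occurrence means a message that repeats one keyword several times can cross the threshold on its own. Whether that is desirable depends on how the weights are tuned.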

I chose this approach over something like Bayesian classification based on n-grams because it gives me manual control tuned to my use case.

For a while this worked great. I'd get some spam, manually tweak the matchers and never get that kind of spam again.

Eventually, I noticed spammers getting creative with Unicode characters, which led to the next iteration.

Language detection and Transliteration

The regex approach isn't bulletproof. Spammers are crafty and can bypass strict character matching in regex patterns. In the age of UTF-8, text art has come a long way. Take the upside-down text generator, for example: ˙ɐᴉlɐɹʇsn∀ ɯoɹɟ sᴉ ǝƃɐssǝɯ sᴉɥ┴, or 1337Speak. The process of normalizing substitute characters like these is called transliteration. As it turns out, ActiveSupport::Inflector includes a transliterate method that integrates nicely with i18n in Rails.

The process became:

1. Use the CLD gem to detect the language.
2. Transliterate common spam character substitutions.
3. Apply the regex-based spam score.

Here's a small list of the transliterations I've seen in spam messages:

yaml

en:
  i18n:
    transliterate:
      rule:
        Α: "A"
        а: "a"
        α: "a"
        Β: "B"
        в: "B"
        β: "b"
        с: "c"
        С: "C"
        Ε: "E"
        ε: "e"
        е: "e"
        Е: "E"
        Ζ: "Z"
        ζ: "z"
        η: "n"
        Η: "H"
        н: "h"
        Н: "H"
        і: "i"
        Ι: "I"
        Κ: "K"
        к: "k"
        Μ: "M"
        м: "m"
        М: "M"
        Ν: "N"
        ν: "v"
        о: "o"
        О: "O"
        ρ: "p"
        Р: "P"
        ς: "s"
        С: "C"
        ѕ: "s"
        т: "t"
        Т: "T"
        υ: "y"
        Υ: "Y"
        х: "x"
        Х: "X"
        у: "y"
        У: "Y"
        ’: "'"
        ǃ: "!"

Request origin analysis (geo-ip)

Using a free API, we can further limit messages based on information about the IP address. We can deny messages from countries that we don't do business with, or from ASNs belonging to the cloud platforms where bots are typically hosted.

ruby

class GeoIP
  def self.cache
    @cache ||= ActiveSupport::Cache::MemoryStore.new
  end

  def self.api_key
    Rails.application.credentials.ipinfo_key
  end

  def self.info(ip)
    cache.fetch("GeoIP_#{ip}", expires_in: 12.hours) do
      HTTP.get("https://ipinfo.io/#{ip}/json?token=#{api_key}").parse
    end
  end
end
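What to do with the response is a policy decision. As a sketch, assuming the response hash carries a two-letter "country" code and an "org" string that begins with the AS number, the check could look like this. The allow and deny lists are hypothetical, and the ASNs are reserved documentation values.

```ruby
# Hypothetical allow/deny lists; adjust to your own business.
ALLOWED_COUNTRIES = %w[US CA GB].freeze
BLOCKED_ASNS = %w[AS64496 AS64511].freeze # reserved documentation ASNs

# `info` is assumed to look like an ipinfo.io response, where "country"
# is a two-letter code and "org" starts with the AS number.
def origin_allowed?(info)
  country = info["country"].to_s.upcase
  asn = info["org"].to_s[/\AAS\d+/]

  ALLOWED_COUNTRIES.include?(country) && !BLOCKED_ASNS.include?(asn)
end

origin_allowed?("country" => "US", "org" => "AS64500 Example ISP")   # => true
origin_allowed?("country" => "US", "org" => "AS64496 Example Cloud") # => false
```

A missing or malformed response is treated as not allowed here, which is safe as long as the captcha gives real visitors a way through.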

Alphanumeric image-based challenge, or Captcha

Finally, once a message has been categorized as spam, we can generate an alphanumeric image-based challenge, or captcha. I used a Base64-encoded PNG image (check out my PNG implementation) along with the encrypted text as a hidden form field. When the user submits a message including the captcha solution, the captcha text is decrypted and compared to the submission.

This is arguably the most important part. It allows real humans, who may have unintentionally triggered some other layer of spam protection, to bypass the filter.
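The encrypted hidden field can be sketched with Ruby's standard OpenSSL bindings. This is not necessarily how my implementation does it: the CaptchaToken class, the choice of AES-256-GCM, the hex encoding, and the case-insensitive comparison are all assumptions made for illustration.

```ruby
require "openssl"
require "securerandom"

# Hypothetical helper: encrypts the captcha answer so it can travel in a
# hidden form field without revealing the solution.
class CaptchaToken
  CIPHER = "aes-256-gcm"
  IV_BYTES = 12
  TAG_BYTES = 16

  def initialize(key)
    @key = key # 32 random bytes, kept server-side
  end

  # Returns a hex string made of the IV, the auth tag and the ciphertext.
  def encrypt(answer)
    cipher = OpenSSL::Cipher.new(CIPHER).encrypt
    cipher.key = @key
    iv = cipher.random_iv
    ciphertext = cipher.update(answer) + cipher.final
    (iv + cipher.auth_tag + ciphertext).unpack1("H*")
  end

  # True when the token decrypts cleanly and matches the submission.
  def valid?(token, submission)
    raw = [token.to_s].pack("H*")
    return false if raw.bytesize <= IV_BYTES + TAG_BYTES

    cipher = OpenSSL::Cipher.new(CIPHER).decrypt
    cipher.key = @key
    cipher.iv = raw.byteslice(0, IV_BYTES)
    cipher.auth_tag = raw.byteslice(IV_BYTES, TAG_BYTES)
    answer = cipher.update(raw.byteslice(IV_BYTES + TAG_BYTES..)) + cipher.final
    answer.force_encoding("UTF-8").downcase == submission.to_s.strip.downcase
  rescue OpenSSL::Cipher::CipherError, ArgumentError
    false # tampered or malformed token
  end
end

tokens = CaptchaToken.new(SecureRandom.random_bytes(32))
field = tokens.encrypt("X7KQ2")  # value for the hidden form field
tokens.valid?(field, "x7kq2")    # => true
tokens.valid?(field, "wrong")    # => false
```

An authenticated mode like GCM means a token that has been edited in the browser fails to decrypt instead of producing garbage to compare against.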

Result

I'm happy to say that this system has filtered over 10,000 spam messages in the last few years while still allowing legitimate communication through.
