
regex email invisible text

Question:

I am getting lots of spam with so-called "invisible" text - large blocks of gibberish hidden from view with white font color on white background or in comment tags. In cPanel "account level filters" I am trying to build a regex filter on the email body.

This one (to catch gibberish in comment tags) results in too many false positives because it catches legitimate HTML text which contains occasional comment tags:

\<![ \r\n\t]*--[\S\s]{400,6000}--[ \r\n\t]*\>

These two (for white text on white background) are not very effective - because there are so many ways to write the offending HTML - and I can't figure out how to write clever enough regex:

\<div style=\"color:white\">[ \r\n\t]*.{1500,6000}[ \r\n\t]*\<\/div>

color=[\"\']*\#FFFFF[0-9A-E]

Thanks in advance for your suggestions!


examples...

<div style="color:white"> Several paragraphs of gibberish designed to fool filters. </div>

<!-- Several paragraphs of gibberish designed to fool filters. -->

Answer1:

These are decent but weak indicators of spam. I strongly advise against using any of them on its own to block messages. Consider a system like SpamAssassin instead, which already ships with regexps like the ones you're trying to write. SpamAssassin assigns a small number of points to each indicator and then sums them to decide whether there is enough evidence to label a message as spam.
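To illustrate the scoring idea, here is a minimal Python sketch of additive scoring. The indicator names, patterns, weights and threshold are all made up for the example; they are not SpamAssassin's actual rules or scores.

```python
import re

# Hypothetical indicators: (name, pattern, points). Illustrative values only.
INDICATORS = [
    # An HTML comment holding at least 400 characters.
    ("LONG_HTML_COMMENT", re.compile(r"<!--.{400,}?-->", re.S), 1.0),
    # Foreground colour set to white (the lookbehind skips "background-color").
    ("WHITE_TEXT", re.compile(r"(?<!-)\bcolor\s*[:=]\s*[\"']?white\b", re.I), 0.5),
    # Font size of 0px or 1px.
    ("TINY_FONT", re.compile(r"font-size\s*:\s*[01]px", re.I), 0.5),
]
THRESHOLD = 1.5  # assumed cut-off for this sketch


def score_message(body):
    """Return (total points, names of the indicators that matched)."""
    hits = [(name, points) for name, rx, points in INDICATORS if rx.search(body)]
    return sum(points for _, points in hits), [name for name, _ in hits]


def is_spam(body):
    total, _ = score_message(body)
    return total >= THRESHOLD
```

With these numbers, a message that only has white text scores 0.5 and passes, while white text plus a long hidden comment reaches 1.5 and is flagged. No single weak indicator can block a message by itself.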

SpamAssassin rules of note:

  • __HTML_COMMENT_10000
  • HTML_FONT_TINY
  • HTML_FONT_LOW_CONTRAST

Here is a SpamAssassin rule definition to more exactly address your white-on-white issue:

rawbody  __JOE_COLOR_WHITE    /\bcolor[:=][\s\"\']{0,5}(?:white|\#[ef]{3}|\#[ef].[ef].[ef].)/i
rawbody  __JOE_BGCOLOR_WHITE  /\b(?:bgcolor|background(?:-color)?)[:=][\s\'\"]{0,5}(?:white|\#[ef]{3}|\#[ef].[ef].[ef].)/i
meta     JOE_WHITE_ON_WHITE   __JOE_COLOR_WHITE && __JOE_BGCOLOR_WHITE
score    JOE_WHITE_ON_WHITE   0.5
describe JOE_WHITE_ON_WHITE   Part of the email has white text, another part has white bg

I'm matching a somewhat broader definition of "white", but that appears to be your intent ("FFFFF0" has slightly less blue). My regex is twice as broad, applies to all three RGB channels, and also matches the shorter three-digit hex form. The weakness of the rule I defined above is that it doesn't ensure the white text is actually rendered on a white background: the two patterns can match in unrelated parts of the message. This should be "close enough", but it may accidentally hit some non-spam marketing/newsletter mail.
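If you want to try the patterns before deploying them, they can be checked offline. The sketch below copies the two rawbody regexes into Python's `re` module and mimics the meta rule; the `white_on_white` helper and the sample messages are made up for illustration.

```python
import re

# The two rawbody patterns, copied into Python (the /i flag becomes re.I).
COLOR_WHITE = re.compile(
    r"\bcolor[:=][\s\"\']{0,5}(?:white|\#[ef]{3}|\#[ef].[ef].[ef].)", re.I
)
BGCOLOR_WHITE = re.compile(
    r"\b(?:bgcolor|background(?:-color)?)[:=][\s\'\"]{0,5}"
    r"(?:white|\#[ef]{3}|\#[ef].[ef].[ef].)",
    re.I,
)


def white_on_white(body):
    """Mimic the meta rule: both sub-patterns match somewhere in the body."""
    return bool(COLOR_WHITE.search(body) and BGCOLOR_WHITE.search(body))


# Hypothetical sample bodies.
samples = {
    "spam_like": '<body bgcolor="#FFFFFF"><div style="color:white">gibberish</div></body>',
    "dark_text": '<body bgcolor="#FFFFFF"><p style="color:#333333">hello</p></body>',
    "background_only": '<td style="background-color:#ffffff">hello</td>',
}

for name, html in samples.items():
    print(name, white_on_white(html))
```

The `background_only` sample is worth a look: because `\b` matches between `-` and `color`, a lone `background-color:#ffffff` satisfies both patterns, so the meta rule fires even though no text is white. If that turns out to be too noisy on your mail, one option is to start the first pattern with `(?<!-)\bcolor` so it skips `background-color`.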
