Good Words

In addition to a list of "bad" words, you also have a list of "good" words. These are on-topic words specific to the subject matter of your mailing list.

Suppose your mailing list was about baseball.

Your list of good words might contain:

 BASEBALL, 3
 INNING, 3
 STRIKE, 2
 BAT, 2
 WIN, 1
 LOSE, 1

Here you have a list of words that might come up in a discussion of baseball. Some words, like BASEBALL and INNING, are highly specific to baseball and are unlikely to come up in other conversations. These words are given a score of 3. Other words, like STRIKE and BAT, are common in discussions of baseball but could come up in other subjects. These words get a score of 2. Finally, words like WIN and LOSE are common to all sports and many other areas of life as well, so they are scored at just 1 point.

The moderation program will count up the occurrences of good words in each message. Each occurrence adds its score to the total. The score for a good word remains fixed, no matter how many times that word occurs in a message.
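
As a rough illustration of this counting rule, here is a minimal Python sketch. The function name good_word_points and the sample message are hypothetical, and the sketch assumes whole-word, case-insensitive matching, which the moderation program may or may not use for good words.

```python
import re

# Hypothetical good-word list for a baseball mailing list (word -> score).
GOOD_WORDS = {"BASEBALL": 3, "INNING": 3, "STRIKE": 2,
              "BAT": 2, "WIN": 1, "LOSE": 1}

def good_word_points(message, good_words=GOOD_WORDS):
    """Sum the score of every good-word occurrence in the message.

    Assumption: words are matched whole and without regard to case.
    Every occurrence adds the word's full score; there is no reduction
    for repeats.
    """
    total = 0
    for token in re.findall(r"[A-Za-z0-9'-]+", message.upper()):
        total += good_words.get(token, 0)
    return total

message = "Great win! He swung the bat in the ninth inning, and the bat broke."
# WIN (1) + BAT (2) + INNING (3) + BAT (2) = 8
print(good_word_points(message))
```

Note that BAT is counted twice at its full score of 2, since the score of a good word does not change with repetition.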

The program will compute the density of good word points, i.e., the total number of points divided by the total number of bytes in the message. An on-topic message should have a reasonable number of points for its size. A long message with a very low point count will be assessed a large penalty for being off-topic. This penalty might not be enough to block the message by itself, but undesirable off-topic messages often incur other penalties. For example, they may be SPAM messages containing typical SPAM phrases. Note that almost all messages are assigned some small penalty for being off-topic, usually from 1 to 10 penalty points, which is unlikely to cause the message to be blocked, assuming a threshold of 30.
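
The density calculation can be sketched in a few lines of Python. The function name good_word_density is hypothetical, and the sketch assumes that "bytes" means the UTF-8 encoded length of the message body. How the program converts a density into penalty points is not described here, so the sketch stops at the density itself.

```python
def good_word_density(points, message):
    """Good-word points per byte of message text.

    Assumption: message size is the UTF-8 encoded length of the body.
    An empty message is given a density of 0.0 to avoid dividing by zero.
    """
    size = len(message.encode("utf-8"))
    return points / size if size else 0.0

# A short message with 5 points (STRIKE 2 + INNING 3) in 32 bytes.
short_on_topic = "Strike three in the last inning!"
print(good_word_density(5, short_on_topic))

# A 1000-byte message that earned only 1 point has a far lower density.
long_off_topic = "Limited time offer. " * 50
print(good_word_density(1, long_off_topic))
```

The second message is the kind of long, low-scoring text that would attract a larger off-topic penalty.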

To get accurate off-topic penalties, it's recommended that you have 100 or more on-topic words. The system automatically increases the weight of the off-topic penalty as you add more good words. With a small number of words, the program can't confidently say that a message is truly off-topic, so the penalty will be small.


Bad Words

You have a list of "bad" words or phrases that might indicate an offensive or off-topic message. Initially, this list will contain over 100 words that we have chosen. Each word in the list carries a certain penalty value, as indicated by the number beside it. When you click "Save Changes", the system automatically sorts the list and converts all words to upper case. For example:

        1-800-, 12
        CASINO, 8
        DAMN, 5
        OFFER EXPIRES, 10

You should modify this initial set of words and values to suit the subject of your mailing list, and your own policy. Some words are there to catch typical SPAM. Others are there to catch foul language. Depending on the nature of your mailing list, you might want to allow foul language, and if your list happens to be about gambling you would obviously delete the entry for the word "CASINO". Software developers might want to enter passwords, registration codes, etc. that some people on the list know, but others are expected to pay for. With ListFilter, you have a program running 24 hours per day, reading every message, ensuring that critical information does not get leaked, either accidentally or deliberately.

The moderation program checks every line in every message, looking for bad words. Bad words are matched without regard to upper/lower case. So, CASINO would match CASINO, Casino, casino, casinos, etc. When a bad word on your list is 3 characters or fewer, it must match a complete word, not just a substring of a longer word. Otherwise, short words might trigger too many spurious matches.
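
The matching rules above could be approximated as follows. This is only a sketch: the function name count_bad_word is hypothetical, the entry BUY is an invented example of a short word, and the use of regular expressions is an assumption about how such matching might be done, not a description of ListFilter's internals.

```python
import re

def count_bad_word(entry, message):
    """Count matches of one bad-word entry, line by line, ignoring case."""
    pattern = re.escape(entry)
    if len(entry) <= 3:
        # Entries of 3 characters or fewer must stand alone as a complete
        # word, so reject matches that touch another word character.
        pattern = r"(?<!\w)" + pattern + r"(?!\w)"
    regex = re.compile(pattern, re.IGNORECASE)
    return sum(len(regex.findall(line)) for line in message.splitlines())

# Long entries match as substrings, so "casinos" counts too.
print(count_bad_word("CASINO", "CASINO, Casino, casino, casinos"))  # 4

# Short entries match whole words only, so "buyer" does not count.
print(count_bad_word("BUY", "Buy now! Every buyer wins."))  # 1
```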

The score goes up with each occurrence of any bad word, but the penalty is reduced by 20% for a second occurrence of the same bad word, by another 20% for the third occurrence, and so on. Eventually the penalty (rounded to the nearest integer) will reach zero. This helps to reduce false alarms. It's more significant to see two different, equally bad words appearing in a message than it is to see two occurrences of the same bad word.

For example, 3 occurrences of CASINO would generate a score of 8 + 6 + 5 (rounded off to the nearest integer), rather than 8 + 8 + 8 as you might expect.
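
The arithmetic behind that example can be reproduced with a short Python sketch. The function name occurrence_penalties is hypothetical, and the sketch assumes that each repeat is worth 80% of the previous occurrence (a compounding reduction). Subtracting a flat 20% of the original value each time would also give 8 + 6 + 5 for this example, so the text does not settle which rule the program uses.

```python
def occurrence_penalties(base, count, decay=0.8):
    """Penalty for each successive occurrence of one bad word.

    Assumption: occurrence n (counting from 0) is worth base * decay**n,
    rounded half up to the nearest integer.
    """
    penalties = []
    for n in range(count):
        raw = base * decay ** n
        penalties.append(int(raw + 0.5))
    return penalties

print(occurrence_penalties(8, 3))       # [8, 6, 5]
print(sum(occurrence_penalties(8, 3)))  # 19
```

Under this assumption, the rounded penalty for CASINO falls to zero at the 14th occurrence.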

Similarly, a message that contained CASINO, OFFER EXPIRES, 1-800-, and a second CASINO would be scored as:

    8 + 10 + 12 + 6 = 36

If the threshold was 30, the message would not be approved. It would be forwarded to you for evaluation.
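
Putting the pieces together, the score for this message could be computed as in the sketch below. The names message_score, BAD_WORDS, and THRESHOLD are hypothetical. The sketch assumes a compounding 20% reduction per repeat, and it assumes that a score equal to the threshold is also held, which the text does not specify.

```python
# Penalty values taken from the sample list above.
BAD_WORDS = {"1-800-": 12, "CASINO": 8, "DAMN": 5, "OFFER EXPIRES": 10}
THRESHOLD = 30

def message_score(occurrences, penalties=BAD_WORDS, decay=0.8):
    """Total bad-word score, given how often each entry occurred.

    Assumption: each repeat of the same word is worth 80% of the previous
    occurrence, rounded half up to the nearest integer.
    """
    total = 0
    for word, count in occurrences.items():
        for n in range(count):
            total += int(penalties[word] * decay ** n + 0.5)
    return total

found = {"CASINO": 2, "OFFER EXPIRES": 1, "1-800-": 1}
score = message_score(found)
print(score)  # 36, i.e. (8 + 6) + 10 + 12

# Assumption: a score at or above the threshold is held for the moderator.
print("forward to moderator" if score >= THRESHOLD else "approve")
```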

Note that this scoring system is more subtle than simply saying that any message containing CASINO must be blocked, or any message with an 800-number must be blocked. We'll soon see how even more subtlety can be added to the system. The goal is to create an artificially intelligent moderator that can make the right decision in almost all cases, while letting you have the final say about rejecting messages.

