ref: cfcca03a12558c480f9cb76c5c758eaa98060d45
dir: /sys/man/8/scanmail/
.TH SCANMAIL 8
.SH NAME
scanmail, testscan \-  spam filters
.SH SYNOPSIS
.B upas/scanmail
[
.I options
]
[
.I qer-args
]
.I root
.B mail
.I sender system rcpt-list
.PP
.B upas/testscan
[
.B -avd
]
[
.B -p
.I patfile
]
[
.I filename
]
.SH DESCRIPTION
.B Scanmail
accepts a mail message supplied on standard input,
applies a file of patterns to a portion of it,
and dispatches
the message based
on the results.
It exactly replaces the
generic queuing command
.IR qer (8)
that is executed from the
.IR rc (1)
script
.B /mail/lib/qmail
in the mail processing pipeline.
Associated with each pattern is an
.I action
in order of decreasing priority:
.in +5
.TP 10
.B dump
the message is deleted and a log entry is written to
.B /sys/log/smtpd
.TP 10
.B hold
the message is placed in a queue for human inspection
.TP
.B log
a line containing the matching portion of the message is written to a log
.in -5
.PP
If no pattern matches or only patterns with an action of
.B log
match, the message is accepted and
.I scanmail
queues the message for delivery.
.I Scanmail
meshes with the blocking facilities
of
.IR smtpd (6)
to provide several layers of
filtering on gateway systems.  In all cases the sender
is notified that the message has been successfully
delivered,
leaving the sender unaware that the message has been potentially delayed or deleted.
.PP
.I Scanmail
accepts the arguments of
.IR qer (8)
as well as the following:
.TF filename
.TP
.B  -c
Save a copy of each message in a
randomly-named file in
directory
.BR /mail/copy .
.TP
.B -d
Write debugging information to standard error.
.TP
.B -h
Queue
.I held
messages by sending domain name.
The
.B -q
option must specify a root directory; messages
are queued in subdirectories of this directory.
If the
.B -h
option is not specified,
messages are accumulated in a subdirectory of
.B /mail/queue.hold
named for the contents of
.BR /dev/user ,
usually
.BR none .
.TF filename
.TP
.B -n
Messages are never held for inspection, but are delivered.  Also known as
.IR "vacation mode" .
.TP
.BI -p " filename"
Read the patterns from
.I filename
rather than
.BR /mail/lib/patterns .
.TP
.BI -q " holdroot"
Queue deliverable messages in subdirectories of
.IR holdroot .
This option is the same as the
.B -q
option of
.IR qer (8)
and must be present if the
.B -h
option is given.
.TP
.B  -s
Save deleted
messages.   Messages are stored, one per randomly-named file,
in subdirectories of
.B /mail/queue.dump
named with the date.
.TP
.B -t
Test mode.  The pattern matcher is applied but the message is
discarded and the result is not logged.
.TP
.B -v
Print the highest priority match.
This is useful
with the
.B -t
option for testing the pattern matcher without actually
sending a message.
.PD
.PP
.I Testscan
is the command line version of
.IR scanmail .
If
.I filename
is missing, it applies the pattern set to
the message on standard input.  Unlike
.IR scanmail ,
which finds the highest priority match,
.I testscan
prints all matches in the portion of the message under test.
It is useful for testing a pattern set or
implementing a personal filter
using the
.B pipeto
file in a user's mail directory.
.I Testscan
accepts the following options:
.TP
.B -a
Print matches in the complete input message
.TP
.B -d
Enable debug mode
.TP
.B -v
Print the message after conversion to canonical form
.RI ( q.v. ).
.TP
.BI -p " filename"
Read the patterns from
.I filename
rather than
.BR /mail/lib/patterns .
.SS Canonicalization
Before pattern matching, both programs convert a portion of
the message header and the beginning of the
message to a canonical form.  The amount of the header
and message body processed are set by
compile-time parameters in the source files.
The canonicalization process converts letters to lower-case and
replaces consecutive spaces, tabs and newline characters
with a single space.  HTML commands are
deleted except for the parameters following
.B A
.BR HREF ,
.B IMG
.BR SRC ,
and
.B IMG
.B BORDER
directives.  Additionally, the following MIME escape sequences
are replaced by their ASCII
equivalents:
.PP
.EX
           Escape Seq   ASCII
           ----------   -----
                =2e       .
                =2f       /
                =20    <space>
                =3d       =
.EE
and the sequence
.I =<newline>
is elided.
.I Scanmail
assembles the sender, destination domain and recipient fields of
the command line into a string that is
subjected to the same canonical processing.
Following canonicalization, the command line and
the two long strings containing
the header and the message body are passed to the
matching engine for analysis.
.SS Pattern Syntax
The matching engine compiles the pattern set
and matches it to each canonicalized input string.
Patterns are specified one per line
as follows:
.PP
.EX
	{*}\fIaction\fP: \fIpattern-spec\fP {~~\fIoverride\fP...~~\fIoverride\fP}
.EE
.PP
On all lines, a
.B #
introduces a comment; there is no way to escape this character.
.PP
Lines beginning with
.B *
contain a
.I pattern-spec
that is a string; otherwise, the
.I pattern-spec
is a regular expression in the style of
.IR regexp (6).
Regular expression matching is many
times less efficient than string matching, so it is
wiser to enumerate several similar strings
than to combine them into a regular expression.
The
.I action
is a keyword terminated by a
.B : 
and separated from the pattern by optional white-space.
It must be one of the following:
.TP 10
.B dump
if the pattern matches, the message is deleted.  If the
.B -s
command line option is set, the message is saved.
.TP 10
.B hold
if the pattern matches, the message is queued in a subdirectory
of
.B /mail/queue.hold
for manual inspection.  After inspection, the queue can be swept
manually using
.B runq
(see
.IR qer (8))
to deliver messages that were inadvertently matched.
.TP 10
.B header
this is the same as the
.B hold
action, except the pattern is only applied to the message header.
This optimization is useful for patterns that match header fields
that are unlikely to be present in the body of the message.
.TP 10
.B line
the sender and a section of the message around the match are written to
the file
.BR /sys/log/lines .
The message is always delivered.
.TP 10
.B loff
patterns of this type are applied only to the canonicalized command line.
When a match occurs, all patterns with
.B line
actions are disabled.  This is useful for limiting
the size of the log file by excluding repetitive messages, such
as those from mailing lists.
.PP
Patterns are accumulated into pattern sets sharing the same action.
The matching engine applies the
.B dump
pattern set first, then the
.B header
and
.B hold
pattern sets, and finally the
.B line
pattern set.  Each pattern set is applied three times:
to the canonicalized command line, to the message header, and
finally to the message body.  The ordering of patterns
in the pattern file is insignificant.
.PP
The
.I pattern-spec
is a string of characters terminated by a
.BR newline ,
.B #
or override indicator,
.BR ~~ .
Trailing white-space is deleted but
patterns containing leading or trailing white-space can
be enclosed in double-quote
characters.  A pattern containing a double-quote
must be enclosed in double-quote
characters and preceded by a backslash.
For example, the pattern
.PP
.EX
	"this is not \\"spam\\""
.EE
.PP
matches the string \fLthis is not "spam"\fP.
The
.I pattern-spec
is followed by zero or more
.I override
strings.  When the specific pattern matches,
each override is applied and
if one matches, it cancels the effect of the pattern.
Overrides must be strings; regular expressions are not supported.
Each override is introduced by the string
.BR ~~
and continues until a subsequent
.BR ~~ ,
.B #
or
.BR newline ,
white-space included.
A
.B ~~
immediately followed by a
.B newline
indicates a line continuation and further overrides continue
on the following line.
Leading white-space
on the continuation line is ignored.  For example,
.PP
.EX
        *hold:   sex.com~~essex.com~~sussex.com~~sysex.com~~
                 lasex.com~~cse.psu.edu!owner-9fans
.EE
.PP
matches all input containing the string
.B sex.com
except for messages that also contain the
strings in the override list.  Often it
is desirable to override a pattern based on
the name of the sender or
recipient.  For this reason, each override
pattern is applied to the header and the command line as well
as the section of the
canonicalized input containing the matching data.
Thus a pattern matching the command line or the header
searches both the command line and the header
for overrides while a match in the body searches
the body, header and command line for overrides.
.PP
The structure of the pattern file and the matching
algorithm define the strategy for detecting
and filtering unwanted messages.  Ideally, a
.B hold
pattern selects a message for inspection and if it
is determined to be undesirable, a specific
.B dump
pattern is added to delete further instances
of the message.  Additionally, it is often
useful to block the sender by updating the
.B smtpd
control file.
.PP
In this regime, patterns with a
.I dump
action, generally match phrases
that are likely to be unique.  Patterns that
hold a message for inspection
match phrases commonly found in undesirable material and
occasionally in legitimate messages.  Patterns
that log matches are less specific yet.  In all
cases the ability to override a pattern by
matching another string, allows repetitive messages
that trigger the pattern, such as mailing lists,
to pass the filter after the first one is processed
manually.  The
.B -s
option allows deleted messages to be salvaged
by either manual or semi-automatic review, supporting
the specification of more aggressive patterns.
Finally, the utility of the pattern matcher is not
confined to filtering spam; it is a generally useful
administrative tool for deleting inadvertently harmful
messages, for example, mail loops, stuck senders or viruses.
It is also useful for collecting or counting messages
matching certain criteria.
.SH FILES
.TF /mail/queue.dump/*
.TP
.B /mail/lib/patterns
default pattern file
.TP
.B /sys/log/smtpd
log of deleted messages
.TP
.B /mail/log/lines
file where
.I log
matches are logged
.TP
.B /mail/queue/*
directories where legitimate messages are queued for delivery
.TP
.B /mail/queue.hold
directory where held messages are queued for inspection
.TP
.B /mail/queue.dump/*
directory where
.I dumped
messages are stored when the
.B -s
command line option is specified.
.TP
.B /mail/copy/*
directory where copies of all incoming messages
are stored.
.SH SOURCE
.TP
.B /sys/src/cmd/upas/scanmail
.SH "SEE ALSO"
.IR mail (1),
.IR qer (8),
.IR smtpd (6)
.SH BUGS
.I Testscan
does not report a match when the body of a message
contains exactly one line.