📢 filters
The spamfilter library comes with many built-in filtering algorithms that determine whether a string is spam. This document lists all of these filters.
How to import filters
There are several ways to import the filters you need. If you just want to use a pre-made model for filtering,
you don't need to import them at all; the library imports them itself, so just use spamfilter.premade.
If you want to import the built-in filters into your script, use one of the following ways:
For one single filter:
from spamfilter.filters import Filter
For several filters:
from spamfilter.filters import (
    Capitals,
    Length,
    SpecialChars
)
For all filters:
from spamfilter.filters import *
Please note that wildcard imports are generally not recommended and may raise linter warnings.
All filters explained
Generally, all filters are stacked using a pipeline object, which then checks them one after another.
You construct a filter like this:
Filter(**options, mode="normal")
Each filter also has a check(string: str) method that accepts a string and returns the filter's assessment of it, based on the options given at construction, as a tuple.
The tuple is built as follows:
(passed: bool, output_string: str)
- passed indicates whether the string completed the check successfully and was therefore not flagged as spam.
- output_string is the string returned by the filter, which may contain corrections such as lower-casing all letters when the input contains too many capital letters.
spamfilter.filters.Filter
The filter base class the other filters inherit from.
- Filter.check(string: str): check a string against this filter.
check(string: str) -> tuple[bool, str]
Checks if a given string passes the filter's criteria.
Returns a tuple containing a boolean (whether the string passed) and optionally a string (a version of the input modified by the filter to mitigate errors; it might not be given, depending on the mode selected).
The base Filter class does not modify the string, so it always returns
True and the original string. Any other filter that inherits from
this class should override this method to implement its specific
filtering logic.
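The contract described above can be sketched with a small stand-in. The names BaseFilter and NoDigits are hypothetical and not part of spamfilter; the sketch only illustrates the (passed, output_string) return shape and how a subclass might override check.

```python
class BaseFilter:
    """Hypothetical stand-in for the documented base class (not the library's code)."""

    def check(self, string: str) -> tuple[bool, str]:
        # The base class accepts everything and returns the string unchanged.
        return True, string


class NoDigits(BaseFilter):
    """Illustrative subclass: fails any string that contains a digit."""

    def check(self, string: str) -> tuple[bool, str]:
        passed = not any(ch.isdigit() for ch in string)
        return passed, string


base_result = BaseFilter().check("hello")
digit_result = NoDigits().check("call 555")
```

Real filters additionally take their options and a mode at construction, which this sketch leaves out.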
spamfilter.filters.SpecialChars
Bases: Filter
Check if a string contains too many special characters.
- SpecialChars.percentage: the percentage of the text that must consist of special characters for it to fail.
- SpecialChars.mode: how to handle a failing string.
  - normal: fail the string if it contains too many special characters.
  - crop: remove all special characters from the string if it would fail, then let the string pass.
- SpecialChars.symboldef: what to identify as a symbol.
  - explicit: everything that matches SpecialChars.specialcharset.
  - implicit: everything that does not match SpecialChars.charset.
- SpecialChars.abs_safe_min: absolute number of special characters that are always okay to use.
Was called "Symbols" prior to v2.0.0; the rename was a breaking change.
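How percentage, abs_safe_min and mode might interact can be sketched as a standalone function. This is not the library's implementation: the function name, the default values, the "implicit" definition of a special character (anything that is neither alphanumeric nor whitespace) and the exact comparison operators are all assumptions made for illustration.

```python
def check_special_chars(
    string: str,
    percentage: float = 20.0,   # assumed default, for illustration only
    abs_safe_min: int = 3,      # assumed default, for illustration only
    mode: str = "normal",
) -> tuple[bool, str]:
    # Assumption: "implicit" symbol definition, i.e. everything that is not
    # alphanumeric or whitespace counts as a special character.
    specials = [ch for ch in string if not (ch.isalnum() or ch.isspace())]
    # A small absolute number of special characters is always accepted.
    if not string or len(specials) <= abs_safe_min:
        return True, string
    share = 100 * len(specials) / len(string)
    if share < percentage:
        return True, string
    if mode == "crop":
        # Crop mode strips the special characters and lets the string pass.
        cleaned = "".join(ch for ch in string if ch.isalnum() or ch.isspace())
        return True, cleaned
    return False, string


ok = check_special_chars("Hello, world!")
failed = check_special_chars("$$$ WIN $$$ NOW!!!")
cropped = check_special_chars("$$$ WIN $$$ NOW!!!", mode="crop")
```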
spamfilter.filters.Capitals
Bases: Filter
Check if a string contains too many capital letters.
- Capitals.percentage: the percentage of the text that must be in capitals for it to fail.
- Capitals.abs_safe_min: the absolute number of capital characters that are always okay. Set to -1 to deactivate.
- Capitals.mode: how to handle a failing string.
  - normal: fail the string.
  - crop: convert all letters to lowercase if the string contains too many capitals; the string then always passes.
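The options above could combine roughly as follows. This is a sketch under assumptions, not the library's code: the function name and defaults are made up, and it assumes the percentage is measured against all characters of the string rather than only its letters.

```python
def check_capitals(
    string: str,
    percentage: float = 30.0,   # assumed default, for illustration only
    abs_safe_min: int = 3,      # assumed default, for illustration only
    mode: str = "normal",
) -> tuple[bool, str]:
    capitals = sum(1 for ch in string if ch.isupper())
    # abs_safe_min == -1 deactivates the absolute allowance.
    if abs_safe_min != -1 and capitals <= abs_safe_min:
        return True, string
    # Assumption: the share is computed over all characters.
    if not string or 100 * capitals / len(string) < percentage:
        return True, string
    if mode == "crop":
        # Crop mode lower-cases the string and lets it pass.
        return True, string.lower()
    return False, string


calm = check_capitals("Hello World")
shouting = check_capitals("BUY NOW cheap")
lowered = check_capitals("BUY NOW cheap", mode="crop")
no_allowance = check_capitals("Hi", abs_safe_min=-1)
```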
 
spamfilter.filters.Length
Bases: Filter
Checks if a string matches given length requirements.
- Length.min: The inclusive minimum length.
- Length.max: The inclusive maximum length.
- Length.padding: A character used to fill up strings that are too short in crop mode.
- Length.mode: How to handle failing strings.
  - normal: Fail strings that are too short or too long.
  - crop: Shorten strings that are too long and fill up strings that are too short using Length.padding.
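A minimal sketch of the two modes, assuming that padding is appended on the right and that long strings are cut at the end (the documentation above does not say which side is used). The function and parameter names are hypothetical.

```python
def check_length(
    string: str,
    min_len: int,
    max_len: int,
    padding: str = " ",
    mode: str = "normal",
) -> tuple[bool, str]:
    # Both bounds are inclusive.
    if min_len <= len(string) <= max_len:
        return True, string
    if mode == "crop":
        if len(string) > max_len:
            # Assumption: excess characters are cut from the end.
            return True, string[:max_len]
        # Assumption: the padding character is appended on the right.
        return True, string.ljust(min_len, padding)
    return False, string


too_short = check_length("hi", 5, 8)
padded = check_length("hi", 5, 8, padding=".", mode="crop")
shortened = check_length("abcdefghij", 5, 8, mode="crop")
```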
 
spamfilter.filters.Blocklist
Bases: Filter
Filter text for blocked words. Works better in combination with
BypassDetector.
- Blocklist.mode: How to handle incoming text.
  - normal: search for profane words adjacent to punctuation or spaces.
  - strict: search for any occurrence of a profane word. WARNING: this might detect words like "classic" because they contain parts of a profane word.
  - tolerant: simply replace the problematic words.
- Blocklist.blocklist: a set of words that shall be blocked.
- Blocklist.ignore_regex: a regular expression that matches punctuation characters for splitting the string in non-strict mode.
- Blocklist.profanity_replacement: what to replace profanity with.
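The difference between normal and strict matching can be illustrated with a harmless placeholder word. The helper below is a hypothetical sketch, not the library's implementation, and the default ignore_regex shown here is an assumption.

```python
import re


def contains_blocked(
    text: str,
    blocklist: set[str],
    strict: bool = False,
    ignore_regex: str = r"[\W_]+",  # assumed pattern: split on punctuation and spaces
) -> bool:
    lowered = text.lower()
    if strict:
        # Strict: any occurrence counts, even inside a longer word.
        return any(word in lowered for word in blocklist)
    # Normal: only whole tokens between punctuation or spaces count.
    tokens = re.split(ignore_regex, lowered)
    return any(token in blocklist for token in tokens)


blocked = {"ham"}  # placeholder entry standing in for a profane word
whole_word = contains_blocked("No ham, please", blocked)
inside_word_normal = contains_blocked("I love hamsters", blocked)
inside_word_strict = contains_blocked("I love hamsters", blocked, strict=True)
```

The strict result for "hamsters" is the same kind of false positive the warning about "classic" describes.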
spamfilter.filters.BlocklistFromJSON
Bases: Blocklist
Behaves just like the Blocklist class. Reads a JSON list and inserts its
content into the Blocklist.blocklist property.
- BlocklistFromJSON.file: filename or path of the JSON file.
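The expected file is a plain JSON list of words. The sketch below shows one way such a file could be turned into a set; load_blocklist is a hypothetical helper, and lower-casing the entries is an assumption rather than documented behaviour.

```python
import json
import tempfile
from pathlib import Path


def load_blocklist(path) -> set[str]:
    # Read a JSON list such as ["word1", "word2"] and turn it into a set.
    with open(path, encoding="utf-8") as handle:
        words = json.load(handle)
    # Assumption: entries are normalised to lowercase.
    return {str(word).lower() for word in words}


# Write a small example file to a temporary directory and load it again.
with tempfile.TemporaryDirectory() as tmp:
    example_file = Path(tmp) / "blocklist.json"
    example_file.write_text(json.dumps(["Ham", "eggs"]), encoding="utf-8")
    loaded = load_blocklist(example_file)
```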
spamfilter.filters.API
Bases: Filter
JSON API-based, synchronous spam filter. Requires installation with the
optional API dependencies: pip install spamfilter[api].
If using AI/ML, please make sure to have read the warnings in the documentation.
- API.url: API URL to call.
- API.headers: dictionary of headers to pass to the API.
- API.method: whether to use GET (get) or POST (post).
- API.payload_func: function called before the request to the API is sent; it needs to convert the passed argument, the text string, to a dictionary in the payload format used by your API of choice.
- API.interpretation_func: function called after the response arrives; it receives the JSON response and needs to decide whether the filter shall pass. It must return a tuple of a boolean and the modified string.
- API.timeout: the number of seconds after which the request to the API times out.
- API.mode: currently, only "normal" is supported.
- API.check(string: str): send this string to the API and check the response JSON against the provided interpretation function.
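The two callbacks could look like the following for an imaginary API. The payload key "message" and the response keys "spam" and "cleaned" are invented for this sketch; a real API will use its own format.

```python
def payload_func(text: str) -> dict:
    # Convert the text into the payload the (hypothetical) API expects.
    return {"message": text}


def interpretation_func(response: dict) -> tuple[bool, str]:
    # Assumed response shape: {"spam": bool, "cleaned": str}.
    # The filter passes when the API does not flag the text as spam.
    return (not response["spam"], response.get("cleaned", ""))


payload = payload_func("Win a free cruise!")
verdict = interpretation_func({"spam": True, "cleaned": "Win a cruise"})
```

Both functions are plain callables, so they can be unit-tested without sending any request.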
spamfilter.filters.Regex
Bases: Filter
Check if a string matches a given regular expression.
- Regex.mode: how to handle a failing string.
  - normal: fail the string.
  - censor: censor the match.
- Regex.regex: the regex used to check for matches.
- Regex.replacement: what to replace matches with.
Was called "PersonalInformation" prior to v2.0.0; the rename was a breaking change.
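The two modes can be sketched with the standard re module. The function below is hypothetical; it assumes that a match makes the string fail in normal mode and that censor mode replaces the match and lets the string pass, which the list above does not state explicitly.

```python
import re


def check_regex(
    string: str,
    regex: str,
    replacement: str = "***",
    mode: str = "normal",
) -> tuple[bool, str]:
    pattern = re.compile(regex)
    if pattern.search(string) is None:
        # No match: nothing to object to.
        return True, string
    if mode == "censor":
        # Assumption: censoring replaces every match and lets the string pass.
        return True, pattern.sub(replacement, string)
    return False, string


phone_regex = r"\d{3}-\d{4}"  # example pattern, chosen for this sketch
failed = check_regex("call 555-1234 now", phone_regex)
censored = check_regex("call 555-1234 now", phone_regex, mode="censor")
```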
spamfilter.filters.Email
Bases: Regex
Check if a string contains an email address.
- Email.mode: how to handle a failing string.
  - normal: fail the string.
  - censor: censor the information.
- Email.regex: the regex used to check for email addresses.
- Email.replacement: what to replace email addresses with.
Note: The default regex used by the Email filter is ([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+).
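To see what the default pattern quoted above matches, it can be tried directly with the standard re module. The sample text and the "[email]" replacement are made up for this sketch; the Email filter's own default replacement is not stated here.

```python
import re

# The default pattern as quoted in the note above.
EMAIL_REGEX = r"([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)"

text = "Write to jane.doe@example.com for details."
found = re.findall(EMAIL_REGEX, text)
censored = re.sub(EMAIL_REGEX, "[email]", text)
```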
spamfilter.filters.OpenAI
Bases: Filter
A filter that connects to an OpenAI API endpoint to check if a given string is spam or not. This might introduce significant latency in your pipeline, so use this with caution and only if necessary.
For connecting to remote instances, this may require an API key.
Please make sure to have read the warnings in the documentation.
This filter requires the openai Python package to be installed, which can
be done with pip install spamfilter[openai].
- OpenAI.model: the model to use for checking spam.
- OpenAI.mode: how to handle a failing string.
  - normal: fail the string.
  - correcting: correct the string if it is spam, and always allow it.
 
- OpenAI.base_url: the base URL of the OpenAI API endpoint.
- OpenAI.api_key: API key to use for authentication.
- OpenAI.prompt: the prompt to use for the LLM.
- OpenAI.json_schema: the json schema to use for formatted outputs.
- OpenAI.options: the options to use for the OpenAI API request.
- OpenAI.response_parsing_function: a function that takes the response from the OpenAI API and returns a tuple of a boolean (whether the string passed, i.e. is not spam) and the potentially corrected string.
It is highly recommended to adjust most of these parameters to your needs,
especially the OpenAI.model and OpenAI.prompt parameters.
check(string: str) -> tuple[bool, str]
Sends the given string to the OpenAI API and checks if it is spam or not. Returns a tuple of a boolean (whether the string passed, i.e. is not spam) and the potentially corrected string.
Note: The default response parsing function is the following:
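from typing import Callable, Tuple, Union

RespFuncType = Callable[[dict[str, Union[bool, str]]], Tuple[bool, str]]
# Returns (passed, corrected_text); passed is True when the response says "not spam".
STD_RESP_FUNC: RespFuncType = lambda resp: (  # type: ignore
    not resp["is_spam"],
    (resp["corrected_text"] if "corrected_text" in resp else ""),
)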
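To make the return value of the default function concrete, here is an equivalent named function applied to two sample responses. The name parse_response and the sample dictionaries are illustrative; only the keys is_spam and corrected_text come from the default shown above.

```python
from typing import Union


def parse_response(resp: dict[str, Union[bool, str]]) -> tuple[bool, str]:
    # First element: True when the text passes, i.e. the model says "not spam".
    # Second element: the corrected text, or an empty string if none was given.
    return (not resp["is_spam"], str(resp.get("corrected_text", "")))


spam_case = parse_response({"is_spam": True, "corrected_text": "Hello there"})
clean_case = parse_response({"is_spam": False})
```

Note the inversion: the model reports is_spam, while the filter reports whether the string passed.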
Note: The Ollama filter has been deprecated in favor of the OpenAI filter.
Ollama exposes an OpenAI-compatible API, so you can use the OpenAI filter with it.
spamfilter.filters.MLTextClassifier
Bases: Filter
A filter that instantiates a 🤗 Transformers text classification pipeline and uses it to classify text as spam or not. Note that machine learning is almost never 100% accurate, so this filter may not always return the correct result and let harmful content pass through it on occasion.
Thus, please make sure to have read the warnings in the documentation.
This filter requires the transformers Python package to be installed,
which can be done with pip install spamfilter[transformers].
- MLTextClassifier.__init__.model: the model to use for checking spam.
- MLTextClassifier.mode: how to handle a failing string.
  - normal: fail the string.
- MLTextClassifier.response_parsing_function: a function that parses the response from the model and returns a boolean indicating whether the string passes, i.e. is not classified as spam.
WARNING: The standard model is a hate detection model which will be automatically pulled from Hugging Face (~ 500 MB). You may want to use a more suitable model for your use case, such as a custom spam detection model for email spam detection.
check(string: str) -> tuple[bool, str]
Checks if the given string is spam or not using the text classification
model specified in self.model.
It will not alter the string in any way - if you need corrections, use
the OpenAI filter in correcting mode instead.
Note: The default response parsing function is the following:
from typing import Union

def _default_response_parsing_function(
    result: list[dict[str, Union[str, float]]],
) -> bool:
    """
    Default response parsing function that checks if the label
    is 'spam' or 'toxic'.
    """
    # Pick the label with the highest score.
    sorted_result = sorted(result, key=lambda x: x["score"], reverse=True)
    top_label: str = str(sorted_result[0]["label"]).lower()
    # True means the string passes: the top label is not a blocked one.
    return top_label not in ["spam", "toxic", "hate", "abusive"]
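A custom parsing function might add a confidence threshold so that low-confidence classifications do not fail a string. The sketch below is hypothetical: the function name, the threshold value and the sample labels are assumptions, and the input shape simply follows the type annotation of the default function above.

```python
from typing import Union


def threshold_parsing_function(
    result: list[dict[str, Union[str, float]]],
    threshold: float = 0.8,  # assumed cut-off, tune for your model
) -> bool:
    blocked_labels = {"spam", "toxic", "hate", "abusive"}
    # Take the highest-scoring label, as the default function does.
    top = max(result, key=lambda item: float(item["score"]))
    is_blocked = str(top["label"]).lower() in blocked_labels
    # Fail only when the blocked label is also confident enough.
    return not (is_blocked and float(top["score"]) >= threshold)


confident = threshold_parsing_function(
    [{"label": "spam", "score": 0.95}, {"label": "ham", "score": 0.05}]
)
uncertain = threshold_parsing_function(
    [{"label": "toxic", "score": 0.6}, {"label": "neutral", "score": 0.4}]
)
```

A higher threshold lets more borderline content through, so it trades false positives for false negatives.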
Incorporating these filters
If you want to use these filters, please don't use them as bare Filter instances; wrap them in a Pipeline object instead.
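The Pipeline API itself is not described in this document, so the following is only a conceptual sketch of what chaining means: each check receives the output string of the previous one, and the chain stops at the first failure. run_checks and the two example checks are hypothetical stand-ins, not spamfilter functions.

```python
from typing import Callable

Check = Callable[[str], tuple[bool, str]]


def run_checks(checks: list[Check], string: str) -> tuple[bool, str]:
    current = string
    for check in checks:
        # Each check sees the (possibly corrected) output of the previous one.
        passed, current = check(current)
        if not passed:
            return False, current
    return True, current


def lowercase_shouting(s: str) -> tuple[bool, str]:
    # Correcting check: lower-cases all-caps input and always passes.
    return True, (s.lower() if s.isupper() else s)


def max_20_chars(s: str) -> tuple[bool, str]:
    # Failing check: rejects strings longer than 20 characters.
    return len(s) <= 20, s


accepted = run_checks([lowercase_shouting, max_20_chars], "HELLO")
rejected = run_checks([lowercase_shouting, max_20_chars], "x" * 25)
```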