An advanced AI-powered CLI tool for screening individuals against adverse media with minimal false positives and zero false negatives.
Financial institutions must screen applicants against negative news ("adverse media") to comply with regulations and assess risk. Existing tools generate too many false positives, requiring expensive manual review. Analysts need an intelligent system that can:
- Accurately identify if an article is about the specific person (given name + date of birth)
- Determine sentiment: whether the article portrays them negatively, positively, or neutrally
- Minimize false negatives (missed adverse media is unacceptable for compliance)
- Reduce false positives (unnecessary manual reviews are costly)
- Handle multiple languages including non-Latin scripts
- Provide explainable results with evidence and confidence scores
- False Negatives: 🚨 UNACCEPTABLE - Regulatory violations, reputation damage, financial losses
- False Positives: 💰 COSTLY - Manual review overhead, delayed decisions, operational inefficiency
Our system employs a sophisticated ensemble approach combining rule-based algorithms, advanced NLP, and selective AI enhancement:
- Hybrid Architecture: Rule-based precision + AI-powered disambiguation
- Smart API Usage: ≤3 GPT-5 calls per article through intelligent preprocessing
- Multilingual Support: Handle 100+ languages with context-aware translation
- Evidence-Based Decisions: Full explainability with quoted evidence
- Zero False Negative Design: Conservative thresholds prioritize recall over precision
- ✅ Advanced Person Matching: Multilingual NER + fuzzy matching + phonetics + nicknames
- ✅ Context-Aware Analysis: DOB/age verification + occupation/location cues
- ✅ Intelligent Polarity Detection: Lexicon-based + GPT-5 disambiguation
- ✅ Confidence Calibration: Multi-factor confidence scoring with uncertainty quantification
- ✅ Comprehensive Evaluation: Enhanced testing framework with systematic improvement
graph TB
subgraph "Input Layer"
CLI[CLI Interface<br/>adverse-media-screen]
API_Input[Person + Article URL]
end
subgraph "Orchestration Layer"
MainService[AdverseMediaScreeningService<br/>Workflow Orchestration]
end
subgraph "Data Processing Layer"
ArticleFetcher[Article Fetcher<br/>HTTP + Content Extraction]
Processor[Article Processor<br/>Language Detection + Cleaning]
Translator[AI Translator<br/>Non-English → English]
end
subgraph "Core AI Engine"
EnsembleEngine[Ensemble Decision Engine<br/>Rule-based + AI Disambiguation]
subgraph "Feature Extraction"
FeatureExtractor[Advanced Feature Extractor<br/>NER + Fuzzy + Phonetic]
NameMatcher[Name Matching<br/>Multilingual + Nicknames]
AgeMatcher[Age/DOB Extraction<br/>Context-Aware]
ContextMatcher[Occupation/Location<br/>Cue Detection]
end
subgraph "Decision Making"
RuleEngine[Rule-Based Ensemble<br/>Weighted Feature Scoring]
AIDisambiguator[AI Disambiguator<br/>GPT-5 for Edge Cases]
end
subgraph "Polarity Analysis"
PolarityAnalyzer[Advanced Polarity Analyzer<br/>Lexicon + AI]
LexiconCheck[Adverse Terms Lexicon<br/>Domain-Specific Keywords]
SentimentAI[AI Sentiment Analysis<br/>GPT-5 for Complex Cases]
end
end
subgraph "Support Services"
AIClient[AI Client<br/>GPT-5 Integration]
ConfigManager[Configuration Manager<br/>Environment-Based Settings]
ErrorHandler[Error Handler<br/>Comprehensive Exception Hierarchy]
end
subgraph "Output Layer"
DecisionModel[Decision Model<br/>Match + Polarity + Evidence]
JSONOutput[JSON Output<br/>Structured Results]
Evidence[Evidence Extraction<br/>Quoted Text + Confidence]
end
subgraph "Evaluation System"
EnhancedEval[Enhanced Evaluation<br/>Systematic Testing]
AutoPipeline[Automated Pipeline<br/>Continuous Monitoring]
ErrorAnalysis[Error Analysis<br/>Pattern Detection]
end
%% Flow connections
CLI --> API_Input
API_Input --> MainService
MainService --> ArticleFetcher
ArticleFetcher --> Processor
Processor --> Translator
Translator --> EnsembleEngine
EnsembleEngine --> FeatureExtractor
FeatureExtractor --> NameMatcher
FeatureExtractor --> AgeMatcher
FeatureExtractor --> ContextMatcher
EnsembleEngine --> RuleEngine
RuleEngine --> AIDisambiguator
EnsembleEngine --> PolarityAnalyzer
PolarityAnalyzer --> LexiconCheck
PolarityAnalyzer --> SentimentAI
AIDisambiguator --> AIClient
SentimentAI --> AIClient
Translator --> AIClient
EnsembleEngine --> DecisionModel
DecisionModel --> JSONOutput
DecisionModel --> Evidence
%% Support service connections
MainService -.-> ConfigManager
MainService -.-> ErrorHandler
EnsembleEngine -.-> ConfigManager
%% Evaluation connections
MainService -.-> EnhancedEval
EnhancedEval --> AutoPipeline
EnhancedEval --> ErrorAnalysis
%% Styling
classDef input fill:#e1f5fe
classDef processing fill:#f3e5f5
classDef ai fill:#fff3e0
classDef output fill:#e8f5e8
classDef evaluation fill:#fce4ec
class CLI,API_Input input
class ArticleFetcher,Processor,Translator processing
class EnsembleEngine,FeatureExtractor,PolarityAnalyzer,AIClient ai
class DecisionModel,JSONOutput,Evidence output
class EnhancedEval,AutoPipeline,ErrorAnalysis evaluation
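The end-to-end flow in the diagram can be condensed into a thin orchestration function. The sketch below is illustrative only: `screen`, the stub callbacks, and the `Decision` shape mirror the diagram but are assumptions, not the project's actual API.

```python
# Illustrative orchestration sketch mirroring the diagram; all names here
# (screen, fetch, process, translate, engine) are assumptions, not the
# project's real interfaces.
from dataclasses import dataclass

@dataclass
class Decision:
    match: str        # "yes" | "no" | "unsure"
    polarity: str     # "negative" | "positive" | "neutral" | "unclear"
    confidence: float

def screen(person: dict, url: str, fetch, process, translate, engine) -> Decision:
    """Fetch -> process -> translate (if needed) -> decide, as in the diagram."""
    raw = fetch(url)                      # Article Fetcher
    article = process(raw)                # language detection + cleaning
    if article.get("language") != "en":   # AI Translator: non-English -> English
        article = translate(article)
    return engine(person, article)        # Ensemble Decision Engine

# Minimal stubs to exercise the pipeline shape
decision = screen(
    {"name": "John Smith", "dob": "1985-03-15"},
    "https://example.com/article",
    fetch=lambda url: {"text": "raw html", "url": url},
    process=lambda raw: {"text": "clean text", "language": "en"},
    translate=lambda a: a,
    engine=lambda p, a: Decision("unsure", "unclear", 0.5),
)
```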
The AdvancedFeatureExtractor employs a multi-stage ensemble approach:
class AdvancedFeatureExtractor:
    """
    Ensemble feature extraction combining:
    - Multilingual NER (spaCy)
    - Fuzzy string matching (RapidFuzz)
    - Phonetic matching (Soundex, Metaphone)
    - Nickname detection
    - Context analysis (±100 characters)
    """

    def extract_all_features(self, text: str, person: Person) -> AdvancedExtractedFeatures:
        # 1. NER-based entity extraction
        name_entities = self._extract_person_entities(text)
        # 2. Multi-algorithm name matching
        name_matches = self._find_name_matches(text, person.name, name_entities)
        # 3. Age/DOB reference detection
        age_references = self._extract_age_references(text, person.dob)
        # 4. Occupation/location cue detection
        occupation_refs = self._extract_occupation_references(text, person)
        location_refs = self._extract_location_references(text, person)
        # 5. Ensemble scoring with confidence calibration
        return self._calculate_ensemble_features(...)
Matching Strategies:
- Exact Match: Direct string comparison (confidence: 1.0)
- Fuzzy Match: Levenshtein distance (confidence: 0.6-0.95)
- Phonetic Match: Soundex/Metaphone (confidence: 0.4-0.8)
- Nickname Match: Built-in nickname database (confidence: 0.7-0.9)
- NER Match: spaCy entity recognition (confidence: 0.5-0.85)
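A stdlib-only sketch of how these tiers can be combined. The project itself uses spaCy, RapidFuzz, and dedicated phonetic libraries; `difflib` and the toy Soundex below are stand-ins to show the tiered fallback and confidence bands, not the production algorithms.

```python
# Stdlib stand-ins for the tiered matching strategies above; the toy Soundex
# simplifies the classic rules (it dedups codes across vowels) and is for
# illustration only.
from difflib import SequenceMatcher

def soundex(name: str) -> str:
    """Simplified 4-character Soundex code."""
    codes = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
    name = name.upper()
    digits = []
    for ch in name:
        d = next((v for k, v in codes.items() if ch.lower() in k), "")
        if d and (not digits or digits[-1] != d):
            digits.append(d)
    # Keep the first letter; drop its own code if it produced one
    first_digit = next((v for k, v in codes.items() if name[0].lower() in k), "")
    if digits and first_digit and digits[0] == first_digit:
        digits = digits[1:]
    return (name[0] + "".join(digits) + "000")[:4]

def match_name(candidate: str, target: str) -> tuple[str, float]:
    """Return (strategy, confidence) using the tiers described above."""
    if candidate.lower() == target.lower():
        return ("exact", 1.0)
    ratio = SequenceMatcher(None, candidate.lower(), target.lower()).ratio()
    if ratio >= 0.8:
        return ("fuzzy", 0.6 + 0.35 * ratio)   # lands in the 0.6-0.95 band
    if soundex(candidate) == soundex(target):
        return ("phonetic", 0.6)               # within the 0.4-0.8 band
    return ("none", 0.0)
```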
The EnsembleDecisionEngine implements sophisticated decision logic:
class EnsembleDecisionEngine:
    """
    Multi-stage decision process:
    1. Rule-based ensemble scoring
    2. AI disambiguation for edge cases
    3. Polarity analysis with lexicon + AI
    4. Confidence calibration and evidence aggregation
    """

    def make_decision(self, person: Person, article: Article) -> Decision:
        # Stage 1: Extract features
        features = self.feature_extractor.extract_all_features(article.text, person)
        # Stage 2: Rule-based matching with thresholds
        match_score = self._calculate_match_score(features)
        if match_score >= 0.7:  # Strong match
            match_result = MatchResult.YES
        elif match_score <= 0.2:  # Weak match
            match_result = MatchResult.NO
        else:  # Ambiguous - use AI
            match_result = self._ai_disambiguate(person, article, features)
        # Stage 3: Polarity analysis
        polarity_result = self.polarity_analyzer.analyze_polarity(
            article.text, features, match_result
        )
        # Stage 4: Evidence aggregation and confidence scoring
        return self._build_final_decision(...)
Decision Thresholds:
- Strong Match (≥0.7): High confidence "YES"
- Weak Match (≤0.2): High confidence "NO"
- Ambiguous (0.2-0.7): AI disambiguation required
- Edge Cases: Conservative bias toward "UNSURE" vs false negatives
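The threshold routing can be sketched as follows; the 0.7 and 0.2 cut-offs come from the text above, while `route_match` and the `ai_disambiguate` callback are illustrative names, not the actual API.

```python
# Illustrative threshold routing; route_match and the callback name are
# assumptions, the 0.7 / 0.2 cut-offs come from the documented thresholds.
def route_match(match_score: float, ai_disambiguate) -> str:
    if match_score >= 0.7:   # strong match
        return "yes"
    if match_score <= 0.2:   # weak match
        return "no"
    # Ambiguous band (0.2-0.7): spend an API call rather than risk a
    # false negative - the conservative bias described above.
    return ai_disambiguate(match_score)
```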
The AdvancedPolarityAnalyzer combines lexicon-based and AI-powered analysis:
class AdvancedPolarityAnalyzer:
    """
    Hybrid polarity detection:
    1. Domain-specific adverse terms lexicon
    2. Context-aware sentiment analysis
    3. GPT-5 disambiguation for complex cases
    """

    def analyze_polarity(self, text: str, features: AdvancedExtractedFeatures) -> PolarityAnalysisResult:
        # Stage 1: Lexicon-based adverse term detection
        adverse_terms = self._find_adverse_terms(text)
        positive_terms = self._find_positive_terms(text)
        # Stage 2: Context analysis around person mentions
        context_sentiment = self._analyze_context_sentiment(text, features.name_matches)
        # Stage 3: Rule-based polarity decision
        lexicon_polarity = self._calculate_lexicon_polarity(adverse_terms, positive_terms)
        # Stage 4: AI disambiguation if unclear
        if self._is_polarity_ambiguous(lexicon_polarity, context_sentiment):
            ai_polarity = self._ai_polarity_analysis(text, features)
            return self._reconcile_polarity_analyses(lexicon_polarity, ai_polarity)
        return lexicon_polarity
Adverse Terms Lexicon (Examples):
- Legal: lawsuit, indictment, convicted, sentenced, fraud, embezzlement
- Financial: bankruptcy, default, sanctions, money laundering, tax evasion
- Regulatory: violation, penalty, fine, suspended, revoked, investigation
- Reputational: scandal, controversy, misconduct, corruption, bribery
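A minimal lexicon-based scorer using a few of the terms above. The term lists and the simple count-comparison rule are simplifications for illustration; the real analyzer adds context windows and AI disambiguation.

```python
# Toy lexicon-based polarity scorer; term lists are a small subset of the
# lexicon above and the decision rule is deliberately simple.
ADVERSE_TERMS = {"lawsuit", "fraud", "embezzlement", "sanctions", "scandal",
                 "bribery", "money laundering", "convicted"}
POSITIVE_TERMS = {"award", "donation", "honored", "philanthropy"}

def lexicon_polarity(text: str) -> str:
    t = text.lower()
    adverse = sum(term in t for term in ADVERSE_TERMS)
    positive = sum(term in t for term in POSITIVE_TERMS)
    if adverse > positive:
        return "negative"
    if positive > adverse:
        return "positive"
    # Tied counts: neutral if nothing adverse, otherwise hand off to AI
    return "neutral" if adverse == 0 else "unclear"
```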
# Core business entities
class Person(BaseModel):
    """Person entity with validation and aliases support"""
    name: str
    dob: date
    aliases: List[str] = []
    occupation: Optional[str] = None
    location: Optional[str] = None

class Article(BaseModel):
    """Article entity with metadata and processing state"""
    url: str
    title: Optional[str] = None
    text: Optional[str] = None
    language: Language = Language.UNKNOWN
    publication_date: Optional[datetime] = None
    word_count: int = 0

class Decision(BaseModel):
    """Final screening decision with evidence and confidence"""
    match: MatchResult        # YES, NO, UNSURE
    polarity: Polarity        # POSITIVE, NEGATIVE, NEUTRAL, UNCLEAR
    confidence: float         # 0.0-1.0
    evidence: List[Evidence]  # Supporting evidence with quotes
    reasoning: str            # Human-readable explanation
    api_calls_used: int       # Cost tracking
    processing_time_ms: int   # Performance tracking

class Evidence(BaseModel):
    """Supporting evidence with source tracking"""
    text: str                     # Quoted evidence text
    confidence: float             # Evidence reliability
    evidence_type: str            # name_match, age_reference, adverse_term
    source_span: Tuple[int, int]  # Character positions in original text

class AdverseMediaScreeningService:
    """Main orchestration service - coordinates the entire workflow"""
class EnsembleDecisionEngine:
    """Core decision engine - implements ensemble algorithms"""

class AdvancedFeatureExtractor:
    """Feature extraction - NER + fuzzy matching + phonetics"""

class AdvancedPolarityAnalyzer:
    """Polarity analysis - lexicon + AI sentiment analysis"""

class ArticleFetcher:
    """Article retrieval - HTTP fetching + content extraction"""

class AIClient:
    """GPT-5 integration - translation + disambiguation + sentiment"""

class ProcessingConfig(BaseModel):
    """Core processing parameters"""
    max_api_calls_per_article: int = 3
    default_confidence_threshold: float = 0.7
    enable_fuzzy_matching: bool = True
    fuzzy_matching_threshold: float = 0.8

class OpenAIConfig(BaseModel):
    """AI service configuration"""
    api_key: SecretStr
    model: str = "gpt-4"
    max_tokens: int = 1000
    temperature: float = 0.1

Before setting up the Adverse Media Screening System, ensure you have:
- Python 3.13+ installed (Download Python)
- Git for version control
- OpenAI API Key (required for AI-powered analysis)
- Internet connection for article fetching and API calls
# Clone the repository
git clone https://github.com/your-org/adverse-media-screening.git
cd adverse-media-screening

# Create virtual environment
python3 -m venv venv
# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate
# Verify activation (should show venv path)
which python3

# Install the package in development mode
pip install -e .
# Install additional dependencies for evaluation
pip install schedule # For automated evaluation pipeline
# Verify installation
adverse-media-screen --help

Create your environment configuration file:
# Copy the example environment file
cp .env.example .env
# Edit the .env file with your settings
nano .env  # or vim .env, or your preferred editor

Required environment variables in .env:
# OpenAI Configuration (REQUIRED)
OPENAI_API_KEY=your-openai-api-key-here
# Processing Configuration (Optional - defaults shown)
MAX_API_CALLS_PER_ARTICLE=3
DEFAULT_CONFIDENCE_THRESHOLD=0.7
ENABLE_FUZZY_MATCHING=true
FUZZY_MATCHING_THRESHOLD=0.8
# Logging Configuration (Optional)
LOG_LEVEL=INFO
LOG_FORMAT=json
# Security Configuration (Optional)
RATE_LIMIT_REQUESTS_PER_MINUTE=60
REQUEST_TIMEOUT_SECONDS=30

# Test basic functionality
adverse-media-screen --version
# Test with a sample screening (requires valid OpenAI API key)
adverse-media-screen \
--name "Test Person" \
--dob "1990-01-01" \
--url "https://example.com" \
--verbose

The adverse-media-screen command provides a powerful interface for screening individuals against adverse media.
adverse-media-screen [OPTIONS] COMMAND

Screen a person against an article for adverse media content.
Required Parameters:
- `--name` - Person's full name (quoted if it contains spaces)
- `--dob` - Date of birth in YYYY-MM-DD format
- `--url` - Article URL to analyze
Optional Parameters:
- `--output`, `-o` - Save results to JSON file
- `--verbose`, `-v` - Enable detailed processing output
Display the current version of the tool.
adverse-media-screen version

# Simple screening with minimal output
adverse-media-screen screen \
--name "John Smith" \
--dob "1985-03-15" \
--url "https://news.example.com/article/12345"

# Detailed analysis with processing information
adverse-media-screen screen \
--name "Jane Doe" \
--dob "1978-11-22" \
--url "https://reuters.com/business/finance/article.html" \
--verbose

# Screen and save results to JSON file
adverse-media-screen screen \
--name "Robert Johnson" \
--dob "1965-07-08" \
--url "https://bbc.com/news/business-12345678" \
--output screening-results.json

# Handle names with special characters or multiple parts
adverse-media-screen screen \
--name "María José García-López" \
--dob "1992-12-03" \
--url "https://elpais.com/economia/articulo" \
--verbose

# Process multiple people against the same article
while IFS=, read -r name dob; do
echo "Processing: $name (DOB: $dob)"
adverse-media-screen screen \
--name "$name" \
--dob "$dob" \
--url "https://example.com/article" \
--output "results_$(echo $name | tr ' ' '_').json"
done < people_list.csv

The tool outputs structured JSON with the following key sections:
{
"decision": {
"match": "yes|no|unsure", // Person identification result
"polarity": "negative|positive|neutral|unclear", // Sentiment analysis
"confidence": 0.87, // Overall confidence score (0-1)
"evidence": [...], // Supporting evidence array
"reasoning": "Human-readable explanation",
"api_calls_used": 1, // Cost tracking
"processing_time_ms": 1247 // Performance tracking
},
"person": {
"name": "Input name",
"dob": "Input date of birth"
},
"article": {
"url": "Article URL",
"title": "Extracted title",
"language": "Detected language",
"word_count": 542,
"publication_date": "2024-01-15T10:30:00Z"
}
}

The evidence array contains objects with:
- `text`: Quoted text from the article
- `confidence`: Reliability of this evidence (0-1)
- `evidence_type`: Type of evidence found
  - `name_match`: Direct name mentions
  - `age_reference`: Age or DOB references
  - `adverse_term`: Negative sentiment indicators
  - `occupation_match`: Professional context
  - `location_match`: Geographic context
- `source_span`: Character positions in original text
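Consuming the JSON output might look like this; the `result` document below is a hand-written sample shaped like the schema above, not real tool output.

```python
# Parse a screening result and pull out adverse evidence; the JSON here is a
# fabricated sample matching the documented schema.
import json

result = json.loads("""{
  "decision": {
    "match": "yes", "polarity": "negative", "confidence": 0.87,
    "evidence": [
      {"text": "John Smith was convicted of fraud", "confidence": 0.9,
       "evidence_type": "adverse_term", "source_span": [120, 154]}
    ],
    "reasoning": "Name and DOB match; adverse terms found.",
    "api_calls_used": 1, "processing_time_ms": 1247
  }
}""")

decision = result["decision"]
# Conservative compliance rule: anything not a clear non-match with negative
# polarity goes to manual review
needs_review = decision["match"] != "no" and decision["polarity"] == "negative"
adverse_quotes = [e["text"] for e in decision["evidence"]
                  if e["evidence_type"] == "adverse_term"]
```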
- `0` - Success
- `1` - Error (configuration, API, processing, or validation)
If you're contributing to the project or need to modify the code:
# Install development tools
pip install -e .[dev]
# Or manually install dev dependencies
pip install pytest pytest-cov pytest-mock pytest-asyncio black isort mypy pre-commit

# Install pre-commit hooks for code quality
pre-commit install
# Run hooks manually
pre-commit run --all-files

# Run all tests
pytest
# Run with coverage report
pytest --cov=src/adverse_media_agent --cov-report=html
# Run specific test files
pytest tests/test_cli.py -v

# Format code with Black
black src/ tests/
# Sort imports with isort
isort src/ tests/
# Type checking with mypy
mypy src/adverse_media_agent

For containerized deployment:
# Build the Docker image
docker build -t adverse-media-screening .
# Run with environment variables
docker run -e OPENAI_API_KEY=your-key \
adverse-media-screening \
adverse-media-screen screen \
--name "John Doe" \
--dob "1980-01-01" \
--url "https://example.com/article"

# Install missing dependency
pip install schedule

# Verify .env file exists and contains valid API key
cat .env | grep OPENAI_API_KEY
# Test API key validity
python3 -c "
import openai
import os
from dotenv import load_dotenv
load_dotenv()
client = openai.OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
print('API key is valid')
"

# Ensure virtual environment is activated
source venv/bin/activate
# Reinstall in development mode
pip install -e .
# Check if command is available
which adverse-media-screen

# Update certificates (macOS)
/Applications/Python\ 3.13/Install\ Certificates.command
# Or set environment variable to bypass (not recommended for production)
export SSL_VERIFY=false

# Increase system limits or process articles in smaller chunks
# Check article size before processing
curl -I https://example.com/large-article

- Check logs: Use the `--verbose` flag for detailed output
- Validate input: Ensure date format is YYYY-MM-DD
- Test connectivity: Verify internet access and article URL
- API limits: Check OpenAI API usage and rate limits
- Issue tracker: Report bugs on GitHub issues page
- API Efficiency: The system is designed to use ≤3 API calls per article
- Caching: Results are not cached by default; implement caching for repeated analyses
- Batch Processing: For multiple articles, process sequentially to respect API rate limits
- Article Size: Very large articles (>50KB) may require additional processing time
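The caching and sequential-processing advice above can be combined in a small wrapper. `screen_batch` and its `screen` callback are hypothetical names; the pacing mirrors the RATE_LIMIT_REQUESTS_PER_MINUTE default of 60.

```python
# Sketch of sequential batch processing with a simple in-memory cache;
# screen_batch and the screen callback are illustrative, not part of the tool.
import time

def screen_batch(urls, screen, requests_per_minute=60):
    cache = {}                      # results are not cached by default - add your own
    interval = 60.0 / requests_per_minute
    results = {}
    for url in urls:
        if url in cache:            # repeated analyses hit the cache, no API cost
            results[url] = cache[url]
            continue
        results[url] = cache[url] = screen(url)
        time.sleep(interval)        # sequential pacing to respect API rate limits
    return results
```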
The system includes multiple evaluation tools for thorough testing and improvement:
Test any CSV dataset with detailed statistical analysis and improvement recommendations:
# Run comprehensive evaluation with your data
python evaluation/scripts/comprehensive_evaluation.py evaluation/datasets/your_ground_truth.csv --verbose
# Use sample data for testing
python evaluation/scripts/comprehensive_evaluation.py evaluation/datasets/sample_ground_truth.csv
# Run interactive demo
python evaluation/scripts/run_evaluation_demo.py

Features:
- ✅ Flexible CSV input format
- ✅ Complete statistical analysis (confusion matrix, precision, recall, F1, specificity)
- ✅ Data-driven improvement recommendations
- ✅ Multiple output formats (JSON, text report, CSV comparison)
- ✅ Error categorization and pattern analysis
Advanced evaluation with systematic error analysis:
# The enhanced evaluation functionality is now integrated into the comprehensive evaluation script
# Use the comprehensive evaluation for all testing needs
python evaluation/scripts/comprehensive_evaluation.py evaluation/datasets/enhanced_test_dataset.csv --verbose

Continuous evaluation with regression detection:
# Set up automated monitoring
python evaluation/scripts/automated_evaluation_pipeline.py evaluation/datasets/enhanced_test_dataset.csv --daemon
# Generate trend analysis
python evaluation/scripts/automated_evaluation_pipeline.py evaluation/datasets/enhanced_test_dataset.csv --trend-report

To use the comprehensive evaluation script, prepare a CSV file with these columns:
name,dob,url,expected_match,expected_polarity,expected_confidence_min,language,description,notes
"John Doe","1980-01-01","https://example.com/article","yes","negative",0.8,"en","Fraud conviction","Clear adverse case"
"Jane Smith","1990-05-15","https://example.com/article2","no","neutral",0.0,"en","Different person","Clear non-match"

Optional columns:
- `category`: Test case type (`true_positive`, `true_negative`, `edge_case`, `name_variation`)
- `difficulty`: Case complexity (`easy`, `medium`, `hard`)
- `source`: Data origin (`manual`, `reuters`, `ap_news`)
| Metric | Target | Description |
|---|---|---|
| False Negative Rate | <5% | 🚨 Critical - missed adverse media |
| Accuracy | >85% | Overall correctness |
| Precision | >85% | Avoid false positives |
| Recall | >90% | Catch all true matches |
| Specificity | >85% | Correctly reject non-matches |
| API Efficiency | >0.5 | Decisions per API call |
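A worked example tying the ground-truth CSV format to these targets: load the CSV, compare against predictions, and compute the metrics from confusion-matrix counts. All data and the `predictions` dict are fabricated for illustration.

```python
# Load ground truth in the documented CSV format, score hypothetical
# predictions, and compute the target metrics; all values are made up.
import csv
import io

GROUND_TRUTH = """name,dob,url,expected_match,expected_polarity
John Doe,1980-01-01,https://example.com/article,yes,negative
Jane Smith,1990-05-15,https://example.com/article2,no,neutral
"""

predictions = {  # hypothetical system output keyed by URL
    "https://example.com/article": "yes",
    "https://example.com/article2": "no",
}

tp = fp = tn = fn = 0
for row in csv.DictReader(io.StringIO(GROUND_TRUTH)):
    expected, predicted = row["expected_match"], predictions[row["url"]]
    if expected == "yes":
        tp += predicted == "yes"
        fn += predicted != "yes"   # missed adverse media - the critical case
    else:
        tn += predicted == "no"
        fp += predicted != "no"    # unnecessary manual review

def metrics(tp, fp, tn, fn):
    """Confusion-matrix metrics (zero-denominator guards omitted for brevity)."""
    precision = tp / (tp + fp)                  # target >85%
    recall = tp / (tp + fn)                     # target >90%
    specificity = tn / (tn + fp)                # target >85%
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # target >85%
    fnr = fn / (fn + tp)                        # the critical <5% target
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "specificity": specificity,
            "accuracy": accuracy, "false_negative_rate": fnr, "f1": f1}
```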
| Metric | Typical Value | Notes |
|---|---|---|
| Processing Time | 1-3 seconds | Per article analysis |
| API Calls | 0-3 per article | Smart optimization |
| Memory Usage | <100MB | Efficient text processing |
| Throughput | 20-60 articles/minute | Depends on complexity |
| Languages | 100+ supported | Via AI translation |
src/adverse_media_agent/                 # Main application code
├── cli.py                               # CLI interface
├── models/                              # Domain models
│   ├── core.py                          # Person, Article, Evidence
│   ├── decision.py                      # Decision, ScreeningResult
│   ├── enums.py                         # MatchResult, Polarity, Language
│   └── stats.py                         # Performance tracking
├── services/                            # Business logic
│   ├── main_service.py                  # Workflow orchestration
│   ├── ensemble_decision_engine.py      # Core decision logic
│   ├── advanced_feature_extractor.py    # NER + fuzzy matching
│   ├── advanced_polarity_analyzer.py    # Sentiment analysis
│   ├── article_fetcher.py               # HTTP + content extraction
│   └── ai_client.py                     # GPT-5 integration
├── config/                              # Configuration system
│   ├── base.py                          # Base configuration
│   ├── main_config.py                   # Main config aggregation
│   ├── openai_config.py                 # AI service settings
│   └── processing_config.py             # Processing parameters
├── processor.py                         # Text processing utilities
└── exceptions.py                        # Error handling
evaluation/                              # Evaluation system
├── scripts/                             # Evaluation tools and scripts
│   ├── comprehensive_evaluation.py      # Main evaluation tool
│   ├── automated_evaluation_pipeline.py # Continuous monitoring
│   └── run_evaluation_demo.py           # Interactive demo
├── datasets/                            # Test datasets
│   ├── sample_ground_truth.csv          # Balanced sample data
│   └── enhanced_test_dataset.csv        # Extended test cases
├── results/                             # Evaluation results
└── documentation/                       # Evaluation documentation
tests/                                   # Unit tests
├── test_cli.py                          # CLI testing
├── test_models.py                       # Model testing
├── test_fetcher.py                      # Fetcher testing
└── test_processor.py                    # Processor testing
coverage/                                # Test coverage reports
└── html/                                # HTML coverage reports
# Run all tests
pytest
# Run with coverage
pytest --cov=src/adverse_media_agent --cov-report=html
# Run specific test category
pytest tests/test_models.py -v

This project includes comprehensive documentation organized by purpose:
docs/                                      # All project documentation
├── README.md                              # Documentation overview and guide
├── development/                           # Development documentation
│   ├── plan.md                            # Development roadmap and current status
│   ├── REORGANIZATION_PLAN.md             # Architecture reorganization details
│   ├── ITERATION_SUMMARY.md               # Development history and sprints
│   └── INFRASTRUCTURE_IMPROVEMENTS.md     # Infrastructure enhancements
└── operations/                            # Operational documentation
    └── EVALUATION_ORGANIZATION_SUMMARY.md # System organization guide
evaluation/documentation/                  # Evaluation system documentation
├── EVALUATION_USAGE_GUIDE.md              # Detailed evaluation instructions
├── EVALUATION_IMPROVEMENT_PLAN.md         # Enhancement strategies
├── EVALUATION_SYSTEM_README.md            # Technical evaluation documentation
└── EVALUATION_SUMMARY.md                  # Executive summary
| Purpose | Document | Description |
|---|---|---|
| Getting Started | README.md | This file - project overview and quick start |
| Development | docs/development/plan.md | Current status and development roadmap |
| Testing & Evaluation | evaluation/README.md | Complete evaluation system guide |
| Documentation Guide | docs/README.md | Navigation guide for all documentation |
| System Organization | docs/operations/EVALUATION_ORGANIZATION_SUMMARY.md | File structure and organization changes |
- Users & Stakeholders: Start with this README.md
- Developers: See docs/development/ for development guides
- QA & Testing: Check evaluation/ for testing framework
- Operations: Review docs/operations/ for operational guides
This project is licensed under the MIT License - see the LICENSE file for details.
Built with ❤️ for financial compliance and risk management