Utopialvo/EDAauto

EDAauto

Automatic Exploratory Data Analysis Framework

A Python library for automating the exploratory data analysis (EDA) process. It provides tools for profiling, statistical testing, clustering, distribution fitting, outlier detection, and feature engineering.

Features

  • Dataset Profiling: Basic information, missing values, data types
  • Statistical Hypothesis Testing: Automated hypothesis generation and testing
  • Clustering: Automatic clustering with K-means and DBSCAN
  • Distribution Fitting: Fit probability distributions to numerical data
  • Outlier Detection: Multiple methods (IQR, Z-score, Isolation Forest)
  • Feature Engineering: Time-series features, transformations, interactions
  • Comprehensive Analysis: Unified interface for complete EDA workflow

Installation

pip install -r requirements.txt
python setup.py install

Quick Start

from autoeda import AutoEDA
import pandas as pd

# Initialize AutoEDA
eda = AutoEDA(random_state=42)

# Load your data
df = pd.read_csv('your_data.csv')

# Basic dataset info
info = eda.dataset_info(df)

# Automatic hypothesis testing
hypotheses = eda.suggest_and_test_hypotheses(df)

# Clustering analysis
clusters = eda.auto_cluster(df, method='kmeans')

# Time-series features
df_with_ts = eda.time_series_features(df, 'datetime_column')

# Comprehensive analysis
results = eda.comprehensive_analysis(df, 'datetime_col', 'target_col')

Project Structure

autoeda/
├── core.py              # Main AutoEDA class
├── analysis/            # Statistical analysis modules
│   ├── clustering.py
│   ├── distributions.py
│   └── hypothesis.py
├── preprocessing/       # Data preprocessing
│   ├── feature_engineering.py
│   └── outliers.py
├── utils/               # Utility functions
│   └── time_series.py
└── example_usage.py     # Usage examples

Key Modules

Core (AutoEDA class)

  • Unified interface for all EDA functionality
  • Dataset information and profiling
  • Comprehensive analysis pipeline
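
To make "dataset information and profiling" concrete, here is a minimal sketch of the kind of summary such a step typically produces, written with plain pandas. The helper `profile_dataframe` and its output keys are hypothetical and are not taken from the AutoEDA source; the actual return value of `dataset_info` may differ.

```python
import pandas as pd


def profile_dataframe(df: pd.DataFrame) -> dict:
    """Hypothetical helper: summarize shape, dtypes, and missing values."""
    missing = df.isna().sum()  # per-column count of missing cells
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_count": missing.to_dict(),
        # Guard against division by zero on an empty frame.
        "missing_fraction": (missing / max(len(df), 1)).to_dict(),
    }


# Tiny illustrative frame with one missing value per column.
example = pd.DataFrame({"age": [31, None, 45], "city": ["Oslo", "Lima", None]})
summary = profile_dataframe(example)
print(summary["n_rows"], summary["missing_count"])
```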

Analysis

  • Clustering: Automatic grouping of numerical data
  • Distributions: Fit and compare probability distributions
  • Hypothesis Testing: Automated statistical testing
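
As an illustration of what an automated statistical test can look like, the sketch below compares the mean of a numeric column across two groups using Welch's t-test from SciPy. This is one possible test chosen for the example, not necessarily one that `hypothesis.py` runs; the helper `compare_group_means`, the column names, and the synthetic data are all assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats


def compare_group_means(df, value_col, group_col, alpha=0.05):
    """Hypothetical helper: Welch's t-test between the two groups in group_col."""
    groups = [g[value_col].dropna() for _, g in df.groupby(group_col)]
    if len(groups) != 2:
        raise ValueError("expected exactly two groups")
    # equal_var=False selects Welch's variant, which does not assume equal variances.
    stat, p_value = stats.ttest_ind(groups[0], groups[1], equal_var=False)
    return {
        "statistic": float(stat),
        "p_value": float(p_value),
        "reject_null": bool(p_value < alpha),
    }


# Synthetic data: group "b" is shifted by three standard deviations.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "group": ["a"] * 50 + ["b"] * 50,
    "value": np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)]),
})
result = compare_group_means(demo, "value", "group")
print(result["reject_null"])
```

A real pipeline that tests many column pairs would usually also correct for multiple comparisons; that step is left out here for brevity.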

Preprocessing

  • Outlier Detection: Identify and handle anomalies
  • Feature Engineering: Create new features automatically
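
Of the outlier methods listed under Features, the IQR rule is the simplest to show. The sketch below flags values more than `k` interquartile ranges outside the quartiles. The function name `iqr_outlier_mask` and the default `k=1.5` are assumptions for illustration and may not match the signatures in `outliers.py`.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Hypothetical helper: True where a value lies outside the IQR fences."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


readings = pd.Series([1, 2, 3, 4, 100])
mask = iqr_outlier_mask(readings)
outliers = readings[mask]  # keeps only the flagged values
print(outliers.tolist())
```

Returning a boolean mask rather than a filtered frame leaves the choice of handling (drop, cap, or inspect) to the caller.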

Utils

  • Time Series: Extract datetime features (seasonality, trends, etc.)
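
Datetime feature extraction usually starts with calendar components, which can be sketched with the pandas `.dt` accessor as follows. The helper `add_datetime_features` and the generated column names are hypothetical; the columns that `time_series_features` actually produces may differ, and this sketch covers only calendar features, not trend or seasonality estimates.

```python
import pandas as pd


def add_datetime_features(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical helper: append calendar features derived from `column`."""
    out = df.copy()  # leave the caller's frame untouched
    ts = pd.to_datetime(out[column])
    out[f"{column}_year"] = ts.dt.year
    out[f"{column}_month"] = ts.dt.month
    out[f"{column}_dayofweek"] = ts.dt.dayofweek  # Monday=0, Sunday=6
    out[f"{column}_hour"] = ts.dt.hour
    out[f"{column}_is_weekend"] = ts.dt.dayofweek >= 5
    return out


# 2024-01-06 is a Saturday, 2024-01-08 a Monday.
events = pd.DataFrame({"when": ["2024-01-06 09:30", "2024-01-08 18:00"]})
features = add_datetime_features(events, "when")
print(features[["when_dayofweek", "when_is_weekend"]])
```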
