
A Powerful Large Model Fine-tuning Dataset Generation and Management Tool

简体中文 | English

Dataset Generation and Large Model Fine-tuning Tool

A tool for generating and managing large-model fine-tuning datasets: crawl all links from a specified domain with one click, convert the crawled pages into model-friendly Markdown files, and turn those Markdown files into training datasets through large models such as ChatGPT, DeepSeek, and Gemma.

Features

  • Support for deep crawling of all links from specified domains
  • Support for converting links into large model-friendly Markdown files
  • Support for uploading .md, .txt, .pdf, .docx, .doc and other files, with automatic conversion to .md files
  • Support for intelligent algorithm-based segmentation of Markdown files
  • Support for converting Markdown files into datasets suitable for training large models through DeepSeek, ChatGPT, Gemma, and other large models
  • Support for manually adding, editing, and modifying dataset entries
  • Support for exporting datasets as JSONL or JSON, in Alpaca, ShareGPT, or custom formats
  • Support for previewing conversion results
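The "intelligent algorithm-based segmentation" step can be pictured with a minimal sketch. This is an illustration only, not the tool's actual algorithm: it simply splits a Markdown document at heading boundaries and packs the resulting blocks into chunks under a size limit.

```python
import re

def split_markdown(text: str, max_chars: int = 1000) -> list[str]:
    """Split Markdown into chunks at heading boundaries.

    Illustrative sketch only; the tool's real segmentation
    logic may differ.
    """
    # Break the document into blocks, each starting at an H1-H3 heading.
    blocks = re.split(r"(?m)^(?=#{1,3} )", text)
    chunks: list[str] = []
    current = ""
    for block in blocks:
        if not block:
            continue
        # Start a new chunk when adding this block would exceed the limit.
        if current and len(current) + len(block) > max_chars:
            chunks.append(current)
            current = block
        else:
            current += block
    if current:
        chunks.append(current)
    return chunks

doc = "# A\nalpha\n\n## B\nbeta\n\n# C\ngamma\n"
for chunk in split_markdown(doc, max_chars=20):
    print(repr(chunk))
```

Keeping each chunk under a size limit matters because the downstream model call that generates Q&A pairs has a bounded context window.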

Feature Screenshots

  • Project Management
  • Link Management
  • Md File Conversion
  • File Management
  • File to Dataset Conversion
  • Data Management
  • System Settings

Quick Start

Install Dependencies

  1. Backend dependencies:

```shell
# Recommended Python version: 3.10
# Create a virtual environment
python -m venv venv
# Activate the environment (in PowerShell)
.\venv\Scripts\Activate.ps1
# Install all dependencies from requirements.txt
pip install -r requirements.txt
```

  2. Frontend dependencies:

```shell
cd frontend
npm install
```

Run the Project

  1. Start the backend server:

```shell
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000 --ws websockets
```

  2. Start the frontend development server:

```shell
cd frontend
npm run dev
```

  3. Open http://localhost:3000 in your browser.

Project Structure

```
├── app/                    # Backend application directory
│   ├── api/                # API interface directory
│   │   ├── crawler.py      # Crawler API
│   │   ├── system.py       # System API
│   │   ├── files.py        # File operations API
│   │   ├── dataset.py      # Dataset API
│   │   └── __init__.py     # Initialization file
│   ├── core/               # Core functionality
│   │   └── config.py       # Configuration file
│   ├── schemas/            # Data schemas
│   │   ├── crawler.py      # Crawler schema
│   │   ├── system.py       # System schema
│   │   ├── files.py        # Files schema
│   │   └── dataset.py      # Dataset schema
│   ├── services/           # Service layer
│   │   ├── crawler_service.py        # Crawler service
│   │   ├── crawler_engine_service.py # Crawler engine service
│   │   ├── notification_service.py   # Notification service
│   │   ├── system_service.py         # System service
│   │   ├── files_service.py          # Files service
│   │   ├── project_service.py        # Project service
│   │   └── dataset_service.py        # Dataset service
│   ├── utils/              # Utility functions
│   ├── __init__.py         # Initialization file
│   └── main.py             # Main program entry
├── frontend/               # Frontend directory
│   ├── src/                # Source code
│   │   ├── assets/         # Static resources
│   │   ├── components/     # Components directory
│   │   ├── services/       # Services
│   │   │   ├── crawler.js  # Crawler service
│   │   │   └── request.js  # Request service
│   │   ├── views/          # Views
│   │   │   ├── LinkManager.vue   # Link management page
│   │   │   └── ...         # Other view pages
│   │   ├── App.vue         # Main application component
│   │   └── main.js         # Entry file
│   ├── index.html          # HTML entry
│   ├── package.json        # Dependency configuration
│   ├── vite.config.js      # Vite configuration
│   ├── vue.config.js       # Vue configuration
│   ├── .env                # Environment variables
│   ├── .env.production     # Production environment variables
│   └── .prettierrc         # Code formatting configuration
├── config/                 # Configuration file directory
├── export/                 # Export directory
│   ├── alpaca/             # Alpaca format export
│   ├── sharegpt/           # ShareGPT format export
│   └── custom/             # Custom format export
├── logs/                   # Log directory
├── output/                 # Output directory
│   ├── crawled_urls.json   # Crawled URL list (JSON format)
│   ├── crawler_status.json # Crawler status information
│   ├── markdown/           # Converted markdown files
│   └── markdown_manager.json # Markdown file management information
├── upload/                 # Upload file directory
├── .gitignore              # Git ignore file configuration
├── README.md               # English documentation
├── README.zh-CN.md         # Chinese documentation
└── requirements.txt        # Python dependencies file
```
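For reference, a single training record in the two named export formats looks roughly like the sketch below. The field names follow the widely used Alpaca (`instruction`/`input`/`output`) and ShareGPT (`conversations` with `from`/`value`) conventions; the example text is invented, and the files this tool actually writes to `export/` are the source of truth for its exact output shape.

```python
import json

# One Q&A pair in the common Alpaca format.
alpaca_record = {
    "instruction": "Summarize the following documentation section.",
    "input": "Uvicorn is an ASGI web server implementation for Python...",
    "output": "Uvicorn is a fast ASGI server used to run Python web apps.",
}

# The same pair in the common ShareGPT (conversation) format.
sharegpt_record = {
    "conversations": [
        {"from": "human", "value": "Summarize the following documentation section."},
        {"from": "gpt", "value": "Uvicorn is a fast ASGI server used to run Python web apps."},
    ]
}

# JSONL export means one JSON object per line.
jsonl_lines = [json.dumps(r, ensure_ascii=False) for r in (alpaca_record,)]
print(jsonl_lines[0])
```

Each line of a JSONL export parses back into exactly one record, which is why the format is convenient for streaming large training sets.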

Troubleshooting

  1. If you encounter errors such as No module named 'markitdown' or No module named 'onnxruntime' during execution, try installing the packages globally:

```shell
pip install 'markitdown[all]'
pip install onnxruntime
```

🤝 Contributing Guide

Community contributions are welcome! If you have suggestions, bug reports, or feature requests, please open an Issue, or submit a Pull Request directly.

🎯 Ways to Contribute

  • 🐛 Bug Fixes: Discover and fix system defects
  • New Features: Propose and implement new features
  • 📚 Documentation Improvements: Enhance project documentation
  • 🧪 Test Cases: Write unit tests and integration tests
  • 🎨 UI/UX Optimization: Improve user interface and experience

📋 Contribution Process

  1. Fork the project to your GitHub account
  2. Create a feature branch: `git checkout -b feature/amazing-feature`
  3. Commit your changes: `git commit -m 'Add amazing feature'`
  4. Push the branch: `git push origin feature/amazing-feature`
  5. Open a Pull Request and describe the changes in detail

🎨 Code Standards

  • Follow the original code style

📝 Commit Standards

Use Conventional Commits standards:

```
feat: add batch document upload functionality
fix: fix vector search accuracy issues
docs: update API documentation
test: add search engine test cases
refactor: refactor document parsing module
```

📄 License

This project is released under the MIT license. You are free to use, modify, and distribute the code of this project, but you must retain the original copyright notice.