
A Powerful Large Model Fine-tuning Dataset Generation and Management Tool

简体中文 | English

Dataset Generation and Large Model Fine-tuning Tool

A tool for generating and managing large-model fine-tuning datasets: crawl all links from a specified domain with one click, convert the crawled pages into model-friendly Markdown files, and turn those Markdown files into training datasets through large models such as ChatGPT, DeepSeek, and Gemma.

Features

  • Support for deep crawling of all links from specified domains
  • Support for converting links into large model-friendly Markdown files
  • Support for uploading .md, .txt, .pdf, .docx, .doc and other files, with automatic conversion to .md files
  • Support for intelligent algorithm-based segmentation of Markdown files
  • Support for converting Markdown files into datasets suitable for training large models through DeepSeek, ChatGPT, Gemma, and other large models
  • Support for manually adding, editing, and modifying dataset entries
  • Support for exporting datasets as JSONL or JSON, in Alpaca, ShareGPT, or custom formats
  • Support for previewing conversion results
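The "intelligent algorithm-based segmentation" step can be pictured with a minimal sketch. This is an illustration only, not the tool's actual algorithm: it simply splits a Markdown document at heading boundaries and packs the resulting blocks into chunks under a size limit.

```python
import re

def split_markdown(text: str, max_chars: int = 1000) -> list[str]:
    """Split Markdown into chunks at heading boundaries.

    Illustrative sketch only; the tool's real segmentation
    logic may differ.
    """
    # Break the document into blocks, each starting at an H1-H3 heading.
    blocks = re.split(r"(?m)^(?=#{1,3} )", text)
    chunks: list[str] = []
    current = ""
    for block in blocks:
        if not block:
            continue
        # Start a new chunk when adding this block would exceed the limit.
        if current and len(current) + len(block) > max_chars:
            chunks.append(current)
            current = block
        else:
            current += block
    if current:
        chunks.append(current)
    return chunks

doc = "# A\nalpha\n\n## B\nbeta\n\n# C\ngamma\n"
for chunk in split_markdown(doc, max_chars=20):
    print(repr(chunk))
```

Keeping each chunk under a size limit matters because the downstream model call that generates Q&A pairs has a bounded context window.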

Feature Screenshots

  • Project Management
  • Link Management
  • Md File Conversion
  • File Management
  • File to Dataset Conversion
  • Data Management
  • System Settings

Quick Start

Install Dependencies

  1. Backend dependencies:

```shell
# Recommended Python version: 3.10
# Create a virtual environment
python -m venv venv
# Activate the environment (in PowerShell)
.\venv\Scripts\Activate.ps1
# Install all dependencies from requirements.txt
pip install -r requirements.txt
```

  2. Frontend dependencies:

```shell
cd frontend
npm install
```

Run the Project

  1. Start the backend server:

```shell
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000 --ws websockets
```

  2. Start the frontend development server:

```shell
cd frontend
npm run dev
```

  3. Open http://localhost:3000 in your browser.

Project Structure

```
├── app/                    # Backend application directory
│   ├── api/                # API interface directory
│   │   ├── crawler.py      # Crawler API
│   │   ├── system.py       # System API
│   │   ├── files.py        # File operations API
│   │   ├── dataset.py      # Dataset API
│   │   └── __init__.py     # Initialization file
│   ├── core/               # Core functionality
│   │   └── config.py       # Configuration file
│   ├── schemas/            # Data schemas
│   │   ├── crawler.py      # Crawler schema
│   │   ├── system.py       # System schema
│   │   ├── files.py        # Files schema
│   │   └── dataset.py      # Dataset schema
│   ├── services/           # Service layer
│   │   ├── crawler_service.py        # Crawler service
│   │   ├── crawler_engine_service.py # Crawler engine service
│   │   ├── notification_service.py   # Notification service
│   │   ├── system_service.py         # System service
│   │   ├── files_service.py          # Files service
│   │   ├── project_service.py        # Project service
│   │   └── dataset_service.py        # Dataset service
│   ├── utils/              # Utility functions
│   ├── __init__.py         # Initialization file
│   └── main.py             # Main program entry
├── frontend/               # Frontend directory
│   ├── src/                # Source code
│   │   ├── assets/         # Static resources
│   │   ├── components/     # Components directory
│   │   ├── services/       # Services
│   │   │   ├── crawler.js  # Crawler service
│   │   │   └── request.js  # Request service
│   │   ├── views/          # Views
│   │   │   ├── LinkManager.vue   # Link management page
│   │   │   └── ...         # Other view pages
│   │   ├── App.vue         # Main application component
│   │   └── main.js         # Entry file
│   ├── index.html          # HTML entry
│   ├── package.json        # Dependency configuration
│   ├── vite.config.js      # Vite configuration
│   ├── vue.config.js       # Vue configuration
│   ├── .env                # Environment variables
│   ├── .env.production     # Production environment variables
│   └── .prettierrc         # Code formatting configuration
├── config/                 # Configuration file directory
├── export/                 # Export directory
│   ├── alpaca/             # Alpaca format export
│   ├── sharegpt/           # ShareGPT format export
│   └── custom/             # Custom format export
├── logs/                   # Log directory
├── output/                 # Output directory
│   ├── crawled_urls.json   # Crawled URL list (JSON format)
│   ├── crawler_status.json # Crawler status information
│   ├── markdown/           # Converted markdown files
│   └── markdown_manager.json # Markdown file management information
├── upload/                 # Upload file directory
├── .gitignore              # Git ignore file configuration
├── README.md               # English documentation
├── README.zh-CN.md         # Chinese documentation
└── requirements.txt        # Python dependencies file
```
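For reference, a single training record in the two named export formats looks roughly like the sketch below. The field names follow the widely used Alpaca (`instruction`/`input`/`output`) and ShareGPT (`conversations` with `from`/`value`) conventions; the example text is invented, and the files this tool actually writes to `export/` are the source of truth for its exact output shape.

```python
import json

# One Q&A pair in the common Alpaca format.
alpaca_record = {
    "instruction": "Summarize the following documentation section.",
    "input": "Uvicorn is an ASGI web server implementation for Python...",
    "output": "Uvicorn is a fast ASGI server used to run Python web apps.",
}

# The same pair in the common ShareGPT (conversation) format.
sharegpt_record = {
    "conversations": [
        {"from": "human", "value": "Summarize the following documentation section."},
        {"from": "gpt", "value": "Uvicorn is a fast ASGI server used to run Python web apps."},
    ]
}

# JSONL export means one JSON object per line.
jsonl_lines = [json.dumps(r, ensure_ascii=False) for r in (alpaca_record,)]
print(jsonl_lines[0])
```

Each line of a JSONL export parses back into exactly one record, which is why the format is convenient for streaming large training sets.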

Troubleshooting

  1. If you encounter errors such as No module named 'markitdown' or No module named 'onnxruntime' during execution, try installing the packages globally:

```shell
pip install 'markitdown[all]'
pip install onnxruntime
```

🤝 Contributing Guide

Community contributions are welcome! If you have suggestions, bug reports, or feature requests, please open an Issue, or submit a Pull Request directly.

🎯 Ways to Contribute

  • 🐛 Bug Fixes: Discover and fix system defects
  • New Features: Propose and implement new features
  • 📚 Documentation Improvements: Enhance project documentation
  • 🧪 Test Cases: Write unit tests and integration tests
  • 🎨 UI/UX Optimization: Improve user interface and experience

📋 Contribution Process

  1. Fork the project to your GitHub account
  2. Create a feature branch: `git checkout -b feature/amazing-feature`
  3. Commit your changes: `git commit -m 'Add amazing feature'`
  4. Push the branch: `git push origin feature/amazing-feature`
  5. Open a Pull Request and describe the changes in detail

🎨 Code Standards

  • Follow the original code style

📝 Commit Standards

Use Conventional Commits standards:

```
feat: add batch document upload functionality
fix: fix vector search accuracy issues
docs: update API documentation
test: add search engine test cases
refactor: refactor document parsing module
```

📄 License

This project is released under the MIT license. You are free to use, modify, and distribute the code of this project, but you must retain the original copyright notice.