A large model fine-tuning dataset generation and management tool that enables one-click crawling of links from specified domains, supports converting links into large model-friendly Markdown files, and supports converting Markdown files into datasets suitable for training large models through ChatGPT, Deepseek, Gemma, and other large models.
- Support for deep crawling of all links from specified domains
- Support for converting links into large model-friendly Markdown files
- Support for uploading .md, .txt, .pdf, .docx, .doc and other files, with automatic conversion to .md files
- Support for intelligent algorithm-based segmentation of Markdown files
- Support for converting Markdown files into datasets suitable for training large models through DeepSeek, ChatGPT, Gemma, and other large models
- Support for custom addition, editing, and modification of dataset data
- Support for exporting in JSONL and JSON formats, with Alpaca, ShareGPT, and custom formats
- Support for previewing conversion results
Project Management |
Link Management |
Md File Conversion |
File Management |
File to Dataset Conversion |
Data Management |
System Settings |
- Backend dependencies:
# Recommended Python version is python=3.10
# Create virtual environment
python -m venv venv
# Activate environment (in PowerShell)
.\venv\Scripts\Activate.ps1
# Install all dependencies from requirements.txt
pip install -r requirements.txt- Frontend dependencies:
cd frontend
npm install- Start the backend server:
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000 --ws websockets- Start the frontend development server:
cd frontend
npm run dev- Access in browser:
http://localhost:3000
├── app/ # Backend application directory
│ ├── api/ # API interface directory
│ │ ├── crawler.py # Crawler API
│ │ ├── system.py # System API
│ │ ├── files.py # File operations API
│ │ ├── dataset.py # Dataset API
│ │ └── __init__.py # Initialization file
│ ├── core/ # Core functionality
│ │ └── config.py # Configuration file
│ ├── schemas/ # Data schemas
│ │ ├── crawler.py # Crawler schema
│ │ ├── system.py # System schema
│ │ ├── files.py # Files schema
│ │ └── dataset.py # Dataset schema
│ ├── services/ # Service layer
│ │ ├── crawler_service.py # Crawler service
│ │ ├── crawler_engine_service.py # Crawler engine service
│ │ ├── notification_service.py # Notification service
│ │ ├── system_service.py # System service
│ │ ├── files_service.py # Files service
│ │ ├── project_service.py # Project service
│ │ └── dataset_service.py # Dataset service
│ ├── utils/ # Utility functions
│ ├── __init__.py # Initialization file
│ └── main.py # Main program entry
├── frontend/ # Frontend directory
│ ├── src/ # Source code
│ │ ├── assets/ # Static resources
│ │ ├── components/ # Components directory
│ │ ├── services/ # Services
│ │ │ ├── crawler.js # Crawler service
│ │ │ └── request.js # Request service
│ │ ├── views/ # Views
│ │ │ ├── LinkManager.vue # Link management page
│ │ │ └── ... # Other view pages
│ │ ├── App.vue # Main application component
│ │ └── main.js # Entry file
│ ├── index.html # HTML entry
│ ├── package.json # Dependency configuration
│ ├── vite.config.js # Vite configuration
│ ├── vue.config.js # Vue configuration
│ ├── .env # Environment variables
│ ├── .env.production # Production environment variables
│ └── .prettierrc # Code formatting configuration
├── config/ # Configuration file directory
├── export/ # Export directory
│ ├── alpaca/ # Alpaca format export
│ ├── sharegpt/ # ShareGPT format export
│ └── custom/ # Custom format export
├── logs/ # Log directory
├── output/ # Output directory
│ ├── crawled_urls.json # Crawled URL list (JSON format)
│ ├── crawler_status.json # Crawler status information
│ ├── markdown/ # Converted markdown files
│ └── markdown_manager.json # Markdown file management information
├── upload/ # Upload file directory
├── .gitignore # Git ignore file configuration
├── README.md # English documentation
├── README.zh-CN.md # Chinese documentation
└── requirements.txt # Python dependencies file
-
If you encounter the following errors during execution: No module named 'markitdown' No module named 'onnxruntime'
You can try installing globally:
pip install 'markitdown[all]' pip install onnxruntime
We welcome community users to participate in contributing! If you have suggestions, bugs, or new feature requests, please submit them through Issues, or directly submit a Pull Request.
- 🐛 Bug Fixes: Discover and fix system defects
- ✨ New Features: Propose and implement new features
- 📚 Documentation Improvements: Enhance project documentation
- 🧪 Test Cases: Write unit tests and integration tests
- 🎨 UI/UX Optimization: Improve user interface and experience
- Fork the project to your GitHub account
- Create a feature branch
git checkout -b feature/amazing-feature - Commit changes
git commit -m 'Add amazing feature' - Push the branch
git push origin feature/amazing-feature - Create a Pull Request and describe the changes in detail
- Follow the original code style
Use Conventional Commits standards:
feat: add batch document upload functionality
fix: fix vector search accuracy issues
docs: update API documentation
test: add search engine test cases
refactor: refactor document parsing module
This project is released under the MIT license. You are free to use, modify, and distribute the code of this project, but you must retain the original copyright notice.






