All Dataset Engineering Listings
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
tiktoken is a fast BPE tokeniser for use with OpenAI's models.
Making data higher-quality, juicier, and more digestible for foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
A system for quickly generating training data with weak supervision
TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP. Documentation: https://textattack.readthedocs.io/en/master/
Crawl a site to generate knowledge files to create your own custom GPT from a URL
800,000 step-level correctness labels on LLM solutions to MATH problems
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
The open-source platform for training advanced AI models and image diffusion.