
Chainbase Open Sourced Theia-Llama-3.1-8B

In August we released TheiaChat, an alpha-version chatbot designed to showcase the basic capabilities of Theia, and in that blog post we introduced what Theia is and why we built it. Recently, we open-sourced its first version, Theia-Llama-3.1-8B, which is trained on a carefully designed dataset from the crypto field.

Technical Implementation

Crypto-Oriented Dataset

The training dataset draws from two primary sources to create a comprehensive representation of blockchain projects. The first is CoinMarketCap, focusing on the top 2000 projects by market capitalization. This includes project-specific documents like whitepapers, official blog posts, and news articles. The second source comprises detailed research reports on these projects from credible internet sources, offering in-depth insights into project fundamentals, development progress, and market impact. After compilation, the dataset undergoes both manual and algorithmic filtering to ensure accuracy and eliminate redundancy.
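As a rough illustration of the algorithmic filtering step, the sketch below deduplicates near-identical documents with a simple normalized-hash pass. The field names and toy corpus are hypothetical, not our exact pipeline.

```python
import hashlib
import re

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies hash the same.
    return re.sub(r"\s+", " ", text.lower()).strip()

def dedupe(documents: list[dict]) -> list[dict]:
    # Keep only the first occurrence of each normalized document body.
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = [
    {"source": "whitepaper", "text": "Project X is a layer-2 rollup..."},
    {"source": "blog",       "text": "Project X is a Layer-2  rollup..."},  # near-duplicate
]
print(len(dedupe(corpus)))  # -> 1
```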

Model Fine-tuning and Quantization

Theia-Llama-3.1-8B is fine-tuned from the base model Llama-3.1-8B-Instruct and tailored to the cryptocurrency domain. We used LoRA (Low-Rank Adaptation) to fine-tune the model efficiently, adapting the large pre-trained model to domain-specific tasks with far less compute. The training process was built on LLaMA Factory, an open-source fine-tuning framework, and DeepSpeed, Microsoft's distributed training engine, which together optimized resource usage and training efficiency. We employed techniques such as ZeRO (Zero Redundancy Optimizer), offloading, sparse attention, 1-bit Adam, and pipeline parallelism to speed up training and reduce memory consumption. We also built a fine-tuned model using D-DoRA, a novel decentralized training scheme developed by Chainbase Labs; we are releasing the LoRA version first because it is easier for developers in the Crypto AI community to deploy and experiment with.
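In practice, LLaMA Factory drives this through a training configuration file, but the core LoRA setup can be sketched with Hugging Face's peft library as below. The rank, alpha, dropout, and target modules shown are illustrative assumptions, not our exact training hyperparameters.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# Low-rank adapters are injected into the attention projections; only these
# small matrices are trained, so the 8B base weights stay frozen.
lora_config = LoraConfig(
    r=16,                      # assumed rank of the low-rank update
    lora_alpha=32,             # assumed scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```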

In addition to fine-tuning, we've quantized the model for efficient deployment, specifically into the Q8_0 GGUF format (Theia-Llama-3.1-8B-Q8_0.gguf). Model quantization reduces the precision of the model's weights from floating point (typically FP16 or FP32) to lower-bit representations, in this case 8-bit integers (Q8). The main advantage is that it significantly shrinks the model's memory footprint and speeds up inference while maintaining acceptable accuracy, making the model more accessible in resource-constrained environments such as edge devices or lower-tier GPUs.
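A typical way to run the quantized artifact locally is through llama.cpp bindings. The sketch below uses the llama-cpp-python package and assumes the GGUF file has been downloaded to the current directory; context size and GPU offload are settings you would tune for your own hardware.

```python
from llama_cpp import Llama

# Q8_0 stores weights as 8-bit integers, so an 8B model fits in roughly
# half the memory required at FP16.
llm = Llama(
    model_path="Theia-Llama-3.1-8B-Q8_0.gguf",
    n_ctx=4096,        # context window; adjust to your hardware
    n_gpu_layers=-1,   # offload all layers to GPU if one is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is EIP-4844 in one sentence?"}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```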

Benchmark

To evaluate current LLMs in the crypto domain, we've proposed a benchmark for Crypto AI Models—the first of its kind tailored specifically for this field. The models are evaluated across seven dimensions, including crypto knowledge comprehension and generation, knowledge coverage, and reasoning capabilities. A detailed paper elaborating on this benchmark will follow.

We're initially releasing results for the crypto-domain understanding and generation capabilities of 11 LLMs, both open-source and closed-source, from OpenAI, Anthropic, Google, Meta, Mistral, Qwen, and DeepSeek. For open-source LLMs, we chose models with parameter sizes similar to ours (~8B); for closed-source LLMs, we selected popular models with the largest user bases.

| Model | Perplexity ↓ | BERT ↑ |
| --- | --- | --- |
| Theia-Llama-3.1-8B | 1.184 | 0.861 |
| ChatGPT-4o | 1.256 | 0.837 |
| ChatGPT-4o-mini | 1.257 | 0.794 |
| ChatGPT-3.5-turbo | 1.233 | 0.838 |
| Claude-3-sonnet (~70b) | N.A. | 0.848 |
| Gemini-1.5-Pro | N.A. | 0.830 |
| Gemini-1.5-Flash | N.A. | 0.828 |
| Llama-3.1-8B-Instruct | 1.270 | 0.835 |
| Mistral-7B-Instruct-v0.3 | 1.258 | 0.844 |
| Qwen2.5-7B-Instruct | 1.392 | 0.832 |
| Gemma-2-9b | 1.248 | 0.832 |
| Deepseek-llm-7b-chat | 1.348 | 0.846 |

Lower perplexity indicates better performance, and a higher BERT score indicates better performance. On both metrics, Theia-Llama-3.1-8B outperforms the mainstream models currently available on the market.
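For reference, the two metrics can be computed along the following lines: perplexity as the exponential of the average token-level cross-entropy loss, and BERT as the BERTScore F1 between a model's answer and a reference answer. The snippet below uses the Hugging Face transformers and bert_score packages on a toy example; it sketches the metric definitions, not our benchmark harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from bert_score import score as bert_score

def perplexity(model_name: str, text: str) -> float:
    # Perplexity = exp(mean negative log-likelihood of the tokens).
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# BERTScore compares a candidate answer to a reference using contextual embeddings.
candidate = ["Restaking lets staked ETH secure additional protocols."]
reference = ["Restaking reuses staked ETH to provide security to other protocols."]
_, _, f1 = bert_score(candidate, reference, lang="en")
print(f1.mean().item())
```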

Next, we will train larger models and evaluate them across more dimensions. If you are interested in the technical details, please stay tuned.

About Chainbase

Chainbase is the world's largest omnichain data network, designed to integrate all blockchain data into a unified ecosystem and provide an open, transparent data interoperability layer for the AI era. Its novel dual-chain architecture bridges the programmability and composability of crypto data, supporting high throughput, low latency, and eventual determinism, as well as stronger security through a dual-staking model.

Chainbase has attracted more than 15,000 developers and data scientists, and the 8,000 applications built on it form a vibrant ecosystem. Anyone can freely access and train data models to integrate predictions, analytics, and insights into their applications.
