Introducing Giza Datasets
From Dark Forest to Open Garden
Jan 31, 2024
Fran Algaba
Co-founder & CEO
Web3 Data for ML: From Dark Forest to An Open Garden
The advent of Web3 brought with it the promise of infrastructural transparency and permissionless access. That promise extends beyond the consensus layer and the smart contracts themselves: it also covers the vast data landscape they harbor. Blockchain data is theoretically accessible at any time and reflects the entire state of the chain. In practice, however, this wealth of data is not readily available to anyone outside the small circle of specialists in a given ecosystem. Storing the complete history of a blockchain locally demands not just specialized knowledge but also significant hardware resources, often dedicated solely to that purpose.
The industry has produced several solutions to the problem of accessing blockchain data. RPC providers offer API access to blockchain nodes, and indexing services enable data extraction with SQL and GraphQL. Both have been instrumental, but both come with limitations. RPC services were not designed for the large-scale data queries that computation-heavy applications require, and they often fall short for more intensive use cases. Indexing services offer a more structured approach to data retrieval, but the complexity inherent in Web3 protocols makes constructing effective queries a formidable task, sometimes requiring hundreds or even thousands of lines of code. This intricacy is a significant barrier, particularly for general data practitioners without deep Web3 expertise. Together, these limitations point to a pressing need for more accessible, user-friendly ways of extracting and using blockchain data, which would open the way to broader application and innovation in the field.
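To make the RPC limitation concrete, here is a minimal sketch of what pulling raw blocks over JSON-RPC involves. It uses the standard Ethereum `eth_getBlockByNumber` method; the helper name `block_requests` and the block range are hypothetical and chosen only for illustration. No request is actually sent, the sketch only builds the request bodies.

```python
import json


def block_requests(start_block, end_block, full_transactions=True):
    """Build one JSON-RPC request body per block in [start_block, end_block]."""
    requests = []
    for request_id, number in enumerate(range(start_block, end_block + 1), start=1):
        requests.append({
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "eth_getBlockByNumber",
            # Block numbers are hex-encoded; the flag asks for full transaction objects.
            "params": [hex(number), full_transactions],
        })
    return requests


# Hypothetical range of 1,000 consecutive blocks.
payloads = block_requests(19_000_000, 19_000_999)
print(len(payloads), "requests needed")
print(json.dumps(payloads[0]))
```

Even this small range needs one request per block, and each response still has to be decoded and flattened before it resembles a training table. That per-block granularity is the reason RPC access alone scales poorly for ML workloads.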
Our vision is to foster a specialized ecosystem dedicated to refining the processes and standards of data collection. This ecosystem is designed to enhance the quality and efficiency of downstream machine learning applications. In such an environment, data scientists and machine learning practitioners could contribute to the development of areas like DeFi protocols without needing an exhaustive understanding of blockchain state changes or data storage intricacies.
Introducing Giza Datasets
Toward the vision of an open and thriving zero-knowledge machine learning (ZKML) ecosystem, we are excited to introduce Giza Datasets, an initiative that marks a foundational step toward an AI-enabled Web3. This launch addresses the data-related challenges of integrating machine learning (ML) with blockchains described above. With Giza Datasets, we aim to lay the groundwork for a more accessible and effective use of blockchain data in ML applications.
More broadly, we want to simplify the entire machine learning process within Web3, so that data scientists and machine learning engineers can engage meaningfully with Web3 and its data ecosystem. They will be able to access data, train models, evaluate their performance, and deploy models that are verifiable on-chain, all through a unified platform. This framework is intended to abstract away much of the complexity of blockchain technology and cryptography, making the process straightforward and accessible.
The Foundation for ZKML Adoption
Giza Datasets responds to the hurdles in data accessibility and quality within the blockchain space and is meant to catalyze open innovation. Our aim is to foster an ecosystem centered on ML-oriented datasets derived from blockchain data. The initiative begins with the open-sourcing of several key datasets, chosen to make it easier for the community to start building ML-based solutions for blockchain.
By providing these foundational datasets, we are not only addressing the immediate needs of data accessibility but also catalyzing the growth of ML applications within the blockchain domain. The open-sourcing of these datasets represents our commitment to breaking down barriers and enabling a more inclusive environment for innovation in ZKML solutions.
Growing the Ecosystem of Datasets
Giza Datasets is envisioned as an evolving and expanding ecosystem. Over time, we plan to enrich this ecosystem by continuously providing more curated data to the community. This growth is aimed at democratizing access to high-quality, relevant datasets, ensuring that developers, researchers, and enthusiasts in the ML space have the resources they need to build effective and innovative solutions for the blockchain.
The expansion of Giza Datasets will be guided by the needs, feedback and eventual contributions of the community, ensuring that the data provided is not only diverse but also directly aligned with the evolving challenges and opportunities in ZKML development.
Getting Started with Giza Datasets
Giza Datasets leverages the Polars library, renowned for its efficient handling of large datasets. Polars offers a robust, expressive syntax and high-performance computations, making it an ideal choice for dealing with complex blockchain data.
To begin using Giza Datasets, first install the package with pip.
Giza Datasets simplifies the process of accessing and using blockchain data for ML purposes. Once the package is installed, datasets can be loaded directly from Python.
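The original listings for these two steps are not reproduced in this text, so the following is a minimal sketch rather than the official snippet. It assumes the package is published on PyPI as `giza-datasets`, that it exposes a `DatasetsLoader` class with a `load(name)` method, and that `tvl-fee-per-protocol` is a valid dataset name. All three should be checked against the documentation. The wrapper `load_dataset` is a hypothetical helper that simply avoids an import error when the package is absent.

```python
# Install first (assumed package name): pip install giza-datasets
import importlib.util


def load_dataset(name):
    """Return the named dataset, or None if giza_datasets is not installed."""
    if importlib.util.find_spec("giza_datasets") is None:
        return None
    # Assumed API: a DatasetsLoader class with a load(name) method.
    from giza_datasets import DatasetsLoader
    return DatasetsLoader().load(name)


if __name__ == "__main__":
    df = load_dataset("tvl-fee-per-protocol")  # assumed dataset name
    if df is None:
        print("giza-datasets is not installed")
    else:
        # Polars is used underneath, so the usual DataFrame methods should apply.
        print(df.head())
```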
This is just one example of how to use Giza Datasets. For the full list of available datasets and further details, see our documentation.
Introducing Osiris: Simplifying Data Consumption in ZKML
Alongside Giza Datasets, we are proud to announce the open-sourcing of Osiris, a Python tool that streamlines the consumption of data into verifiable models for inference. Osiris is designed to ease the integration of various data types – from CSV and Parquet files to PNG images – into Orion models for verifiable inferences. This tool represents a significant stride in simplifying the process of feeding data into ML models, making it more efficient and accessible for users at all levels of expertise.
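To illustrate the kind of conversion such a tool has to perform, here is a small sketch that turns numeric CSV text into a flat, fixed-point tensor. It does not use or reproduce the Osiris API. The function names, the 16.16 fixed-point layout, and the sign-magnitude encoding are assumptions made for illustration, chosen because provable ML frameworks typically work with fixed-point integers rather than floats.

```python
import csv
import io

FRACTIONAL_BITS = 16  # assumed 16.16 fixed-point layout


def to_fixed_point(value):
    """Encode a float as (magnitude, is_negative) in 16.16 fixed point."""
    magnitude = round(abs(value) * (1 << FRACTIONAL_BITS))
    return magnitude, value < 0


def csv_to_tensor(text):
    """Parse numeric CSV text into a (shape, flat fixed-point data) pair."""
    rows = [[float(cell) for cell in row]
            for row in csv.reader(io.StringIO(text)) if row]
    shape = (len(rows), len(rows[0]) if rows else 0)
    # Flatten row by row, as a tensor serializer would.
    data = [to_fixed_point(value) for row in rows for value in row]
    return shape, data


shape, data = csv_to_tensor("1.5,-0.25\n2.0,0.0\n")
print(shape, data)
```

The sketch does not validate that all rows have the same length. A real pipeline would also need to handle Parquet files and images, which is the part a dedicated tool takes care of.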
With Osiris, users can seamlessly load input data into their ZKML models, enabling quicker and more effective development of blockchain ML applications. This tool is a testament to our dedication to providing comprehensive solutions that not only address data accessibility but also the practical aspects of ML model development and deployment.
Conclusion
Machine learning is a complex software practice that has traditionally been carried out by highly specialized actors and organizations. In our mission to turn ML practice into an open ecosystem, we have seen a dire need to organize, refine, and extend blockchain data so that next-generation smart contract capabilities can emerge in the form of AI Actions.
Giza Datasets, along with Osiris, lays the foundation for the future of AI Actions. As we embark on this journey, we invite the community to join us in exploring and contributing to this growing ecosystem. Our commitment is to continually enhance and expand Giza Datasets, empowering developers and innovators in the ML space for the blockchain.