TECHNICAL RESOURCES
This resource page is a starting point for the technical development of your solution, designed to enable you to start building quickly. The tools, models, and frameworks outlined below are recommended, not mandatory. Feel free to use the technology of your choice.
The scope of your solution
In general, we’re looking for solutions from the legal domain which utilise large language models (LLMs).
Need inspiration? Check out the problem statements provided by co-organising institutions and our sponsors. Most of them offer non-monetary prizes to the team that best addresses their problem.
Building the solution
From a technical standpoint, LLM-based web applications usually consist of the following three architectural layers:
- Front-end – the user interface of the application. This is what your users see and interact with. The front-end is rendered and executed in the user’s web browser; it collects user input and presents the LLM’s responses in a user-friendly manner.
- Back-end – the application’s logic which exists on a remote server. Typically, the back-end handles user sessions and connects the user with the underlying LLM model.
- The LLM model – a computer program based on artificial neural networks that can generate, summarise, analyse, and comprehend text. It can also perform tasks such as sorting data into categories (i.e., statistical classification).
The following sections outline how to select suitable technologies for each of the architectural layers outlined above. If you prefer to use a “no code” solution, we have a section on that as well.
Let’s start with the LLM model, as this is the key building block of your application.
I. Choosing the LLM model
In general, you have two options: you can choose (1) a general-purpose LLM or (2) an LLM specifically trained for the legal domain. While general-purpose models are heavily researched and developed by top-tier AI companies, and are superior in the number of parameters and the size of the context window (we explain both of these terms later in the text), domain-specific models are trained on specialised data and can perform better on domain-specific (such as legal) tasks.
Understanding the LLM jargon
Before we dive into the list of models, let’s introduce some common technical terms. Below, we explain the basic principles of LLMs and introduce the jargon (in bold).
When somebody designs a new LLM, it is at first an “empty” neural network. Such a network consists of millions of artificial neurons (i.e., perceptrons), which are interconnected. The connections between perceptrons differ in how strong they are. When we change the strengths of the connections, we change the behaviour of the model. At the moment of creation, the connections have randomly assigned strengths. The goal is to set up the connections so that the model gives us optimal performance.
Due to the number of connections in LLMs, we wouldn’t be able to set them up manually. However, we can use a large set of example questions and answers to determine how to change the strengths of the connections in the model to obtain the desired results. The process of changing the strengths of the connections using example questions and answers is called model training. The example questions and answers themselves are called training data.
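To make this concrete, here is a toy sketch in Python (not a real LLM): a single “connection strength” is repeatedly nudged using example question-and-answer pairs until the model’s answers match the training data.

```python
# Toy illustration only (not a real LLM): a single "connection strength"
# (weight) is repeatedly nudged so the model's answers approach the
# example answers in the training data.
training_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (question, desired answer)

weight = 0.5           # randomly assigned strength at the moment of creation
learning_rate = 0.05   # how strongly each example adjusts the connection

for _ in range(200):                          # many passes over the training data
    for x, target in training_data:
        prediction = weight * x               # the model's current answer
        error = prediction - target          # how far off the answer is
        weight -= learning_rate * error * x   # adjust the connection strength

print(round(weight, 3))  # converges towards 2.0, the optimal strength here
```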
Training state-of-the-art LLMs is a long, expensive, and complicated process, which is why only a few companies around the world can afford to do it. Trained general-purpose LLMs are sometimes referred to as foundation models. It would be virtually impossible for you to train your own foundation model from scratch at the Hackathon, but the good news is you don’t have to.
Often, we would like a bit more than just a foundation model. To improve performance on a given task, the foundation model can be trained with additional data. Typically, this new training data comes from a specific field in the industry. The resulting models are referred to as domain-specific LLMs. This practice of additional training is called model fine-tuning. Due to the technical complexity, it would be almost impossible for you to fine-tune your own model from a foundation model at the Hackathon. However, at this stage, you don’t have to worry about model fine-tuning. Should your team be successful, you might want to explore it after the incorporation of your company.
Training is not the only way an LLM can gain new knowledge. There is a popular technique for “injecting” information into an existing LLM: Retrieval-Augmented Generation (RAG). Using this technique, when a user starts a new chat session, you can first programmatically insert a pre-existing set of documents into the context of the conversation – otherwise known as the context window. LLM models differ significantly in the size of their context window, or in other words, in their ability to consume additional data just before the start of the conversation. The bigger the context window, the more text you can put in.
To learn more, you can refer to Groq’s blog post and video explaining RAG as well as their RAG implementation example.
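As a starting point, a minimal sketch of this context-window injection using Groq’s Python SDK might look as follows. It assumes the GROQ_API_KEY environment variable is set; the model name and the document snippets are placeholder assumptions, not a definitive implementation.

```python
# A minimal sketch of "injecting" documents into the context window: the
# simplest form of RAG. Assumes the `groq` package is installed and the
# GROQ_API_KEY environment variable is set; the model name and document
# snippets below are placeholders.
from groq import Groq

documents = [  # hypothetical pre-retrieved legal snippets
    "Art. 6(1) GDPR: processing is lawful only if at least one of the listed conditions applies.",
    "Recital 40 GDPR: for processing to be lawful, personal data should be processed on a legal basis.",
]

client = Groq()
response = client.chat.completions.create(
    model="llama3-70b-8192",  # example model name; check Groq's documentation
    messages=[
        {
            "role": "system",
            "content": "Answer using only the documents below.\n\n" + "\n\n".join(documents),
        },
        {"role": "user", "content": "When is the processing of personal data lawful?"},
    ],
)
print(response.choices[0].message.content)
```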
The performance of LLM models can be measured using benchmark tests (i.e., benchmarks). LawBench is a popular domain-specific benchmark for Law LLMs. The current LawBench leaderboard of selected LLMs can be found on their website.
Finally, once you know which model you want to use, it needs to be placed on a server connected to the internet, so that the logic of your application (i.e., the back-end) can connect to it. You have two options here:
- You can use an off-the-shelf service (i.e., managed service) and connect to it via API
- You can manually deploy an open-source LLM to a cloud-based service
Available LLM models
At the Hackathon, there are no restrictions in terms of the model you use in your application.
Depending on how adventurous you are, we recommend choosing from the options below.
No code options
The following providers offer no-code solutions if you would like to avoid programming your own application entirely. Each of them is a one-stop shop, particularly suitable for non-technical participants:
- OpenAI’s AI studio
- Google’s Gemini AI studio and Vertex AI Studio (Google Cloud)
- Mistral Large available via Azure AI Studio
- Groq’s Playground
- You can create a free account to use the Playground.
- Should you have questions, feel free to ask a Groqster in-person at the Hackathon or post your question to the Hackathon’s Discord server.
- Alternatively, you can join Groq’s very own developer community on Discord.
Convenient options available via API
The following models are all general-purpose and available as a managed service:
- OpenAI’s ChatGPT
- Google’s Gemini and Vertex AI Studio
- Introducing Gemini (a blog post)
- Gemini for application developers (YouTube video)
- Google Cloud online courses: Introduction to Gen AI and LLMs, and Get Started with Vertex AI Studio
- Sample code and notebooks for Generative AI on Google Cloud
- Mistral AI’s La Plateforme (yes, that’s the actual name!)
- Mistral AI’s Le Chat, a completely free interface with access to GPT-4-level models
- Amazon’s Bedrock
- Getting started guide
- Quick start guide on GitHub
- Further Bedrock examples on GitHub
- ScrapeGraphAI with Amazon Bedrock. ScrapeGraphAI is a Python library that uses LLMs and direct graph logic to create scraping pipelines for websites, documents, and XML files.
- Groq’s Console API
- You can create a free account to use the Console and generate API keys. Reach out to [email protected] with your organisation ID and they will happily remove the rate limits for the duration of the Hackathon.
- Should you have questions, feel free to ask a Groqster in-person at the Hackathon or post your question to the Hackathon’s Discord server.
- Alternatively, you can join Groq’s very own developer community on Discord.
- For further information, see the quickstart guide, code examples, and showcase applications.
Advanced options
- SaulLM-7B: a pioneering open-source LLM for legal applications, based on Mistral 7B. The model is available on Hugging Face, from which you can deploy it to a cloud platform such as AWS SageMaker, Azure ML, or Google Cloud (see the loading sketch after this list).
- Cambridge Legal LLM: To be confirmed
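If you take the advanced route, loading SaulLM-7B with the Hugging Face transformers library could look roughly like the sketch below. The repo id is our assumption; verify it on the model’s Hugging Face page.

```python
# A minimal sketch for running SaulLM-7B via the Hugging Face transformers
# library (requires `transformers`, `torch`, and `accelerate`, plus a GPU
# with enough memory). The repo id below is an assumption; verify it on
# the model's Hugging Face page.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Equall/Saul-Instruct-v1"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Summarise the doctrine of consideration in English contract law."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```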
II. Back-end
If you intend to use one of the managed LLMs listed above (or SaulLM-7B), we recommend using the following (a minimal wiring sketch follows this list):
- LangChain abstraction layer, the most popular and well-documented middleware for LLM back-ends
- Using Langchain with AWS
- Python or TypeScript programming languages, both of which can be used in combination with LangChain
- The Groq Python and TypeScript SDK packages, both of which can be used in combination with LangChain and the LangChain Groq integration
Video tutorial: LangChain Explained in 13 minutes
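To illustrate, a minimal back-end wiring of LangChain with Groq’s managed API might look like this sketch. It assumes the langchain-core and langchain-groq packages are installed and the GROQ_API_KEY environment variable is set; the model name is an example, not a recommendation.

```python
# A minimal back-end sketch: LangChain wired to Groq's managed API.
# Assumes `langchain-core` and `langchain-groq` are installed and the
# GROQ_API_KEY environment variable is set; the model name is an example.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq

llm = ChatGroq(model="llama3-70b-8192")  # example model name

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful legal research assistant."),
    ("human", "{question}"),
])

# LangChain pipes the prompt, the model call, and output parsing together.
chain = prompt | llm | StrOutputParser()
print(chain.invoke({"question": "What is a force majeure clause?"}))
```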
Cloud services
All participants will have free access to selected Google Cloud and AWS services on the day of the Hackathon only.
Credentials are available at the Info Point.
Databases
To store data in your application, such as user profiles or chat history, we recommend using a managed database on the cloud platform of your choice. Each provider has a free tier, which should be sufficient for the event.
Important: Hackathon participants will get free credits for the services on the Google Cloud platform.
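As an illustration, storing chat history in a managed database could look like the following sketch, here using Google Cloud Firestore since participants receive Google Cloud credits. The collection and field names are hypothetical; adapt them to your own data model.

```python
# A minimal sketch of storing chat history in a managed database, here
# Google Cloud Firestore. Assumes the `google-cloud-firestore` package and
# configured credentials; collection and field names are hypothetical.
from google.cloud import firestore

db = firestore.Client()

def save_message(session_id: str, role: str, content: str) -> None:
    """Append one chat message to the given session's history."""
    db.collection("sessions").document(session_id).collection("messages").add({
        "role": role,
        "content": content,
        "created_at": firestore.SERVER_TIMESTAMP,  # server-side timestamp
    })

save_message("demo-session", "user", "What is a force majeure clause?")
```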
III. Front-end
We recommend the following building blocks:
- TypeScript programming language
- React UI library
- Vite tooling
- And perhaps most importantly, the Deep Chat UI component
Video tutorial: TypeScript for React in one hour
Video tutorial: React for Beginners in one hour
Legal data for the context window (RAG)
If you intend to utilise the context window of your LLM, we recommend exploring the following resources:
Liquid Legal Institute (LLI)
- LLI’s open-source knowledge base with a variety of resources for natural language processing and legal tech, listing a wide range of datasets
- LLI’s open-source LLM Initiative
- LLI’s Legal Operations community
- LLI’s Metaverse for Legal
- LLI’s EU-funded project on license-aware crawling of web content by automatically identifying and retrieving content licenses
Legislative data:
- British and Irish case law and legislation
Cellar is the common data repository of the Publications Office of the European Union. Their API allows performing different operations on EU Treaties, International agreements, Legal acts, Complementary legislation, and Preparatory documents.
If you’re interested in any of these sectors, contact us. We may have the Cellar team download data subsets beforehand to prevent API issues during the Hackathon.
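For a first experiment, Cellar can also be queried programmatically. The sketch below assumes Cellar’s public SPARQL endpoint and the CDM ontology; verify both against the Publications Office documentation before relying on them.

```python
# A rough sketch of querying Cellar programmatically. The SPARQL endpoint
# URL and the CDM ontology class below are assumptions; verify them
# against the Publications Office documentation. Requires `requests`.
import requests

ENDPOINT = "http://publications.europa.eu/webapi/rdf/sparql"  # assumed URL
query = """
SELECT ?work
WHERE { ?work a <http://publications.europa.eu/ontology/cdm#work> }
LIMIT 5
"""

resp = requests.get(
    ENDPOINT,
    params={"query": query, "format": "application/sparql-results+json"},
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["work"]["value"])
```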
Case Law data:
- Many case law datasets are available in the Liquid Legal Institute GitHub page
- British and Irish case law and legislation
- Case law of the EU Court of Justice is available in the Cellar repository
- After the Hackathon, note also The Cambridge Law Corpus: A Corpus for Legal AI Research
Contract data
- Law Insider: Sample Contracts by Contract Type
If you’re interested in innovating with contract data, contact us. We may enquire with sponsors if they can assist you with the dataset.
After the Hackathon
If you would like to continue working on your solution after June 23rd, you might want to consider fine-tuning your own LLM model. You can use resources like the Mistral fine-tuning codebase.
We’re eagerly waiting to hear about your progress. If you have something interesting to share, we’re more than willing to introduce you to the event’s sponsors who might be able to assist you further. Stay in touch at [email protected]!