Meryem Arik on LLM Deployment, State-of-the-Art RAG Apps, and Inference Architecture Stack

In this podcast, Meryem Arik, Co-founder/CEO at TitanML, discusses innovations in Generative AI and Large Language Model (LLM) technologies, including the current state of large language models, LLM deployment, state-of-the-art Retrieval Augmented Generation (RAG) apps, and the inference architecture stack for LLM applications.

Key Takeaways

  • One of the most important LLM usage considerations is to figure out which modality you care about. If you are only doing text processing, you don't need an image model or an audio model.
  • If you are looking to go to mass-scale production with LLMs, or you care about data residency and privacy, consider self-hosted models over API-based ones.
  • The main challenges with self-hosting LLMs are model quality, the infrastructure you become responsible for, and getting access to GPUs.
  • For a state-of-the-art RAG application, the most important components are the data pipelines and the embedding search.
  • In terms of regulatory compliance standards, we need to strive towards regulatory alignment between the EU, Britain, the U.S., and Asia.

Introductions [00:50]

Srini Penchikala: Hi everyone. My name is Srini Penchikala. I am the lead editor for the AI, ML and Data Engineering community at InfoQ and a podcast host. In today's episode, I will be speaking with Meryem Arik, co-founder and CEO of TitanML, the company behind generative AI solutions for regulated industries. We will discuss the current state of large language models, or LLMs, and how to deploy and support those LLMs in production, which is very important in successfully adopting generative AI or Gen AI technologies in any organization. Hi Meryem, thank you for joining me today. First, can you introduce yourself and tell our listeners about your career, your focus areas, and what you've been working on recently?

Meryem Arik: Of course. Thank you so much for having me. So I'm Meryem Arik. I'm the co-founder and CEO of TitanML. And what we do is we build infrastructure to help regulated industries deploy generative AI within their own environments. So instead of sending their data to ChatGPT or Anthropic, we help them deploy within their own on-prem or VPC infrastructure.

In terms of my background, I actually have quite a varied background. I've always been very techie, I guess. My degree was in theoretical physics and philosophy, and that's actually where I met my co-founders. All three of us are physicists. I then went off into the world of enterprise and my co-founders stayed in the world of research, but we stayed very, very close friends. We started the business a couple of years ago because their research was all about efficient inference optimization and compression, and I was sitting in the enterprise world. What we could see is that there was this really big gap between what was possible in the research literature and what enterprises were actually able to achieve. And we saw that there was going to be a very big infrastructural gap when AI really came around, and that's what we looked at solving.

Current State of Large Language Models (LLMs) [02:48]

Srini Penchikala: Thank you. It seems like generative AI and LLM technologies are in full swing lately. Recently at the Google I/O conference, Google announced several new developments, including Google Gemini updates and what they call generative AI in Search, which will probably change the search functionality on Google. OpenAI released GPT-4o, the Omni model, which can work with audio, vision and text in real time. And Llama 3 was also released recently. So it looks like there is a lot of innovation happening, and this space seems to be accelerating faster than ever. Where do you see Gen AI and LLMs at this point, only a few years since ChatGPT came out, and how do you see them innovating over the rest of this year and next year?

Meryem Arik: It's been a pretty crazy couple of weeks. For context for the listener, we're talking I think two days after the Google I/O conference, three days after the OpenAI event, and about three weeks after the Llama 3 release. We got started in the industry before it was called LLMs. We got started back when it was called NLP, and it's been the most phenomenal, astronomical growth and technological advancement since we first started the company. Back then we were working with GPT-2 models, and we thought they were really impressive because they could write a poem that sounded like Shakespeare.

And now we're at a stage where I can have a language model that can constantly see my screen and give me suggestions about how I'm working and give me real-time audio feedback. That is just completely night and day. So the rate of progression is just enormous and it really feels like we've cracked some kind of understanding where all of the players in the field are building these same kind of capability models.

But I think it's important not to get too hung up on how much the models are improving with every release. It's been said a couple of times, and recently Matt Turck said it in a tweet: even if we stop LLM innovation, we probably have around a decade of enterprise innovation that we can unlock with the technologies that we have. So we're already at a stage of enormous potential even if we stand still. And we're not standing still. So it's an incredibly exciting time to be in this space.

Srini Penchikala: Yes, we are not standing still. We are not even moving linearly. It's almost like innovating on an exponential scale, right?

Meryem Arik: Exactly. Something that I'm really excited about, and that I think will come up in the next year, is that we're going to see increasingly impressive capabilities from surprisingly small models. I can give an example of this. The Llama 3 8-billion-parameter model was as performant as GPT-3.5. And if we all think back to when GPT-3.5 was released, we thought it was some kind of magic. Now we're able to get that through a relatively small eight billion parameter model. I think we're going to increasingly see these smaller models providing better and better outputs, and I think that's just because of the quality and number of tokens that we're training them on. So that's on the one end, which will really help the enterprise scale specifically.

On the other end of the spectrum, we're starting to see emergent technology and phenomena. The GPT-4o model looks like it's been trained to be natively multimodal. It looks like it can do audio-to-audio conversation rather than having to go through text as a middle layer. That's incredibly exciting, and I think we'll start seeing some very impressive technologies from those proper frontier-level models, especially playing with multimodality as well. So that's what I think we're going to see over the next year: more enterprise-friendly scale models, and then also these huge models with amazing multimodal abilities.

LLM Use Cases in Regulated Industries [06:52]

Srini Penchikala: Yes, definitely. I would like to talk about the enterprise adoption a little bit later in the podcast in terms of RAG, the retrieval augmented generation techniques. But let's start with a couple of other questions to catch up on the LLM space. Right? So can you talk about some important use cases for using LLMs for our listeners who may be new to this? Especially, I know your focus is regulated industries, so how do you see LLMs helping in terms of security, privacy, compliance and other areas?

Meryem Arik: Yes. So it can be quite difficult for people to imagine what LLMs are able to do because it's quite different from any other form of technology that we've had before. The way that I get my clients to think about it is I ask them to think of these LLMs almost like having an intern. And what would you do if you got given access to essentially unlimited free interns? And that's kind of what I think you can do with Gen AI. So if there are tasks in your organization that you feel could be done if an intern had supervision, that's the kind of thing that you can do with these kinds of models. So I don't think we're at the stage yet where we're replacing senior-level jobs, but there's a huge number of tasks that happen in every single industry that can be delegated and that can be broken up into smaller tasks.

One of the most common use cases that we see as a 101 for enterprise, and I think we'll talk about this later when we talk about RAG, is essentially acting as a research assistant or some kind of knowledge management system. It's incredibly good at searching through large swathes of documents and then summarizing and using that as a research piece. So that's the kind of early use we're seeing, but I think we're going to see increasingly creative and also niche use cases as people really investigate the workflows they have in their business and what they could automate if they had access to, for example, these recent graduates.

How to Choose an LLM Model [08:57]

Srini Penchikala: Yes, definitely you're right. I don't think LLMs will completely replace human efforts, but they can definitely augment a lot of things we do, right? Also, in terms of LLM models, there are these base models and open source and proprietary models. How do you see this space? I know there are the Llama models, then there's Llava for vision, and of course GPT-2, 3, 4 and now 4o. And then Google has Bard, Anthropic has Claude, and so on. Can you talk more about these? There are so many models; how should developers even go about choosing a base model? What should they consider when they're looking to leverage LLMs in their applications?

Meryem Arik: I think what's important to realize, for your listeners who might not be familiar with base models and these foundation models, is that they come pre-trained. So it's not like old-school, I say old school, only a couple of years ago, data science and machine learning, where you would have to spend a lot of time and effort on training. These models come pre-trained, and for the modality that you're looking at, they will already have a really good base level of understanding that you can then work on top of. When you look at the different foundation models, you already pointed out a number of them. For example, on the open source side, we have the Llama models, we have the Llava models for vision, we have Whisper, and then we have all of the proprietary API-based models. There's a huge number to choose from, so it can be very, very difficult to know which one's right for your particular application.

So I'll give a couple of heuristics, which might be helpful. The first heuristic I'll give is figuring out what modality you care about. If you are just doing text processing, then you don't need an image model or an audio model; you need a text model. If you are doing table parsing or image parsing or audio, then you need something that works with that particular modality. So the modality is the first thing you should worry about. The second decision you'll typically have is: okay, I know my modality. Then you have this big choice between API-based models and self-hosted models. This is probably one of the biggest choices you have when you're comparing different models, because once you've picked a regime it's easy to just go for the best of the best within it. If you are experimenting, or if you don't have particularly strict compliance or data residency regulations and you're not going to deploy at a huge scale, it really makes sense to start with the API-based models. Things like OpenAI and Anthropic are really good places to start.

If, however, you are looking to go to mass-scale production, or you really care about data residency and the privacy aspects, or maybe you're really cost-sensitive, then it makes sense to start looking at those self-hosted models. That's a key decision you have to make. So firstly, the modality. Secondly, API-based versus self-hosted. And then the third decision I would give is around the cost-to-accuracy trade-off that you need to make. Almost all of the API providers, and the same is true in open source, have different-sized models that perform differently and have different cost ratios. So if you're asking it to do a very easy task, you might use a very small model. If it's a very complex task, you might use a very big and very expensive model. So those are kind of the three highlights.

And then the final thing I would say is if you are working in something domain-specific or maybe in a language that isn't very well-supported, you might also want to start looking at fine-tuned variants of all of these. So just to summarize, because that's actually quite a lot of information. Number one, what modality is it that you care about? Number two, do you want to go API-based or self-hosted? Number three, what is the size, performance, cost trade-off you're willing to make? And then number four, if it's something very niche you're doing or in a foreign language, you might want to do something fine-tuned.
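(Editor's note: as a rough illustration of that four-step checklist, here is a minimal Python sketch. The categories, parameters and example model names are illustrative assumptions, not recommendations made in the conversation.)

```python
# Hypothetical sketch of the four heuristics: modality, hosting, size/cost, fine-tuning.
# The example model names are placeholders for illustration only.

def choose_model(modality: str, mass_scale_or_private: bool,
                 hard_task: bool, niche_domain: bool) -> str:
    # 1. Modality: only consider models that handle your data type.
    if modality not in {"text", "image", "audio", "table"}:
        raise ValueError(f"unsupported modality: {modality}")

    # 2. API-based vs self-hosted: scale, data residency, privacy, cost sensitivity.
    hosting = ("self-hosted (e.g. Llama 3)" if mass_scale_or_private
               else "API-based (e.g. OpenAI or Anthropic)")

    # 3. Cost/accuracy trade-off: small and cheap for easy tasks, big for hard ones.
    size = "large" if hard_task else "small"

    # 4. Niche domain or under-served language: prefer a fine-tuned variant.
    variant = "fine-tuned variant" if niche_domain else "base model"

    return f"{size}, {hosting}, {variant}, {modality} modality"


print(choose_model("text", mass_scale_or_private=True,
                   hard_task=False, niche_domain=False))
```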

Self-hosted LLM deployments [13:02]

Srini Penchikala: I know you've been focusing on self-hosted LLM deployments, helping companies avoid the API-based route, which comes with its own data privacy and latency challenges. On the self-hosted side, I know Ollama is one example. Can you talk about the pros and cons? I know there are some advantages to self-hosting, but I'm sure it also comes with some constraints, right? What should teams keep in mind? Self-hosting is not all upside, right? There are some challenges as well.

Meryem Arik: Yes, exactly. There are definitely challenges with self-hosting. If you just want the raw easiest option, API-based is the easiest. There are broadly three challenges, or reasons you might not want to self-host, and they're changing day by day, so I'll tell you the challenges and also describe how I think they're changing. The first one is around the quality of the model. This is especially something we saw last year. The state-of-the-art models in the API-based regime, for example OpenAI, were so much better than what we had in open source that, in terms of raw quality, you just weren't able to get what you needed. So that was a challenge a lot of people had last year if they needed to self-host: it was very difficult to get the quality of models they needed.

This has changed dramatically. And Llama 3 was a really big turning point. It was the first time that we've seen open source models essentially perform as well as API-based models for a better cost trade-off. So that quality of model thing is going away, but that's something you need to evaluate.

The second reason people might find self-hosting challenging is that there's a lot more infrastructure you're responsible for when you're self-hosting compared to when you're using an API. When you're using an API-based model, you take for granted a huge amount of infrastructure that they've built behind the scenes: everything from the batching server to the model optimization they've done to enabling function calling. This is all stuff that, if you're self-hosting and you care about it, you now need to build yourself. That's the whole reason we exist as a company: we provide that infrastructure that you take for granted in the API regime, but for the self-hosted regime.

And then the final challenge around self-hosting is essentially getting access to GPUs. This is still something teams struggle with. Even though the GPU crunch has eased since last year, it's not necessarily easy to get the GPUs that you require. And unless you're deploying at an enterprise scale, it might not be cost-effective for you to rent those GPUs either. So there's the quality challenge, which I think is going away pretty quickly; the infrastructure and ease-of-use challenge, which we're trying to get rid of, but which is always going to be there because it's more work; and then finally, whether you can get and use those GPUs efficiently and effectively.

Srini Penchikala: Right. So assuming teams check all of these boxes and self-hosting is the way to go, how do you see this space working? Do you see large companies going with the self-hosting model because they can afford to host on-prem, or do you also see smaller companies doing this because they want more control over the overall process? What size of company can benefit from this?

Meryem Arik: So I actually am surprised to say that it's not just big companies. I would've thought that it would just be something that really big companies with very big data centers would be investing in. But actually that's not what we've seen necessarily from our client base. That's not to say we see very small startups doing this, but for example, mid-market businesses and scale-ups are really investing in these self-hosted capabilities.

For the big, big companies, it's typically because of privacy concerns and data residency concerns. For the smaller companies, and this is potentially counterintuitive, it's actually very often because of the performance you can get, both the latency and the throughput performance; by self-hosting you can get much faster responses, for instance. Also, when you're self-hosting models, you get access to a much bigger choice of models. So if you're building a state-of-the-art RAG app, you might think, okay, the best model for everything is OpenAI. Well, that's not actually true. If you're building a state-of-the-art RAG app, the best generative model you can use is OpenAI, but the best embedding model, the best re-ranker model, the best table parser, the best image parser, they're all open source. So if you're a smaller company, you might have some kind of hybrid solution where you get better performance by using some open source and self-hosted components.

State-of-the-art RAG Apps [17:54]

Srini Penchikala: Yes, definitely. I was recently exploring Ollama, the framework, and I was able to download Llama 3, Mistral and Llava. So I was not constrained to any one of these models, right? I would have been if I had chosen OpenAI or some other proprietary hosted model. So let's talk about RAG a little bit. I know you mentioned the research assistant and knowledge management examples, and you just mentioned the state-of-the-art RAG app. Let's jump into that. How do you define the characteristics of a state-of-the-art RAG app? What should it have to be really effective, efficient and valuable to the teams?

Meryem Arik: Yes, so RAG or Retrieval Augmented Generation is a technique that I would actually say the majority of production scale AI applications are using. And what it does is it enables your LLM, your generative LLM to call on data. So that's data that you might have stored in your business. So there are a couple key parts of the RAG app that I think are really important to get right, and there are two parts which get talked about a lot that I actually think are not very important. The first one is the vector database. I don't think the vector database of choice is that important. They're pretty commoditized. Most of them are fairly similar. So I don't spend a lot of time thinking about that choice. The second thing that I don't think you should necessarily spend a lot of time thinking about, which is counterintuitive, is which generative model you should use.

People ask, should I use Mistral or Llama 3 or whatever? It typically doesn't matter, and the reason it doesn't matter is that the most important things are your data pipelines and your embedding search. It's the whole garbage in, garbage out thing. If you feed the best model in the world really bad search results, you're going to get a really bad output. And if you feed really amazing search results into a not very good model, you'll probably still get a good output. So there are two things I think are really important. The first one is the document processing or data processing pipeline, for example, how you chunk your text, and if you have images or tables, how you parse those images and tables so that they can be searched efficiently. This is incredibly important to get right, and it's actually why we recently released support for improved table parsers and improved image parsers within our server.
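(Editor's note: below is a minimal sketch of the chunking step in such a data pipeline, assuming a simple character-based splitter; the chunk size and overlap values are arbitrary illustrative choices, not TitanML defaults.)

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping chunks that can be embedded and searched."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # slide forward, keeping some overlap for context
    return chunks

# Tables and images would be handled by dedicated parsers before this step.
document = "Long text extracted from a PDF or web page..."
for i, chunk in enumerate(chunk_text(document, chunk_size=40, overlap=10)):
    print(i, repr(chunk))
```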

The second thing you need to get right is also non-generative: it's essentially the semantic search part of the application. The best way by far that we have found to do that semantic search is essentially a two-stage process. You can make it more complicated, but at the highest level it's two stages. The first one is the embedding search, and the second one is the re-ranker search. The embedding search is very, very good for searching over vast numbers of documents and picking out the ones it thinks are most relevant. The re-ranker search is a more expensive search, but given a shortlist, it's able to say which ones are the most relevant. So a combination of an embedder and a re-ranker is fantastic for that RAG application.
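(Editor's note: here is a hedged sketch of that two-stage embed-then-rerank search using the open-source sentence-transformers library; the model names are common open-source defaults chosen for illustration, not recommendations from the conversation.)

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

chunks = [
    "Quantizing a large model to 4-bit reduces GPU memory use.",
    "Our holiday policy allows 25 days of annual leave.",
    "Multi-GPU inference needs tensor or pipeline parallelism.",
]
query = "How can I fit a big LLM on a smaller GPU?"

# Stage 1: cheap embedding search over all chunks to build a shortlist.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_emb = embedder.encode(chunks, convert_to_tensor=True)
query_emb = embedder.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_emb, chunk_emb)[0]
shortlist = [chunks[i] for i in scores.argsort(descending=True)[:2].tolist()]

# Stage 2: more expensive cross-encoder re-ranker over the shortlist.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
rerank_scores = reranker.predict([(query, c) for c in shortlist])
best_context = max(zip(rerank_scores, shortlist))[1]

print(best_context)  # this is what gets passed to the generative model
```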

If I think about the actual models that I need to deploy or self-host, the LLM is one of them, and that could be either API-based or self-hosted. But the others are: I need a table parser, I probably need an image parser, I need an embedding model, and I need a re-ranker model. That's a lot to deploy and optimize and orchestrate. So we've built our containers so you can deploy it all in one and hit a single endpoint, but that's the number of models you need for a really, really good RAG application.

LLM Deployment Techniques [21:43]

Srini Penchikala: Yes. I'm definitely very interested in learning about the TitanML inference stack architecture, but before we delve into that, let me ask about the recent presentation you gave at QCon London. I know you talked about LLM deployment techniques. Can you highlight a few of those techniques, and do you have any additional observations since the conference?

Meryem Arik: Yes. The talk that I gave was my tips and tricks for LLM deployment. It was essentially a collection of tips that I've learned over the last couple of years working with our clients, things we end up saying over and over again, and I've written that talk up into a blog which will be released at some point on the InfoQ website. I'm also going to be doing a revised version of that talk in San Francisco in November or December, so you'll be able to see how much of the landscape has changed between those two talks. I gave seven tips overall; maybe I'll pick out a few that I think are interesting to highlight. The first one is that people don't spend enough time thinking about their deployment requirements and their deployment boundaries, and they're often left scrambling when they get to deployment.

But what we find is really helpful is when you start designing your application, if you have very hard and fast deployment requirements, like it needs to be real-time or it can be batched or it has to be deployed on this GPU or it has to have this kind of cost profile, these are all things that you should know upfront because it will radically change the way you architect your system. So knowing your deployment requirements is crucial.

Another tip I gave is that unless you have pretty much unlimited resources, you should almost always quantize your model, using four-bit quantization specifically. The reason is that, given a fixed resource budget, a larger model that has been quantized down to that size will perform far better than a model natively of that size, so you retain a lot of the performance and accuracy. That's from a Tim Dettmers paper a couple of years ago called 4-Bit Scaling Laws or 4-Bit Precision Scaling Laws or something like that. So we almost always recommend using a bigger model that's quantized down rather than using a natively small model.
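(Editor's note: as a hedged illustration of that advice, the sketch below loads a larger open model in 4-bit using Hugging Face transformers with bitsandbytes; the model ID is a placeholder and this is not TitanML's own quantization pipeline.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder; assumes you have access to the weights

# 4-bit (NF4) weights with bfloat16 compute, so a bigger model fits in a fixed GPU budget.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

inputs = tokenizer("Retrieval augmented generation is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=40)[0]))
```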

And I can give one more tip: even though GPT-4 is the best model, although now it'd be GPT-4o or Claude or whatever it is, that doesn't necessarily mean you need to use it for everything. Quite often, smaller models, which are cheaper and easier to deploy, are just as performant, and you're not really getting any benefit from using the bigger models. So think about how hard your task is before you shell out for the biggest and best model. Those are just three of the tips, but if you want to read all seven, you can read the blog.

LLM/AI Inference Architecture Stack [24:41]

Srini Penchikala: Yes, definitely. We will make sure we link the article in this podcast transcript. So let's jump into the big topic, the architecture framework that your company offers, the Titan Takeoff inference stack. I'm definitely curious about this. This is inference software that makes self-hosting of AI apps simple, scalable and secure. Can you discuss this architecture stack? What technologies does it include, and how can our listeners get started? Is there a tutorial or anything they can use to learn more about it?

Meryem Arik: For sure. You mentioned Ollama before, which is a way of really easily running a language model. We are a similar kind of framework, but designed for scale and designed for the enterprise. So we are essentially a way that you can run language model applications really efficiently on your own private hardware. The stack comprises a couple of things. I think what I'll do first is look at it zoomed out: what does it actually look like? Well, it's just a Docker container. It's a Docker container that, when you pass a model through, exposes an LLM API that you can interact with, and you can deploy that container on whatever hardware you care about. We fully integrate with things like Kubernetes, so it's a containerized and Kubernetes-native product.
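(Editor's note: the snippet below sketches how an application might call such a self-hosted container over HTTP. The port, path and payload shape follow the common OpenAI-compatible convention and are assumptions for illustration, not Takeoff's documented API.)

```python
import requests

# Assumed local endpoint exposed by a self-hosted inference container.
resp = requests.post(
    "http://localhost:3000/v1/chat/completions",
    json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model name
        "messages": [{"role": "user", "content": "Summarize our expenses policy."}],
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```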

Within that server, we've made a very unusual choice. I think we're one of only a handful of teams worldwide that's ever done this, which is that we've built absolutely everything from scratch ourselves, pretty much. The server part of the inference server is written entirely in Rust, so it's a multi-threaded Rust server. Then the inference engine, the engine that makes the model run faster, so rather than just deploying a Hugging Face model onto a GPU, we do a lot of smart things like quantization, caching and inference optimization at that level, is written in a combination of Python and OpenAI's Triton language.

And the reason we made the technical decision to write it in Triton rather than in CUDA is that it means we can be hardware-agnostic, because Triton will compile not just to NVIDIA, but also to AMD and Intel as well. So we don't have to be tied to the CUDA stack. Those are a couple of the decisions we've made. The entire stack is written in Docker, Rust, Python, and the Triton programming language.
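(Editor's note: to give a feel for what the Triton language looks like, here is the canonical vector-add kernel from the Triton tutorials. It shows how kernels are written as Python and compiled for the target GPU; it is only an illustration, not one of TitanML's kernels.)

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                      # which block this program instance handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

if torch.cuda.is_available():
    a = torch.rand(4096, device="cuda")
    b = torch.rand(4096, device="cuda")
    print(torch.allclose(add(a, b), a + b))  # True
```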

Srini Penchikala: So how much of the overall process does this stack automate or help developers and teams who are going to use it?

Meryem Arik: What the stack does is it allows you to turn just a GPU, raw hardware, plus a model into a very scalable endpoint that you can then build on top of. One of the things that makes our platform very unique is that we're not just saying, okay, you now have a Llama 3 endpoint; we support not just generative models but also embedding models, re-ranking models, image-to-text models, all kinds of modalities. And what's really interesting is you can deploy them in one container, and you can deploy that container over multiple GPUs. So it automates all of the multi-GPU setup. It allows you to hit a single container for an entire RAG application, all through a declarative interface. If you were trying to build this natively, you would have to worry about how you're going to do multi-GPU inference, and you'd be playing around with things like TensorRT-LLM to make it run faster.

With our clients, we've benchmarked that this typically saves about two to three months per project. Teams use our stack for experimentation, so they can easily swap models in and out, and then also for production, knowing that it's going to scale properly and isn't going to hit horrible throughput rates. So it's a really significant time-save, and it means the developers can focus just on building the application, which is actually the thing that brings them value, because they're working from an API that they know is stable and that they know will be supported when a new model comes out; they can just swap it in and out and focus on building a better application.

Srini Penchikala: So let's jump back to the regulation side of the discussion a little bit, Meryem. I was reading Stanford University's AI Index report for this year, and one of its predictions is that the number of AI regulations in the United States will sharply increase. Can you talk about AI regulations in general, and any specific regulatory compliance standards in the U.S., the UK, and the European countries? Regulation is always trying to catch up with innovation, so what is the balance between having some kind of good control over these technologies without disrupting or interrupting the innovation? What does “Responsible LLM or GenAI” look like?

Meryem Arik: Well, it's difficult for me. I double majored in theoretical physics and philosophy, so I do have this philosophical streak where I enjoy thinking about these things, but it's actually something I find incredibly difficult to think about. What I'm not worried about is some kind of Terminator-style risk. What I am incredibly worried about is the societal impact of the technology that we've already built, all the way from dis- and misinformation to mass underemployment. But I don't think those particular things are things you can stop through regulation, because they already exist in the technology that we have. So that's something I'm very cautious of, where I don't really know what the right or correct answer is. I also have this additional question that I think about: would it be safer if we lived in a closed source AI world or an open source AI world?

And both of them can be very scary. The closed source AI world, I think, scares me more. This idea that you have incredibly concentrated, potentially humanity-changing power in the hands of a very, very small number of people is something that really scares me, and those companies become more powerful than even governments. But on the other hand, if you have an open source regime, which I actually tend to lean towards, you do end up with models that can be harmful in the hands of bad actors. And we already have that right now. There are already websites where you can create deepfake pornography of children. That is something that exists right now from technology that has been empowered by open source. So the answer is, I don't really know what the regulation should look like, but what I do think is that we should have governments that are incredibly concerned about this and are very, very thoughtful about this, in a way that I don't necessarily think they are. I think they're concerned, but I don't necessarily think they're thoughtful.

And I also think that we need to be really striving towards regulatory alignment between the major powers, so the EU, Britain, the U.S. and Asia should all be aligned on how we're regulating. Currently it feels like there's a lot of misalignment, which is incredibly confusing for all of the players in the field. So that's my broad answer: it's something that I'm incredibly concerned about, and it's something that I think governments should think about very carefully. And alignment is really what I would be arguing for, whatever the regulations end up being.

Srini Penchikala: That's a great point, because some of these regulations don't have to be enforced by the government. They can be self-regulated, right? So like you said, what would human involvement look like in a post-AI world? Maybe that's where it is: maybe we can use humans to make sure that any content that is not appropriate doesn't get out. So those [inaudible 00:32:52] control those. Yes.

Meryem Arik: Yes. There's a huge amount of self-regulation, and I think unfortunately that self-regulation has to happen from the big platforms. It's not like there's one central place where we can say, okay, this is allowed and this isn't allowed. It's what's allowed on Facebook, what's allowed on Twitter, what's allowed on all of these things. And that actually was already a concentration of power. And it's something that I experienced because I was a teenager when social media first came out, and we've already experienced this problem of what are we allowed to put on and consume on these platforms? And in the past, we've actually not been very good at putting things that don't harm people on these platforms. So I'm hoping that we've learned our lessons from my generation and for the next AI generation we'll be a bit more careful though.

Srini Penchikala: Yes, makes sense. Switching gears a little bit, right, so one of the predictions at a recent Microsoft conference, they said in no more than three years, anything that is not connected to AI will be considered broken or invisible. So that kind of tells us how big a part AI is playing, right? So how do you see AI playing a larger role in our work and daily lives in the future? And do you have any interesting use cases where you see AI can help?

Meryem Arik: It's an interesting quote, and I do genuinely think we'll see AI very, very deeply embedded in everything we do. Three years is quite ambitious though; I don't think they know the pace at which enterprise tends to move. I don't think enterprises do very much in three years. What are the use cases that I'm really excited about? Well, the use cases I'm really excited by are actually the ones that are potentially the most boring. It's those really micro improvements that we'll see in every single workflow. If we can make every single workflow 10% more efficient and keep doing that over and over again, I think we get to very real transformation. What people often do is say, we're going to automate all of this role or all of this department, and I think that is a mistake. The things I'm very excited by are the ones where we see very, very meaningful and practical improvements to workflows that already exist. And that's what I think we'll see a lot of in the enterprise.

Srini Penchikala: Yes, definitely. I agree. I kind of see the real power will be not only automating the parts of the organization, but automating the whole organization, right? So can we bring in this power of AI into the system level where we can really start to see the synergy effect and the benefits?

Meryem Arik: There are entire industries popping up that I'm really excited by, and one of them is tech-enabled services companies. For example, law firms bill by the hour, right? They're pure services businesses. And I'm incredibly excited by these new law firms that are popping up that are tech-first and AI-first and are fundamentally transforming the way that services businesses run. I think that'll be really exciting.

LLM Online Resources [35:53]

Srini Penchikala: To start wrapping up this discussion, do you have any recommendations or resources online that our listeners can check out to learn about LLMs in general or any specific technologies?

Meryem Arik: Yes. Well, obviously you should listen to this podcast, and listen to more of this podcast. We write a lot of blogs on our website. And if you are interested in quantization, which is something that I mentioned, we have an entire repo of free quantized models that are what we would call Titan certified. These are the best enterprise-appropriate models that you can use; you can find them on our Hugging Face page, Huggingface/TitanML or something like that. And then for courses on starting to learn with LLMs, there are a bunch of really good ones. The Hugging Face course in particular, I think, is quite good.

Wrap Up [36:38]

Srini Penchikala: Sounds good. Do you have any additional comments before we wrap up today's discussion?

Meryem Arik: No, it's been absolutely fantastic chatting to you, and I'm really looking forward to listening to your future podcasts as well.

Srini Penchikala: Thanks, Meryem. Thank you very much for joining this podcast. It's been great to discuss one of the very important topics in the AI space: self-hosted models and deploying those models into production. As we all know, any software solution requires more rigor and discipline after it goes into production; not just the development phase, but the post-production and post-deployment efforts are more significant. So thanks for sharing your thoughts on that. And thank you for listening to this podcast. If you would like to learn more about AI and ML topics, check out the AI, ML and Data Engineering community page on the infoq.com website. I encourage you to listen to the recent podcasts and also check out the articles and news items on the website. Thank you.

Mentioned:

  • ChatGPT
  • Anthropic
  • Llama 3
  • GPT-4o
  • Ollama
  • Retrieval Augmented Generation (RAG)
  • Hugging Face
  • Embedding search
  • Re-ranker search
  • 4-Bit Precision Scaling Laws
  • Docker and Kubernetes
  • OpenAI's Triton
