In this episode, Thomas Betts talks with Pamela Fox, a cloud advocate in Python at Microsoft. They discuss several ChatGPT sample apps that Pamela helps maintain. These include a very popular integration of ChatGPT with Azure OpenAI and Cognitive Search for querying enterprise data with a chat interface. Pamela also covers some best practices for getting started with ChatGPT apps.
Key Takeaways
- In a popular sample app, Azure Cognitive Search is combined with ChatGPT to provide a chat interface for finding information stored in enterprise documents.
- Retrieval Augmented Generation (RAG) is a technique for improving the usefulness of a large language model by providing it with a predetermined set of facts to use for its response.
- ChatGPT can act as a two-way interpreter between a person and Cognitive Search. Words typed by the person are translated into keywords that will provide better search results. And the results provided are then turned into natural language for the user to read.
- Building a system that incorporates an LLM will require effort to test and tune its behavior. Testing once and seeing a response isn’t enough, because the responses will change.
- Find the model that best fits your needs. For many cases, GPT-3.5 is sufficient. GPT-4 can provide better responses, but because it is slower and more expensive, it may not be the right choice.
Transcript
Intro
Thomas Betts: Hello, and thank you for joining us for another episode of the InfoQ Podcast. Today I'm speaking with Pamela Fox, a Cloud Advocate in Python at Microsoft. Pamela is one of the maintainers on several ChatGPT samples, which is what we're going to be discussing today. It seems like every company is looking for ways to incorporate the power of large language models into their existing systems. And if you're someone that's been asked to do that, or maybe you're just curious what it takes to get started, I hope today's conversation will be helpful. Pamela, welcome to the InfoQ Podcast.
Pamela Fox: Thank you for having me.
Thomas Betts: I gave a brief introduction. I just said you're a Cloud Advocate in Python at Microsoft, but tell us more. What does that role entail and what services do you provide to the community?
Pamela Fox: Yeah, it's a great question. So I think of my role as helping Python developers to be successful with Microsoft products, specifically Azure products, but also VS Code, GitHub Codespaces, Copilot, that sort of thing. As it turns out, there are a lot of Microsoft products out there and there are many ways that Python developers can use them. I even have a colleague who's working on the whole Python in Excel feature that recently came out. So all of that is something that our team works on.
So a lot of what we do is actually just deploy things. I'm technically a Python Advocate, but most of the time I'm actually writing infrastructure as code and deploying things to Azure servers, because a lot of what you want to do, if you're writing a Python web app, is you want to get it running on the cloud somewhere. So, quite a lot of my time is spent on that, but also using Azure Python SDK and that sort of thing.
Chat + Search Sample App [01:34]
Thomas Betts: This kind of jumps to the end of what we want to talk about, but it's what clued me into the work that you're doing. I first learned about one of the sample apps you work on at a local developer conference here, Denver Dev Day, in a presentation by Laurie Atkinson, so I've got to give a shout-out to the group and to her for talking about it. I think her talk may have just borrowed the title of the repo, which is ChatGPT Plus Enterprise Data with Azure OpenAI and Cognitive Search. That presentation covered a use case that I think a lot of companies would be able to adopt, and it went well beyond just a simple Hello ChatGPT app. But there are at least four moving parts just in that title: ChatGPT, enterprise data, Azure OpenAI, and Cognitive Search. Can you walk us through how all those pieces are connected and what each of them means?
Pamela Fox: Yeah. So this is the sample that I spend a lot of my time maintaining right now. It was originally made just as a conference demo, but I think it was one of the first example apps out there that showed how to make a chat on your own data. And so it's been deployed thousands of times at this point and it's got a community of people actually contributing to it.
What it does is use an approach called retrieval-augmented generation, RAG, and the idea is that you can keep ChatGPT constrained to your data by creating a prompt that says, "Answer this question according to this data," and then passing in chunks of data. So when we have this application and we get a question from the user, we need to take that question and search for the relevant documents. In this case, we're searching Azure Cognitive Search, where we can do a vector search or a text search. The best thing is actually to do a hybrid, to do both of those things. So then we get the results.
And these results are already chunked to be ChatGPT-sized, because you can't give ChatGPT too much information or it'll get distracted. There's a paper about this called, I think, Lost in the Middle. So you've got to keep the chunk size down. We get back the ChatGPT-sized chunks for that user question, then we put those chunks together and send them, along with the original user question, to ChatGPT and tell it to please answer. And then we get back the response. Sometimes we ask for follow-up questions. So that's a simplified version of it. We actually have different approaches that use slightly different prompting mechanisms and different chains of calls, and sometimes use ChatGPT function calls. But the simplest way of thinking of it is: get your search results in chunks, send those chunks with the user question to ChatGPT.
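As a rough illustration of the retrieve-then-generate loop Pamela describes, here is a minimal Python sketch assuming the azure-search-documents and openai packages; the endpoints, index name, field names, and deployment name are placeholders, not the sample's actual configuration.

```python
# Minimal RAG sketch (illustrative, not the repo's code): search for relevant
# chunks, then send the chunks plus the user's question to the chat model.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

search_client = SearchClient(
    endpoint="https://<your-search>.search.windows.net",
    index_name="docs-index",                      # placeholder index name
    credential=AzureKeyCredential("<search-key>"),
)
openai_client = AzureOpenAI(
    azure_endpoint="https://<your-openai>.openai.azure.com",
    api_key="<openai-key>",
    api_version="2024-02-01",
)

def answer(question: str) -> str:
    # 1. Retrieve the ChatGPT-sized chunks relevant to the question.
    results = search_client.search(search_text=question, top=3)
    # "sourcepage" and "content" are assumed field names for this sketch.
    sources = "\n".join(f"{doc['sourcepage']}: {doc['content']}" for doc in results)

    # 2. Send the sources plus the original question to the chat model.
    response = openai_client.chat.completions.create(
        model="gpt-35-turbo",  # Azure deployment name (placeholder)
        messages=[
            {"role": "system", "content": "Answer ONLY using the sources provided below."},
            {"role": "user", "content": f"Sources:\n{sources}\n\nQuestion: {question}"},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content
```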
Thomas Betts: And you said you start with your enterprise data. So what types of enterprise data are we talking about? Stuff that's in the applications I write? Stuff that's on our intranet, on SharePoint or shared drives or wherever people are storing things? Is there anything that it can or can't search very well?
Pamela Fox: Yeah, that's a good question. So right now, this demo actually only supports PDFs. As it turns out, PDFs are used quite a lot in enterprises, and you can also turn many things into PDFs. So this is a limitation, and we've developed other samples and are working on adding support for other formats as well, because people want HTML, they want CSV, they want database queries, that sort of thing. Right now this sample is really built for PDFs, and we ingest them using Azure Document Intelligence, which is particularly good at extracting things from PDFs. It'll extract all this information, and then we have logic that chunks it up into the optimal size for ChatGPT.
So that works for many people. I've got a branch where I wanted to have it work on documentation, so I crawled the documentation in HTML, converted that HTML into PDFs, and then ingested it that way. So anything you can turn into PDFs, you can use with this. Lots of people do connect it with stuff stored in their SharePoint or blob storage or whatever storage mechanism they're using, S3, whatever, but the idea is PDFs right now. There are lots of other repos out there that can help you with ingesting other data formats. You just need to get it into the search index in good chunks. That's the key: getting it into right-sized chunks.
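The repo has its own page-aware chunking logic; purely as an illustration of the idea of splitting extracted text into prompt-sized pieces, a naive version might look like this (the size and overlap values are arbitrary, not the sample's):

```python
# Illustrative chunking sketch: split extracted text into overlapping,
# roughly fixed-size pieces so each one fits comfortably into a prompt.
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        # Prefer to break at a sentence boundary near the end of the window.
        period = text.rfind(". ", start, end)
        if period > start and end < len(text):
            end = period + 1
        chunks.append(text[start:end].strip())
        start = max(end - overlap, start + 1)  # step forward, keeping some overlap
    return chunks
```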
Azure Cognitive Search with exact, vector, and hybrid models [05:40]
Thomas Betts: And what's the search index in this case? It's one of the Azure services I presume?
Pamela Fox: Yeah, so we are using Azure Cognitive Search and we do recommend using vectors with it. We started off with just text search, but then we added vectors, and the Cognitive Search team did a test to compare text, vectors, hybrid, and then also adding what's called an L2 re-ranking step. So you get back your search results and then you can apply an additional machine learning model, called an L2 re-ranker, and it just does a better job of getting things into the top spots that should be the top spots. They did a big analysis across various sample data and determined that the best approach overall is to use hybrid plus the re-ranker.
It's not always the best thing for every single query. I can give an example of a query where this won't work as well. Let's say you have a bunch of documents for weekly check-ins, and you've got weekly check-in number one, number 10, number 20. If you do a hybrid search for weekly check-in number one, that actually may not find number one, because if you're using a vector search, "number one" has a semantic similarity to a lot of things. That was an interesting situation, and the semantic search team is actually looking into it. But you will find that overall hybrid is the best approach; it is just interesting to see, especially with vectors, where it can mess up something that would've been better as an exact search.
So this is the sort of thing that when you bring in your own data, depending on what your own data looks like and you start doing experiments, you can see how these different search options are working for you. But it's interesting because a lot of times all you hear about these days is vector search, vector search, vector search, and vector search can be really cool because it can bring in things that are semantically similar like dog, cat, right? Bring in those. But if you are in a particular use case where you really don't want to get dog if you're asking for cat, then you have to be really careful about using vector search, right?
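For reference, a hedged sketch of what a hybrid query with semantic re-ranking might look like with a recent azure-search-documents Python SDK (11.4+); the vector field name and semantic configuration name here are assumptions, not the sample's actual settings.

```python
# Hybrid (text + vector) query with the L2 semantic re-ranker enabled.
from azure.search.documents.models import VectorizedQuery

def hybrid_search(search_client, query: str, query_vector: list[float]):
    return search_client.search(
        search_text=query,                      # keyword (text) search
        vector_queries=[VectorizedQuery(
            vector=query_vector,                # embedding of the same query
            k_nearest_neighbors=50,
            fields="embedding",                 # assumed vector field name
        )],
        query_type="semantic",                  # turns on the L2 re-ranking step
        semantic_configuration_name="default",  # assumed semantic config name
        top=3,
    )
```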
Thomas Betts: Yeah. So what's the layman's definition of vector versus exact text searching?
Pamela Fox: So exact text searching, you can think of it as string matching, though it can still include spell check and stemming. Stemming means that if you have a verb like walk, forms like walked or walking map back to the same stem, right? That's the sort of thing you would expect out of a good text search: spell check and stemming. That's going to work well for lots of things, but when we bring in vector search, that gives us the ability to bring in things that are similar ontologically. Imagine the space of words in our language, or in any language, because you can do it across multiple languages; imagine that space of words and, if you were going to cluster them, how they would be similar to each other.
So dog and cat, even though they're spelled completely differently in English, are semantically really similar because they're both animals, they're both pets. That means if you searched for dog and there were no results like that, but there was a result for cat, then you could end up getting that result. So it works well in the case where you didn't have an exact match but found something in a similar ontological space.
Thomas Betts: Yeah, I think the example I saw, and this is probably part of the demo, is searching for HR and benefits documents. And so going with that example, I was looking for, how do I get insurance for my dog? And it might come up with vet insurance in general and it'll figure out that that's kind of the area that you wanted to search in even though you didn't say veterinarian.
Pamela Fox: Yeah, yeah, and we were doing another test with looking for eye appointments and there it found vision, right? It never mentioned eye, so that sort of thing, even something like looking for eye and it found preventative, so it thought that preventative was similar to eye appointment, because I guess it's a form of preventative care. The semantic space can get things that are really similar and also can capture things that are a bit farther.
Integrating Cognitive Search with ChatGPT [09:19]
Thomas Betts: The computers are better than the humans at remembering all those little relationships you wouldn't think of sometimes. So I think we've covered the enterprise data and we've covered Cognitive Search. Then for integrating ChatGPT, like you said, you have to chunk the data you're feeding into ChatGPT along with the question. You explained it as, "Give me an answer, but here's the data I want you to base the answer on." So you aren't pointing ChatGPT at your search index. You're giving it the results of your search index, and that's chunked up?
Pamela Fox: Yeah, that's right. Yeah. There is actually another team that's working on an actual extension to ChatGPT where you would actually just specify, "Here is my search index," and it would just use that search index. This is such a common use case now that everybody is trying to figure out how can we make this easier for people? Because clearly there's a huge demand for this. Lots of enterprises want to do this actually. So there's lots of different teams trying to come up with different approaches to make this easier, which is great because we want to make it easier for people. So there is a team that's working on an extension to the ChatGPT API where you would literally specify, "This is my search index," and it would basically do what we're doing behind the scenes.
In our sample, we do it manually, which is cool if you want to be able to tweak things a bit further and actually have control of the prompt. If you're trying to bring in very different sources as well, you could bring those in. So in our repo, we've got the system message. The system message is the first thing you tell ChatGPT to give it its main guidance. We say something like, "Okay, ChatGPT, you are a helpful assistant that looks through HR documents. You'll receive sources. They're in this format. You need to answer according to the sources. Here's an example. Now here's the user question and here are the sources. Please answer it."
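To make that concrete, an illustrative system message in the spirit of what Pamela describes might look like the following; this is a hypothetical prompt, not the repo's exact wording.

```python
# Hypothetical system message: ground the model in the supplied sources and
# require citations, with one worked example baked into the prompt.
system_message = (
    "You are an assistant that helps employees with questions about HR documents. "
    "Answer ONLY with facts from the sources below. Each source has a name followed "
    "by a colon and its content; cite the source name for each fact you use. "
    "If the answer is not in the sources, say you don't know.\n"
    "Example:\n"
    "Sources:\nbenefits.pdf: Employees may enroll in vision coverage during open enrollment.\n"
    "Question: Does my plan cover eye exams?\n"
    "Answer: Vision coverage is available during open enrollment [benefits.pdf]."
)
```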
Thomas Betts: And I like the idea of making that just plug and play as opposed to someone has to do that setup, because it seems like there's a little bit of fine tuning. Going through the example, it's fairly straightforward how you could get set up and then start plugging your data in. And then you said you have to practice it and figure out what's right for your specific case. How do you do all the little tuning? How does someone go through and figure out what is the right tuning setup for their environment?
Pamela Fox: That's a good question. So people will check out the repo, they'll try it with the sample data, then they'll start bringing in their own data and start doing questions against it. And usually, they start taking notes like, "Okay, it seems like the citation was wrong here. It seems like the answer was wrong here. Maybe the answer was too verbose." And in that case, I tell them to start breaking down. So we actually show the thought process in our UI to help people with debugging what's happened, because the thing you have to figure out is, is the issue that your data was chunked incorrectly? Sometimes that happens, so yeah, that was a situation we saw where the data wasn't chunked optimally, it was chunked in the middle of a page and we just needed to have a different chunking approach there.
Is the issue that Cognitive Search didn't find the optimal results? There you want to look at things like, are you using hybrid search? Are you using the re-ranker? What happens when you change those things? And then finally, is the issue that ChatGPT wasn't paying attention to the results? Most often, ChatGPT is actually pretty good at paying attention to results. The issues with ChatGPT we've seen are more around it being too verbose, giving too much information, or just not formatting something the way somebody wanted, like if they wanted Markdown versus a list or something like that. A lot of times the issue is actually at the search stage, because searching is hard, and you have this vision in your head that this is obviously the right search result, but it may not actually be what's returned. There's lots of configuring you can do there to improve the results.
Using ChatGPT as the two-way interpreter [12:55]
Thomas Betts: Yeah, like I said, there are at least four moving parts, and you have to identify which one is causing you to go a little bit off of where you're trying to get to. And so it might be Cognitive Search. When you're asking the question, is that all part of Azure Cognitive Search, or are you feeding the question into ChatGPT and it's turned into something else that you ask Cognitive Search?
Pamela Fox: Okay, so yeah, you got it. That's actually what we do. I often gloss over that, but in our main approach, we actually take the user query and tell ChatGPT to turn it into a good keyword search. And we give it a lot of examples too, so we use what's called few-shot prompting. We give it multiple examples of, "Here's a user question, here's a good keyword search. Here's a user question, here's a good keyword search." We're trying to account for the fact that many users don't write things that are necessarily the optimal thing to send into a search engine.
So that's actually the first call we make to ChatGPT is to turn that query into an appropriate keyword search. So that would be another thing to look at when you're debugging this, if you're not liking the results, did ChatGPT do a good job of turning that user question into an appropriate keyword query? And usually it does, but it's another step to look into.
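A hedged sketch of that first call, with made-up few-shot examples; the prompt, examples, and deployment name are illustrative, not the repo's actual ones, and openai_client is assumed to be the client from the earlier sketch.

```python
# Rewrite the user's question into a concise keyword query, using few-shot examples.
def build_search_query(openai_client, user_question: str) -> str:
    messages = [
        {"role": "system", "content": "Generate a concise search query for the question "
                                      "below. Return only the keywords, no quotes or explanations."},
        # Few-shot examples: user question -> good keyword query
        {"role": "user", "content": "What does my health plan say about getting glasses?"},
        {"role": "assistant", "content": "health plan vision coverage glasses"},
        {"role": "user", "content": "How many vacation days do new hires get?"},
        {"role": "assistant", "content": "vacation days new employee policy"},
        {"role": "user", "content": user_question},
    ]
    response = openai_client.chat.completions.create(
        model="gpt-35-turbo", messages=messages, temperature=0.0
    )
    return response.choices[0].message.content.strip()
```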
Thomas Betts: So it sounds like you've got ChatGPT as the interpreter both going in and coming out of everything that's underneath it, all of the data, all of the Cognitive Search, but the idea that the computer is better at talking to the other computers, let's put that barrier both in and out. So it translates it from human into keywords and then from responses back into human.
Pamela Fox: Yeah, that's right, which is very interesting. Something I should point out is that there's a lot of prompting involved here, so we might start messing with the prompts, doing some prompt tweaking, prompt engineering as they call it, and we might think, "Oh, okay, this does improve the results." But in software development, we want to have some amount of confidence that the improvements are real, tangible improvements, especially with ChatGPT, because ChatGPT is highly variable. You can't test it once and say, "Oh, that was definitely better," because it's going to give a different response every time, especially right now. There's a temperature parameter you can set between zero and one, where one is the most variable and zero the least, but even at zero you'll have variability with the way LLMs work. We have 0.7 right now, a huge amount of variability. So how do you actually know, if you have changed the prompt, that it is actually an improvement?
So I'm working on a branch to add an evaluation pipeline. What you do is come up with a bunch of ground truth data, which is question-answer pairs, and don't worry, we can use ChatGPT to generate this ground truth data: you point it at your original input data and say, "Come up with a bunch of questions and answers based off this data." So you have your ground truth data, then you point the evaluator at that data and at your current prompt flow and tell it to evaluate it. What it actually does is call your app, get a result, and then use ChatGPT to evaluate it.
Usually you want to use GPT-4 in this case, because GPT-4 is the more advanced one. This is a use case where you usually want GPT-4, and it's okay that it's a little more expensive, because you're not going to use it for that many queries. For every single question and answer, you ask it, "Hey, here's what we got. Here's the ground truth. Can you please measure this answer in terms of relevance, groundedness, fluency," and some other metrics I don't remember. But that's the approach to evaluation, and hopefully that's what will enable people to easily say, for their chat apps, "Okay, I've made a change. Is it legitimately a better change before I merge this prompt change into production?"
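A rough sketch of that evaluation loop: for each ground-truth question/answer pair, call the app, then ask GPT-4 to grade the app's answer. This is illustrative only; the grading prompt and metric names are simplified, "gpt-4" is a placeholder deployment name, and call_app stands in for whatever invokes your chat app.

```python
# GPT-4-as-judge evaluation sketch over ground-truth question/answer pairs.
import json

def evaluate(openai_client, call_app, ground_truth: list[dict]) -> list[dict]:
    scores = []
    for pair in ground_truth:                # e.g. {"question": ..., "truth": ...}
        app_answer = call_app(pair["question"])
        grading = openai_client.chat.completions.create(
            model="gpt-4",                   # use the stronger model as the judge
            messages=[{
                "role": "user",
                "content": (
                    "Rate the answer from 1 to 5 for relevance and groundedness. "
                    'Reply with JSON only, like {"relevance": n, "groundedness": n}.\n'
                    f"Question: {pair['question']}\n"
                    f"Ground truth: {pair['truth']}\n"
                    f"Answer: {app_answer}"
                ),
            }],
            temperature=0.0,
        )
        # Simplification: assumes the judge returns valid JSON.
        scores.append(json.loads(grading.choices[0].message.content))
    return scores
```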
Thomas Betts: Yeah, that's a lot to think about, because people want to know that it's working, but I can't just write a unit test to say, "Oh, it's good," because calling it isn't good enough. The responses are what matter. And the response changes even if you tell it to give you the same thing. Was it the temperature, the value that you feed it? If you say, "Set that to zero, don't change it at all," it still gives different answers?
Pamela Fox: Yep.
Thomas Betts: Yeah, these are the things that people need to know when they start out, when it's like, "It doesn't do quite what I expected, and how do I figure that out?" So providing those in the samples and tutorials is very helpful, to say, "Hey, we know it's going to be a little different, but that's expected," and set people's expectations.
A simple chat app and best practices [17:16]
Thomas Betts: So, I really like the sample. I think it's really useful. Like you said, a lot of corporate partners or a lot of companies are going to want to do something like that, but what if corporate data and Cognitive Search isn't what somebody's going to get started on? You have another simple chat app. What do you think that that's meant to teach developers who pull that down and go through the tutorial?
Pamela Fox: Ah, you found my other sample. Very few developers pull that down, because most people want the enterprise chat app. That app was an experiment to make sure we could use best practices like containerization, so that one actually gets deployed to Container Apps, and also to show very simply how one can use managed identity. It's trying to be the minimal example of various best practices: containerization, managed identity, and streaming, it also shows how to do streaming. And it uses an asynchronous framework. It's only 20 lines of code, I think, compared to this other app, which is, I don't know, getting on to hundreds or thousands now. But the goal is to be a succinct example of some of the high-level best practices for using these SDKs.
Thomas Betts: Yeah. And I think that's useful because sometimes I go onto Stack Overflow and I want to just post, I'm like, "I'm having this bug." And what's useful is when someone's able to produce the smallest amount of code that reproduces their bug and it's like just the act of doing that sometimes answers your own question, but instead of pulling in all of these things and wondering which of the large moving parts isn't working, having the simple app to just get started. And like you said, it can be useful just to teach those things, can I create the containers and get it deployed in my environment? So I think that's useful.
Mocking ChatGPT [18:51]
Thomas Betts: You did highlight some of the things I wanted to get into, because I read through your blog and found a series of posts on best practices for OpenAI chat apps, and I have a feeling they all came out of this sample. If we can, let's just go through some of them. The first one that I thought was interesting was about mocking the calls to OpenAI when you're testing, and that's counterintuitive, because I thought, isn't the whole point of this that I want to test that it's working? Why would I mock that?
Pamela Fox: Well, we have different levels of tests. So at this point in the code base, I've got two levels or maybe I guess three levels of tests. So I've got unit tests, function in, function out, I've got integration tests, and those integration tests, I want to be able to run them really quickly. So that is where I'm mocking out all of the network calls, I don't want my integration tests to make any network calls because I run all of them in a minute. So I run hundreds of tests in a minute. And then even my end-to-end test, so those are using Playwright, which is like Selenium. So if you've done any sort of browser end-to-end testing, you're going to use one of these tools.
And it's actually kind of fun. In the Python backend tests, I use snapshot testing, which is the idea that you save a snapshot of the results. So I save the response I get from the server into a file, and then going forward, the file always gets diffed. If anything changed in that response, the test will fail, and either I need to fix the issue or I say, "Okay, actually it was supposed to change because I changed the prompt or something," and then it updates all the snapshots. So I've got all these snapshots that show what the responses should look like for particular calls. And then in my front-end tests, my end-to-end tests, I use those snapshots as the mocks for the front end. So the front end is testing against the results of the backend, which is pretty cool, because it means at least the front end and the backend are synced up with each other in terms of your tests.
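A sketch of the general pattern: mock out the network call so tests run fast, and snapshot the response so any change shows up as a diff. This is illustrative; myapp, its chat_with_openai helper, and the client fixture are hypothetical stand-ins (a Flask-style test client), not the repo's actual fixtures or snapshot plugin.

```python
# Mock the OpenAI call and compare the endpoint response against a saved snapshot.
import json
from pathlib import Path

def test_chat_endpoint(client, monkeypatch):
    # No network: replace the helper that would call Azure OpenAI with a canned answer.
    monkeypatch.setattr("myapp.chat_with_openai",
                        lambda messages, **kwargs: "Mocked answer [benefits.pdf]")

    response = client.post("/chat", json={"question": "Does my plan cover eye exams?"})
    body = response.get_json()

    # Snapshot testing: record the response the first time, diff against it afterwards.
    snapshot_file = Path("tests/snapshots/chat_eye_exams.json")
    if not snapshot_file.exists():
        snapshot_file.write_text(json.dumps(body, indent=2, sort_keys=True))
    assert body == json.loads(snapshot_file.read_text())
```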
Now the final question is, how do we test that something behind the mocked calls doesn't change? If OpenAI changes their SDK, or if any of the backend network calls are acting funny, we could still have a broken app. So we still need what I'd call smoke tests, which is: you've got your deployed app, does your deployed app work? I do have a to-do on a Post-it here that says to write smoke tests. What I'd probably do is something really similar to my Playwright tests, but I just wouldn't mock out the backend, I would run it against the real thing. I haven't set that up yet, mostly because it does require authentication, and we're figuring out the best way to store our authentication in a public repository. It would be a lot easier if this was a private repo, but because this is a public repo, we've been debating the right approach to having CI/CD do a deploy and a smoke test. In a private repo, I think it would be more straightforward.
Avoid using API keys [21:26]
Thomas Betts: I'm going to jump from that onto one of your other tips, which was about security and authentication, and I think people are used to using API keys for authentication and it seems like I just get my API key and I'll shove it in there. And you said, don't do that. And I know you're talking about in the world of Azure, but I think you talked about using a keyless strategy. Why do you think that's important as opposed to just API keys? Because they're easy.
Pamela Fox: Yeah, they certainly are easy, but it's fine if it's a personal project. But when you're working inside a company, you increasingly do not want to use keys because the thing is if you're inside a big company, like Microsoft or maybe smaller than Microsoft, but anyone in the company could, in theory, use that key. If they get ahold of that key, they can now use that key. And so you can end up in this situation where multiple teams are using the same key and not knowing it. So that means you're using up each other's quota, and how do you even find out where these other people are that are using your key, right? That's an awkward thing. It's actually something that my friend ran into the other day with using keys at their company. They're like, "I can't figure out who's using our team's key."
So that's a situation, but then obviously huge security issues. I see people push their keys to GitHub every day. It just always happens, right? You put your key in a .env file and you accidentally check that in, even though we have it in our Git Ignore and now your key's exposed. So, there's both security and there's tracking. And when you're working inside a company, it's better to use some sort of keyless strategy. In this case, what we do is we give explicit roles. So we make a role for the hosted platform. We say, "Okay, this app, this App Service app has the role where it's allowed to access this particular OpenAI." So we set up a very specific role access there and then also we set it up for the local user and say, "This local user specifically can use this OpenAI." So we're setting up a very specific set of roles and it just makes it a lot clearer who can do what and you don't end up with this loosey goosey, everyone's using each other's key.
Thomas Betts: And then going back to the sample application, does everyone just have to be on Azure Active Directory and that just allows you to use individuals? Or are you still talking about an application account that I set up for my team that isn't just the one API key?
Pamela Fox: Let's see. The way we did it for this sample is that we create a role for the app service that you deploy to and then we just create a role for the local user. I think you could, in theory, create what's called a service principal I think, and then use that and grant it the roles. We even have a script you can run that'll go and assign all the necessary roles to a current user. So I think you could use any approach, but we default to setting up the user's local roles and the deployed apps roles.
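As an illustration of the keyless pattern, a hedged sketch using DefaultAzureCredential with the Azure OpenAI client might look like this, assuming the azure-identity and openai packages; the endpoint and API version are placeholders.

```python
# Keyless sketch: authenticate with Azure AD roles instead of storing an API key.
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

# Locally this picks up your `az login` user; when deployed it can use the
# app's managed identity, assuming the right role assignments are in place.
credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(
    credential, "https://cognitiveservices.azure.com/.default"
)

openai_client = AzureOpenAI(
    azure_endpoint="https://<your-openai>.openai.azure.com",
    azure_ad_token_provider=token_provider,   # no key stored anywhere
    api_version="2024-02-01",
)
```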
The importance of streaming [24:04]
Thomas Betts: So one of the other tips you mentioned already was streaming. You set that up. Why is streaming important? I think, again, it's easy to set up request response, but the user experience that people see when they use ChatGPT or any of the other ones, it's constantly spitting the words of text out. So is that what the streaming interface gets you and why is that something to do and why is it complicated to set up?
Pamela Fox: Yeah, streaming has been a whole thing. My gosh, I was actually just debugging a bug with it this morning. When this sample first came out, it did not actually have streaming support, but it became a big request, so we ended up adding streaming support, and there are a lot of benefits to it. One is the actual performance. If you have to wait for the whole response to come back, you do have to wait longer than with streaming, because it's not just getting streamed from your server, it's getting streamed from OpenAI. You can imagine that our server is opening up a stream to OpenAI, and as soon as it's getting tokens in from OpenAI, it's sending them to the front end. So especially for long responses, the user's experience will be that the response comes quicker, because they start to see those words flow in sooner.
Because if it was just a matter of people liking the word-by-word effect, then you could just get the whole response from ChatGPT and fake it out on the front end. But you want to actually get that performance benefit, especially with long responses, where you start seeing the words as soon as ChatGPT starts generating them.
So that's why it's important and why people like it: they're used to it, perceived better performance, faster response. In terms of the complexity, it's very interesting, because when I first implemented streaming, I used this protocol called server-sent events, and we can link to an explanation of it. Server-sent events is a protocol where you have to emit these events from your server that have "data:" in front of them, and then on the front end you have to parse them, pulling out what's after the "data:", and it's a whole thing. It actually requires a fair amount of effort, because on your server you've got to be outputting these "data:" formatted events, and on the front end you've got to parse those, and then you have to do an explicit closing of the connection.
The reason I used server-sent events is because that's actually what ChatGPT uses behind the scenes. Their REST API is actually implemented using server-sent events. Most people don't know that because they're using the SDKs on top. Most of us, us as well, use the SDK on top, which just produces a stream of objects as a Python generator, and we consume it that way. But behind the scenes it is actually using server-sent events, and so that's what everybody told me to use. But then I actually tried it and realized, "Oh my gosh, this is not a good developer experience and we do not need this complexity."
So I changed it to instead use just a simple HTTP stream. All that means is that your header is Transfer-Encoding: chunked. That's it. You set a Transfer-Encoding: chunked header and then use your framework to stream out a response. The response will come into the front end a chunk at a time, and what we do is stream out newline-delimited JSON, also known as JSON Lines or line-delimited JSON, there are lots of names for it, but basically you stream out chunks of JSON separated by new lines, and then on the front end you piece those back together until you've got fully parsed JSON. That part's a little tricky, so I did make an npm package for it. If you find yourself doing that, you can use my npm package and it'll do the partial JSON parsing and just give it to you as an asynchronous iterator.
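A hedged sketch of the backend half of that approach, using Quart (an async, Flask-like framework) to stream newline-delimited JSON over a chunked HTTP response; openai_client is assumed to be an async Azure OpenAI client configured elsewhere, and the route and deployment name are placeholders.

```python
# Stream newline-delimited JSON chunks to the front end as the model generates them.
import json
from quart import Quart, request

app = Quart(__name__)

@app.post("/chat/stream")
async def chat_stream():
    body = await request.get_json()

    async def generate():
        # `openai_client` is assumed to be an openai.AsyncAzureOpenAI instance.
        stream = await openai_client.chat.completions.create(
            model="gpt-35-turbo",
            messages=[{"role": "user", "content": body["question"]}],
            stream=True,
        )
        # Each streamed OpenAI delta becomes one JSON object on its own line.
        async for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                yield json.dumps({"content": chunk.choices[0].delta.content}) + "\n"

    # Returning a generator makes Quart send the response with Transfer-Encoding: chunked.
    return generate(), 200, {"Content-Type": "application/x-ndjson"}
```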
Thomas Betts: Yeah, I've dealt with JSON Lines or JSONL or whatever you want to call it. Everyone has a different name for it, but that makes a lot of sense now that you say each line comes across and eventually you get the full JSON, but you're missing the first and last curly brace and the brackets for the array. It's everything in between.
Examples in other languages than Python [28:07]
Thomas Betts: So all of your examples, because you are a Python advocate, are in Python, but is there anything that's Python specific? I'm mostly a C# developer. Would I be able to read through this and say, "Okay, I can figure out how to translate it and do the same thing?" Are there other samples out there in other languages?
Pamela Fox: Yeah, that's a great question, because we've actually been working to port this sample to other languages. It's really interesting, because for lots of people using the sample, this might clearly be the first time they're using Python, and it's great to bring people over to the Python side, but if you're a C# developer, I don't want to force you to like Python. Everyone has their own particular language and we're basically never going to agree on that, so we're going to have a billion languages forever.
So knowing that, we have ported the sample over to multiple languages: we have one in C#, one in Java, and one in JavaScript with a Node backend. We're trying to have feature parity across them. They're not perfectly in sync with each other; the Python sample, because it's very popular and has been out for a while, does have a few more things, more experimental things. But we've agreed on a common protocol, so it's cool, you could actually use the JavaScript front end, which uses web components, with our backend, because we're trying to speak the same protocol. So we're trying to make it so that you can pick the language of your choice. There will probably be slight differences, especially since the OpenAI SDK is probably going to be slightly different across each of them. So, yeah, pick your flavor.
Pricing [29:23]
Thomas Betts: And then, since you brought it up, I feel like I have to ask something about pricing. Every time I talk to someone at Microsoft, the canned answer is, "I'm not an expert on pricing," and that's fine. I know the answer is always, "It depends," but you brought up a good point about sharing your API keys and someone else starting to use your quota. I think people understand that large language models have a cost, and people aren't really sure, should I use GPT-3 or 3.5 or 4? In general terms, what are some of the big points of concern where cost becomes a factor? Whether you're using the sample apps or building a custom solution that uses similar resources, what are the big gotchas people need to look out for so their pricing doesn't go off the rails?
Pamela Fox: I think one of the things that surprises people is that our repo defaults to using Azure Document Intelligence for extraction, because it's very good at PDF extraction, and that's a service that costs money. If you were ingesting thousands and thousands of PDFs, that will run up a budget. For the sample data it doesn't, but if you are ingesting a huge number of PDFs it definitely will, and it's a per-page cost. We have a link to the pricing so you can do the calculation, and I have seen people comment on that. You can use your own parser; we also support a local PDF parser, a Python package that just does local PDF parsing, so if that is good enough, you could use it. We try to have backup options where possible.
The other thing is Azure Cognitive Search. The pricing there is going to depend on whether you're using options like semantic search and whether you need additional replicas. We think most people are fine with the default number of replicas, but semantic search does currently cost extra. We're not supposed to give exact prices, but it is around a couple hundred dollars a month right now, depending on the region. So that is definitely money. For some enterprises, that's not significant, right? Because it's cheaper than paying someone to build a search engine from scratch; I don't know how to build an L2 re-ranker, I just learned that term. If that's prohibitive for someone, they would turn off semantic search. And Cognitive Search does have, I think, a free tier, but I don't know that we default to it. So Cognitive Search can cost money.
And then there's OpenAI. I think our costs are actually similar to or the same as OpenAI's, but don't quote me... this is a podcast, so I guess you're going to quote me, but you can look at the prices for that. That's going to be per token, and that's why people do prompt engineering, to try not to send in too many tokens. It's also per model, right? You're asking about GPT-3.5 versus 4: we tell people to try 3.5. I use 3.5 for all of my samples as a default, and it seems pretty good. So we tell people to start with 3.5 and see how far you can go with it, because you don't want to go to 4 unless it's really necessary, since 4 is both going to be slower and going to cost more. Especially for something user-facing, given that it's usually a little slower right now, you don't necessarily want that.
So that's also why evaluation pipelines are important because ideally you could check 3.5 against your evaluation pipeline, check 4, and see is the difference really big? And also look at your latency rate. What's the latency rate like if I use 3.5 versus 4, and is that important to us?
Thomas Betts: Well, that's a lot to get into. And like you said, none of this is free, but it's all useful. So it's up to individual companies to figure out, is this useful for what we want to do? And yeah, probably cheaper than paying someone to write it all from scratch.
Following up [32:35]
Thomas Betts: So, we'll include links to your blog, to the sample apps in the show notes. If people want to start using them and they have questions, what's the best way to get in touch with you or your team?
Pamela Fox: So, I subscribe to all of the issues on that repo, so just filing an issue in that repo is a pretty good way of getting in touch with me. It goes straight to my inbox, because I have not figured out Outlook filters yet, so I just have an inbox full of those issues. So that's one way. There's also the AI for Developers Discord that the AI Advocacy team started.
Thomas Betts: Well, I hope some of this has been useful to our listeners and they now know a few more ways to get started writing apps that use ChatGPT and Azure OpenAI. Pamela Fox, thank you so much for joining me today.
Pamela Fox: Sure. Thank you for having me on. Great questions.
Thomas Betts: And listeners, I hope you join us again on a future episode of the InfoQ Podcast.