# NYU LANGONE HEALTH - OLAB LLM INFERENCE SERVICE
See below for information on the MODEL ZOO

# Introduction
This is the temporary documentation for the NYULH - OLAB H100 inference service. The service was created as an alternative to existing commercial and institutional offerings. Unlike commercial vendors, it offers free access to state-of-the-art open-source LLMs. Unlike existing free and institutional services, it is effortless to use, requiring only a simple OpenAI-style API call. The service is also built for high performance: it uses state-of-the-art hardware and serving techniques, and doubles as a research platform for our own investigations into LLM inference engineering. Lastly, and most importantly, it will eventually include access to our own internal models (NYUTron, Lang-One, etc...)

# GUI:

- The GUI can be accessed on the NYU Langone wifi network at http://10.189.26.12:31111

# API:

## Account Creation

1. Talk with Eric Oermann (eric.oermann@nyulangone.org) or Jaden Stryker (jaden.stryker@nyulangone.org) about getting access. For this initial trial we will authenticate you with your @nyulangone.org email and provide an account-specific API token.
2. Please let us know if you will be dealing with PHI. We will turn off logging and all requests will be ephemeral, but we'd like to be aware of any PHI passing through the platform.
3. Please do not share API keys. We want to keep track of the number of users, requests per minute, and other important statistics to improve the system.

## Making A Request

1. You must be on the NYU Langone campus network or shelled into an NYU server to use the inference service.
2. The API specification is modeled on OpenAI's, so if you're comfortable using the Azure OpenAI or OpenAI services, this will behave almost identically.
3. You can use any HTTP client, such as curl, Python's requests library, or the https module in Node.js. This userguide directory includes an example of making a request with the Python requests library.
4. Here are some examples:

## Examples

### Curl example
```bash
curl -X 'POST' 'http://10.189.26.12:30080/model/llama3-3-70b-chat/v1/chat/completions' \
  -H 'apiKey: eric.oermann@nyulangone.org' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "llama3-3-70b-chat",
  "messages": [
    {"role": "user", "content": "Hello there how are you?"},
    {"role": "assistant", "content": "Good and you?"},
    {"role": "user", "content": "When was NYU Langone Hospital founded?"}
  ],
  "max_tokens": 50,
  "top_p": 1,
  "n": 1,
  "stream": false,
  "stop": null,
  "frequency_penalty": 0.0
}'
```

### Python example
```python
import requests

url = "http://10.189.26.12:30080/model/llama3-3-70B-DSR1/v1/chat/completions"
headers = {
    "apiKey": "eric.oermann@nyulangone.org",  # your email (ALL LOWERCASE) goes here
    "accept": "application/json",
    "Content-Type": "application/json"
}
messages = [
    {"role": "user", "content": "Hello there how are you?"},
    {"role": "assistant", "content": "Good and you?"},
    {"role": "user", "content": "When was NYU Langone Hospital founded?"}
]
data = {
    "model": "llama3-3-70B-DSR1",
    "messages": messages,
    "max_tokens": 200,
    "top_p": 1,
    "n": 1,
    "stream": False,
    "stop": None,  # optional stop sequence(s)
    "frequency_penalty": 0.0
}
response = requests.post(url, headers=headers, json=data)
print(response.status_code)
print(response.json())
```
### OpenAI client example
```python
from openai import OpenAI
client = OpenAI(
    base_url="http://10.189.26.12:30080/model/llama3-3-70b-chat/v1/",
    api_key="jaden.stryker@nyulangone.org"
)
response = client.chat.completions.create(
    model="llama3-3-70b-chat",
    messages=[
        {"role": "user", "content": "San Francisco is a"}
    ],
    max_tokens=5,
    top_p=1,
    n=1,
    stream=False,
    stop='1111',
    frequency_penalty=0.0
)

print(response)
```

Models can be swapped by changing the model id (listed in the model zoo table) in both the URL and the request body, e.g. llama3-3-70B-DSR1 -> llama3-3-70b-chat.
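If you switch models often, the swap can be scripted. This is a minimal sketch (the helper name is illustrative, not part of the service) that builds the chat-completions endpoint for a given model id:

```python
BASE = "http://10.189.26.12:30080/model"

def chat_endpoint(model_id: str) -> str:
    """Build the chat-completions URL for a model id from the model zoo."""
    # Note: no trailing slash -- the gateway rejects trailing slashes.
    return f"{BASE}/{model_id}/v1/chat/completions"

print(chat_endpoint("llama3-3-70b-chat"))
```

Remember to also set the same model id in the `"model"` field of the request body.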

### Sampling Params
https://docs.vllm.ai/en/v0.6.4/dev/sampling_params.html
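Sampling parameters ride along in the same request body as the message list. The sketch below shows a request payload with a few commonly used parameters; the exact set supported may vary with the vLLM version backing each model, so treat the values here as illustrative defaults rather than recommendations:

```python
# Illustrative payload with common vLLM-style sampling parameters.
payload = {
    "model": "llama3-3-70b-chat",
    "messages": [{"role": "user", "content": "When was NYU Langone Hospital founded?"}],
    "max_tokens": 128,        # cap on the number of generated tokens
    "temperature": 0.7,       # higher values = more random sampling
    "top_p": 0.9,             # nucleus sampling cutoff
    "n": 1,                   # number of completions to return
    "frequency_penalty": 0.0, # penalize frequent tokens (0.0 disables)
    "stream": False,
}
```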

### Error Messages
- 401 -> **Unauthorized** -> Double-check your API key; make sure it is all lower case.
- 404 -> **Not Found** -> Double-check the URL.
- 500 -> **Internal Server Error** -> Make sure your messages total fewer than 4000 tokens (~3000 words). Otherwise this one is on our side; let jaden.stryker@nyulangone.org know if it continues to occur.
- 429 -> **'message': 'API rate limit exceeded'** -> You exceeded the maximum number of calls per second; back off and retry.
- **no Route matched with those values** -> Remove the trailing slash from the model URL.
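In a script, these errors can be handled programmatically. This is a hedged sketch (the helper names and the backoff schedule are our own, not part of the service) that maps the status codes above to troubleshooting hints and retries on rate-limit responses:

```python
import time
import requests

# Troubleshooting hints keyed by status code (see the error list above).
HINTS = {
    401: "Unauthorized: double-check your API key (all lower case).",
    404: "Not Found: double-check the URL.",
    429: "Rate limit exceeded: slow down and retry.",
    500: "Server error: check your prompt is under ~4000 tokens, else report it.",
}

def hint_for(status: int) -> str:
    """Return the troubleshooting hint for a status code."""
    return HINTS.get(status, "Unexpected status; see the error list above.")

def post_with_retry(url, headers, data, retries=3):
    """POST a request, backing off exponentially on 429 responses."""
    for attempt in range(retries):
        resp = requests.post(url, headers=headers, json=data)
        if resp.status_code != 429:
            return resp
        time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ...
    return resp
```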
