There are various ways to run inference with LLMs, including the OpenAI API, Hugging Face Transformers, and Ollama. However, because company security policy restricts the use of ChatGPT, approaches that rely on OpenAI are unfortunately excluded here. In this tutorial, we focus on setting up LangChain with Hugging Face Transformers. (For reference, the two main frameworks for building RAG applications are LangChain and Semantic Kernel.) As part of this study, we experimented with two models hosted on Hugging Face: polyglot-ko-1.3b and Llama-3.2-3B-Instruct.
1. Steps to Build LangChain
- Load the model and tokenizer using transformers
- Configure the pipeline
The first two steps are identical to the basic usage of an LLM (the explicit approach; see the short sketch after this list).
- Create a HuggingFacePipeline object for LangChain
- Generate a prompt
- Create a LangChain
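For comparison, the explicit approach without LangChain simply calls the Transformers pipeline directly. A minimal sketch, assuming the pipe object configured in Step 2 of the full script below:

# Explicit approach (no LangChain): call the text-generation pipeline directly.
# Assumes pipe has already been built as in Step 2 of the script below.
result = pipe("What is the capital of the United States?")
print(result[0]["generated_text"])  # the pipeline returns a list of dicts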
from langchain.prompts import PromptTemplate
from langchain_huggingface import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

myModel = "Llama-3.2-3B-Instruct"
if myModel == "Llama-3.2-3B-Instruct":
    model_id = "./Pretrained_byGit/Llama-3.2-3B-Instruct"
else:
    model_id = "./Pretrained_byGit/polyglot-ko-1.3b"

# Step 1: Load the model and tokenizer using transformers
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = 0  # You can choose 0, 50256, or another token ID
    print(f"Pad token ID is set to: {tokenizer.pad_token_id}")
else:
    print(f"Pad token ID already set: {tokenizer.pad_token_id}")

# Step 2: Configure the pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.1,
    pad_token_id=tokenizer.pad_token_id,
)

# Step 3: Create a HuggingFacePipeline object for LangChain
llm = HuggingFacePipeline(pipeline=pipe)

# Step 4: Generate a prompt
if myModel == "Llama-3.2-3B-Instruct":
    template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a friendly AI assistant. Your name is DS2Man. Please answer questions briefly.
<|eot_id|><|start_header_id|>user<|end_header_id|>{question}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
else:
    template = "### Question: {question}### Answer:"

prompt = PromptTemplate.from_template(template)

# Step 5: Create a LangChain
chain = prompt | llm
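Before invoking the chain, it can be useful to confirm that the prompt template renders as expected. A minimal sanity check using PromptTemplate.format (the question text here is only a placeholder):

# Render the template without calling the model, to inspect the final prompt.
rendered = prompt.format(question="What is the capital of the United States?")
print(rendered)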
2. invoke
- invoke processes the input in a single call and returns the complete response at once.
- When a user provides input, the model generates the entire result and then returns it in one go.
- The response is provided only after the model completes generating all the text.
response = chain.invoke({"question": "What is the capital of the United States?"})
print("Invoke Result:")
print(response)
Invoke Result:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a friendly AI assistant. Your name is DS2Man. Please answer questions briefly.
<|eot_id|><|start_header_id|>user<|end_header_id|>What is the capital of the United States?
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
The capital of the United States is Washington, D.C.
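Note that the echoed prompt in the output comes from the text-generation pipeline's default of returning the full text (prompt plus completion). If you only want the newly generated portion, one option, sketched below with the same arguments as Step 2, is to set return_full_text=False when building the pipeline:

# Variant of Step 2 that returns only the newly generated text.
# return_full_text=False drops the echoed prompt from the pipeline output.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.1,
    pad_token_id=tokenizer.pad_token_id,
    return_full_text=False,
)
llm = HuggingFacePipeline(pipeline=pipe)
chain = prompt | llm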
3. stream
- stream returns partial results in real time as the model generates text.
- You can watch the response being produced chunk by chunk instead of waiting for the full answer.
from langchain_core.messages import AIMessageChunk

response = chain.stream({"question": "What is the capital of the United States?"})
print("Streamed Result:")
answer = ""
iflag = 0
for chunk in response:
    if isinstance(chunk, AIMessageChunk):
        # Chat models yield AIMessageChunk objects; report the type once.
        if iflag == 0: print("The type of chunk is AIMessageChunk...")
        iflag += 1
        answer += chunk.content
        print(chunk.content, end="", flush=True)
    elif isinstance(chunk, str):
        # Plain LLMs such as HuggingFacePipeline yield str chunks.
        if iflag == 0: print("The type of chunk is str...")
        iflag += 1
        answer += chunk
        print(chunk, end="", flush=True)
    else:
        if iflag == 0: print(f"The type of chunk is {type(chunk)}...")
        iflag += 1
Streamed Result:
The type of chunk is str...
The capital of the United States is Washington, D.C.
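As the output confirms, a chain that ends in an LLM (rather than a chat model) streams plain string chunks, so the type checks above can be reduced to a simple loop. A minimal sketch under that assumption:

# Minimal streaming loop, assuming each chunk is a plain string.
answer = ""
for chunk in chain.stream({"question": "What is the capital of the United States?"}):
    answer += chunk
    print(chunk, end="", flush=True)
print()  # final newline after the streamed answer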