There are various ways to run inference with LLMs, including the OpenAI API, Hugging Face Transformers, and Ollama. However, because company security policy restricts the use of ChatGPT, approaches that rely on OpenAI are unfortunately excluded here. In this tutorial, we focus on setting up LangChain with Hugging Face Transformers. (For reference, the two main frameworks for building RAG applications are LangChain and Semantic Kernel.) As part of this study, we experimented with two models hosted on Hugging Face: polyglot-ko-1.3b and Llama-3.2-3B-Instruct.
1. Steps to Build LangChain
- Load the model and tokenizer using transformers
- Configure the pipeline
The first two steps are identical to the basic usage of an LLM (the explicit approach; see the short sketch after this list).
- Create a HuggingFacePipeline object for LangChain
- Generate a prompt
- Create a LangChain
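For comparison, the explicit approach without LangChain simply calls the Transformers pipeline directly. A minimal sketch, assuming the pipe object configured in Step 2 of the full script below:

# Explicit approach (no LangChain): call the text-generation pipeline directly.
# Assumes pipe has already been built as in Step 2 of the script below.
result = pipe("What is the capital of the United States?")
print(result[0]["generated_text"])  # the pipeline returns a list of dicts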
from langchain.prompts import PromptTemplate
from langchain_huggingface import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

myModel = "Llama-3.2-3B-Instruct"
if myModel == "Llama-3.2-3B-Instruct":
    model_id = "./Pretrained_byGit/Llama-3.2-3B-Instruct"
else:
    model_id = "./Pretrained_byGit/polyglot-ko-1.3b"

# Step 1: Load the model and tokenizer using transformers
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = 0  # You can choose 0, 50256, or another token ID
    print(f"Pad token ID is set to: {tokenizer.pad_token_id}")
else:
    print(f"Pad token ID already set: {tokenizer.pad_token_id}")

# Step 2: Configure the pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.1,
    pad_token_id=tokenizer.pad_token_id,
)

# Step 3: Create a HuggingFacePipeline object for LangChain
llm = HuggingFacePipeline(pipeline=pipe)

# Step 4: Generate a prompt
if myModel == "Llama-3.2-3B-Instruct":
    template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a friendly AI assistant. Your name is DS2Man. Please answer questions briefly.
<|eot_id|><|start_header_id|>user<|end_header_id|>{question}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
else:
    template = "### Question: {question}### Answer:"

prompt = PromptTemplate.from_template(template)

# Step 5: Create a LangChain
chain = prompt | llm
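Before invoking the chain, it can be useful to confirm that the prompt template renders as expected. A minimal sanity check using PromptTemplate.format (the question text here is only a placeholder):

# Render the template without calling the model, to inspect the final prompt.
rendered = prompt.format(question="What is the capital of the United States?")
print(rendered)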
2. invoke
- invoke processes the input in a single call and returns the complete response at once.
- When a user provides input, the model generates the entire result and then returns it in one go.
- The response is provided only after the model completes generating all the text.
response = chain.invoke({"question": "What is the capital of the United States?"})
print("Invoke Result:")
print(response)
Invoke Result:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a friendly AI assistant. Your name is DS2Man. Please answer questions briefly.
<|eot_id|><|start_header_id|>user<|end_header_id|>What is the capital of the United States?
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
The capital of the United States is Washington, D.C.
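Note that the echoed prompt in the output comes from the text-generation pipeline's default of returning the full text (prompt plus completion). If you only want the newly generated portion, one option, sketched below with the same arguments as Step 2, is to set return_full_text=False when building the pipeline:

# Variant of Step 2 that returns only the newly generated text.
# return_full_text=False drops the echoed prompt from the pipeline output.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.1,
    pad_token_id=tokenizer.pad_token_id,
    return_full_text=False,
)
llm = HuggingFacePipeline(pipeline=pipe)
chain = prompt | llm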
3. stream
- stream returns partial results in real time as the model generates text.
- You can watch the response being produced chunk by chunk instead of waiting for the full answer.
from langchain_core.messages import AIMessageChunk

response = chain.stream({"question": "What is the capital of the United States?"})
print("Streamed Result:")
answer = ""
iflag = 0
for chunk in response:
    if isinstance(chunk, AIMessageChunk):
        # Chat models yield AIMessageChunk objects; report the type once.
        if iflag == 0: print("The type of chunk is AIMessageChunk...")
        iflag += 1
        answer += chunk.content
        print(chunk.content, end="", flush=True)
    elif isinstance(chunk, str):
        # Plain LLMs such as HuggingFacePipeline yield str chunks.
        if iflag == 0: print("The type of chunk is str...")
        iflag += 1
        answer += chunk
        print(chunk, end="", flush=True)
    else:
        if iflag == 0: print(f"The type of chunk is {type(chunk)}...")
        iflag += 1
Streamed Result:
The type of chunk is str...
The capital of the United States is Washington, D.C.
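As the output confirms, a chain that ends in an LLM (rather than a chat model) streams plain string chunks, so the type checks above can be reduced to a simple loop. A minimal sketch under that assumption:

# Minimal streaming loop, assuming each chunk is a plain string.
answer = ""
for chunk in chain.stream({"question": "What is the capital of the United States?"}):
    answer += chunk
    print(chunk, end="", flush=True)
print()  # final newline after the streamed answer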