Document Loaders in LangChain

In this article, we will look at the different ways LangChain loads documents, bringing information in from various sources and preparing it for processing. These loaders act as data connectors: they fetch information and convert it into a format LangChain understands.

There are many document loaders in LangChain; the full list is available in the LangChain documentation.

We will cover some of them:

  • TextLoader
  • CSVLoader
  • JSONLoader
  • DirectoryLoader
  • PyPDFLoader
  • ArxivLoader
  • Docx2txtLoader
  • WebBaseLoader
  • UnstructuredFileLoader
  • UnstructuredURLLoader
  • YoutubeAudioLoader
  • NotionDirectoryLoader

TextLoader

from langchain_community.document_loaders import TextLoader

text = '/content/check.txt'
loader = TextLoader(text)
loader.load()

# Output

[Document(page_content='India, country that occupies the greater part of South Asia. India is made up of 28 states and eight union territories, and its national capital is New Delhi, built in the 20th century just south of the historic hub of Old Delhi to serve as India’s administrative centre. Its government is a constitutional republic that represents a highly diverse population consisting of thousands of ethnic groups and hundreds of languages. India became the world’s most populous country in 2023, according to estimates by the United Nations.', metadata={'source': '/content/check.txt'})]

CSVLoader

CSV files are a common format for storing tabular data, and the CSVLoader provides a convenient way to read and process this data.

import pandas as pd

# Create a simple DataFrame
data = {
    'Name': ['Rohit', 'Ayaan', 'Ajay', 'Sandesh'],
    'Age': [26, 20, 23, 23],
    'City': ['Delhi', 'Mumbai', 'Noida', 'Chicago']
}
df = pd.DataFrame(data)

# Export the DataFrame to a CSV file
csv_file_path = 'sample_data.csv'
df.to_csv(csv_file_path, index=False)

from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path='sample_data.csv')
data = loader.load()
data

# Output

[Document(page_content='Name: Rohit\nAge: 26\nCity: Delhi', metadata={'source': 'sample_data.csv', 'row': 0}),
 Document(page_content='Name: Ayaan\nAge: 20\nCity: Mumbai', metadata={'source': 'sample_data.csv', 'row': 1}),
 Document(page_content='Name: Ajay\nAge: 23\nCity: Noida', metadata={'source': 'sample_data.csv', 'row': 2}),
 Document(page_content='Name: Sandesh\nAge: 23\nCity: Chicago', metadata={'source': 'sample_data.csv', 'row': 3})]

When you load data from a CSV file, the loader typically creates a separate “Document” object for each row of data in the CSV.

By default, the source of each Document is set to the entire file path of the CSV itself. This might not be ideal if you want to track where each piece of information comes from within the CSV.

You can specify a column name within your CSV file using source_column. The value in that specific column for each row will then be used as the individual source for the corresponding Document created from that row.

loader = CSVLoader(file_path='sample_data.csv', source_column="Age")

data = loader.load()

data

# Output

[Document(page_content='Name: Rohit\nAge: 26\nCity: Delhi', metadata={'source': '26', 'row': 0}),
 Document(page_content='Name: Ayaan\nAge: 20\nCity: Mumbai', metadata={'source': '20', 'row': 1}),
 Document(page_content='Name: Ajay\nAge: 23\nCity: Noida', metadata={'source': '23', 'row': 2}),
 Document(page_content='Name: Sandesh\nAge: 23\nCity: Chicago', metadata={'source': '23', 'row': 3})]

This becomes particularly helpful when working with “chains” that involve answering questions based on the source of the information. By having individual source information for each Document, these chains can consider the origin of the data while processing and potentially provide more nuanced or reliable answers.

JSONLoader

JSONLoader is designed to handle data stored in JSON.

[
    {
        "id": 1,
        "name": "Ajay Kumar",
        "email": "ajay.kumar@example.com",
        "age": 23,
        "city": "Delhi"
    },
    {
        "id": 2,
        "name": "Rohit Sharma",
        "email": "rohit.sharma@example.com",
        "age": 26,
        "city": "Mumbai"
    },
    {
        "id": 3,
        "name": "Sandesh Tukrul",
        "email": "sandesh.tukrul@example.com",
        "age": 23,
        "city": "Noida"
    }
]

JSONLoader uses the jq library to parse JSON data. jq provides a powerful query language designed specifically for selecting and transforming JSON structures.

The jq_schema parameter lets you pass a jq expression to JSONLoader.

!pip install jq

from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path='example.json',
    jq_schema='map({ name, email })',
    text_content=False)

data = loader.load()
data

# Output

[Document(page_content="[{'name': 'Ajay Kumar', 'email': 'ajay.kumar@example.com'}, {'name': 'Rohit Sharma', 'email': 'rohit.sharma@example.com'}, {'name': 'Sandesh Tukrul', 'email': 'sandesh.tukrul@example.com'}]", metadata={'source': '/content/example.json', 'seq_num': 1})]

DirectoryLoader

DirectoryLoader loads all the documents in a directory. By default, it uses UnstructuredLoader under the hood.

We can use the glob parameter to control which files to load. In the example below, only .md files are loaded; .rst and .html files are skipped.

from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader('../', glob="**/*.md")
docs = loader.load()

len(docs)

PyPDFLoader

PyPDFLoader uses the pypdf library to load and process PDF documents, extracting text content while preserving page structure. Each page becomes its own Document, enabling granular processing, and the results plug into LangChain's ecosystem for tasks such as question answering over PDF data.

from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("/content/sample_data/MachineLearning-Lecture01.pdf")
pages_1 = loader.load()

#Each page is a Document. A Document contains text (page_content) and metadata.

len(pages_1)

# Output

22

page = pages_1[0]

print(page.page_content[:500])

# Output

MachineLearning-Lecture01  
Instructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine learning class. So what I wanna do today is ju st spend a little time going over the logistics of the class, and then we'll start to talk a bit about machine learning.  By way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so I personally work in machine learning, and I' ve worked on it for about 15 years now, and I actually think that machine learning i

We can also use UnstructuredPDFLoader to load PDFs.

from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader("/content/sample_data/MachineLearning-Lecture01.pdf")

data = loader.load()

We can use OnlinePDFLoader to load PDFs hosted online.

from langchain_community.document_loaders import OnlinePDFLoader

loader = OnlinePDFLoader("https://arxiv.org/pdf/2302.03803.pdf")

data = loader.load()

data

# Output

[Document(page_content='3 2 0 2\n\nb e F 7\n\n]\n\nG A . h t a m\n\n[\n\n1 v 3 0 8 3 0 . 2 0 3 2 : v i X r a\n\nA WEAK (k, k)-LEFSCHETZ THEOREM FOR PROJECTIVE TORIC ORBI...

There are several more PDF loaders that rely on different parsing backends:

# PyPDFium2Loader

from langchain_community.document_loaders import PyPDFium2Loader

loader = PyPDFium2Loader("text.pdf")

data = loader.load()

# PDFMinerLoader

from langchain_community.document_loaders import PDFMinerLoader

loader = PDFMinerLoader("text.pdf")

data = loader.load()

# PDFMinerPDFasHTMLLoader

from langchain_community.document_loaders import PDFMinerPDFasHTMLLoader

loader = PDFMinerPDFasHTMLLoader("text.pdf")

data = loader.load()[0]   # entire PDF is loaded as a single Document

# PyMuPDFLoader

from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("text.pdf")

data = loader.load()

# Directory loader for PDF

from langchain_community.document_loaders import PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("folder/")

docs = loader.load()

ArxivLoader

The ArxivLoader gives researchers and academics direct access to the extensive arXiv repository of open-access publications. With just a few lines of code, you can fetch research papers along with their metadata and process them like any other document.

from langchain_community.document_loaders import ArxivLoader

docs = ArxivLoader(query="1605.08386", load_max_docs=2).load()

print(len(docs))
print()
print(docs[0].metadata)

# Output

1

{'Published': '2016-05-26', 'Title': 'Heat-bath random walks with Markov 
bases', 'Authors': 'Caprice Stanley, Tobias Windisch', 'Summary': 
'Graphs on lattice points are studied whose edges come from a finite set of\nallowed moves of arbitrary length. We show that the diameter of these graphs on\nfibers of a fixed integer matrix can be bounded from above by a constant. We\nthen study the mixing behaviour of heat-bath random walks on these graphs. We\nalso state explicit conditions on the set of moves so that the heat-bath random\nwalk, a generalization of the Glauber dynamics, is an expander in fixed\ndimension.'}

Docx2txtLoader

The Docx2txtLoader is a specialized tool for Microsoft Word (.docx) documents. It loads and extracts text content from Word files, making it easy to bring documentation, reports, and other Word-based materials into your LangChain pipelines.

from langchain_community.document_loaders import Docx2txtLoader

loader = Docx2txtLoader("example_data.docx")

data = loader.load()

data

# Output

[Document(page_content='Lorem ipsum dolor sit amet.', metadata={'source': 'example_data.docx'})]

WebBaseLoader

from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/37signals-is-you.md")
docs = loader.load()
print(docs[0].page_content[:500])

UnstructuredFileLoader

Unlike loaders designed for specific formats, such as TextLoader, UnstructuredFileLoader automatically detects the type of the file you provide.

The loader utilizes the “unstructured” library under the hood. This library analyzes the file content and attempts to extract meaningful information based on the file type.

from langchain_community.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader('/content/textfile.txt')

docs = loader.load()

docs

# Output

[Document(page_content='The rise of generative models\n\nGenerative AI refers to deep-learning models that can take raw data—say, all of Wikipedia or the collected works of Rembrandt—and “learn” to generate statistically probable outputs when prompted. At a high level, generative models encode a simplified representation of their training data and draw from it to create a new work that’s similar, but not identical, to the original data. Generative models have been used for years in statistics to analyze numerical data. The rise of deep learning, however, made it possible to extend them to images, speech, and other complex data types. Among the first class of AI models to achieve this cross-over feat were variational autoencoders, or VAEs, introduced in 2013. VAEs were the first deep-learning models to be widely used for generating realistic images and speech.', metadata={'source': '/content/textfile.txt'})]



loader = UnstructuredFileLoader('/content/textfile.txt', mode="elements")

docs = loader.load()

docs

# Output

[Document(page_content='The rise of generative models', metadata={'source': '/content/textfile.txt', 'file_directory': '/content', 'filename': 'textfile.txt', 'last_modified': '2024-03-09T01:18:22', 'languages': ['eng'], 'filetype': 'text/plain', 'category': 'Title'}),
 Document(page_content='Generative AI refers to deep-learning models that can take raw data—say, all of Wikipedia or the collected works of Rembrandt—and “learn” to generate statistically probable outputs when prompted. At a high level, generative models encode a simplified representation of their training data and draw from it to create a new work that’s similar, but not identical, to the original data. Generative models have been used for years in statistics to analyze numerical data. The rise of deep learning, however, made it possible to extend them to images, speech, and other complex data types. Among the first class of AI models to achieve this cross-over feat were variational autoencoders, or VAEs, introduced in 2013. VAEs were the first deep-learning models to be widely used for generating realistic images and speech.', metadata={'source': '/content/textfile.txt', 'file_directory': '/content', 'filename': 'textfile.txt', 'last_modified': '2024-03-09T01:18:22', 'languages': ['eng'], 'filetype': 'text/plain', 'category': 'NarrativeText'})]



# pip install "unstructured[pdf]"

loader = UnstructuredFileLoader("text.pdf")

docs = loader.load()

docs

# Output

[Document(page_content='Event\n\nCommence Date\n\nReference\n\nPaul Kalkbrenner\n\n10 September,Satu
info@biletino.com', metadata={'source': 'text.pdf'})]

UnstructuredURLLoader

from langchain.document_loaders import UnstructuredURLLoader

loader = UnstructuredURLLoader(urls = ['https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf'])
pages = loader.load()

len(pages)

# Output

1

pagee = pages[0]

print(pagee.page_content[:500])

# Output

MachineLearning-Lecture01

Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine learning class. So what I wanna do today is just spend a little time going over the logistics of the class, and then we'll start to talk a bit about machine learning.

By way of introduction, my name's Andrew Ng and I'll be instructor for this class. And so I personally work in machine learning, and I've worked on it for about 15 years now, and I actually think that machine learning is the most e

YoutubeAudioLoader

from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
# ! pip install yt_dlp
# ! pip install pydub

url="https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir="docs/youtube/"
loader = GenericLoader(
    YoutubeAudioLoader([url],save_dir),
    OpenAIWhisperParser()
)
docs = loader.load()
docs[0].page_content[0:500]

NotionDirectoryLoader

from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()

print(docs[0].page_content[0:200])

docs[0].metadata
