Load Data

Use LangChain to load your data instead of writing your own ingestion pipeline. LangChain likely already provides the functionality you need.

The examples use a vector-enabled Astra DB Serverless database as the vector store and assume you have one available. If not, see Notebook Prerequisites.

Load PDF from file

  1. Use PyPDFLoader to load PDFs as an array of Document objects. Each Document object contains the page content and metadata; in this case, the metadata includes the source file and page number.

    from langchain_community.document_loaders import PyPDFLoader
    
    loader = PyPDFLoader("example-data/example-paper.pdf")
    docs = loader.load()
    
    print(docs[0])
  2. This prints the first page of your PDF and its associated metadata: metadata={'source': 'example-data/example-paper.pdf', 'page': 0}.

Try changing the index in print(docs[0]) and see how the metadata tracks your PDF pages.
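Because each page carries its own metadata, you can look pages up by page number instead of guessing at list positions. The sketch below uses plain dicts as stand-ins for Document objects so it runs without a PDF on hand; with a real loader, docs[i].page_content and docs[i].metadata behave the same way.

```python
# Stand-in for PyPDFLoader output: plain dicts mimicking Document objects.
# With a real loader, access docs[i].page_content and docs[i].metadata instead.
docs = [
    {"page_content": f"Text of page {i}",
     "metadata": {"source": "example-data/example-paper.pdf", "page": i}}
    for i in range(3)
]

# Look up a page by its metadata rather than by list position.
page_one = next(d for d in docs if d["metadata"]["page"] == 1)
print(page_one["metadata"])
# {'source': 'example-data/example-paper.pdf', 'page': 1}
```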

Load multiple PDFs from a directory

To load a directory containing multiple PDFs, use the PyPDFDirectoryLoader class:

from langchain_community.document_loaders import PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("./example-data/")
docs = loader.load()

print(docs[0])
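After a directory load, the source field in each Document's metadata tells you which file a page came from. A small sketch of sorting that out, again using plain dicts in place of Document objects (the filenames here are made up for illustration):

```python
from collections import defaultdict

# Stand-in for PyPDFDirectoryLoader output over two hypothetical PDFs.
docs = [
    {"page_content": "...", "metadata": {"source": "example-data/a.pdf", "page": 0}},
    {"page_content": "...", "metadata": {"source": "example-data/a.pdf", "page": 1}},
    {"page_content": "...", "metadata": {"source": "example-data/b.pdf", "page": 0}},
]

# Count pages per source file from the metadata alone.
pages_per_file = defaultdict(int)
for doc in docs:
    pages_per_file[doc["metadata"]["source"]] += 1

print(dict(pages_per_file))
# {'example-data/a.pdf': 2, 'example-data/b.pdf': 1}
```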

Load local JSON file

JSONLoader uses the jq library to parse JSON files. If you already know jq, this loader's query syntax will feel natural.

  1. Install jq with pip install jq.

  2. Load a local JSON file of HTTP responses as an array of Document objects.

    • Python

    • Result

    from langchain_community.document_loaders import JSONLoader
    import json
    from pathlib import Path
    from pprint import pprint
    
    file_path='./response.json'
    data = json.loads(Path(file_path).read_text())
    
    pprint(data)
    [{'event': {'id': 0, 'url': 'https://httpbin.org/stream/3'},
      'metadata': {'location': 'New York, USA',
                   'source': 'WebApplication',
                   'tags': ['streaming', 'data'],
                   'timestamp': '2023-12-20T12:00:00Z',
                   'userID': 'user001'}},
     {'event': {'id': 1, 'url': 'https://httpbin.org/stream/3'},
      'metadata': {'location': 'New York, USA',
                   'source': 'WebApplication',
                   'tags': ['streaming', 'data'],
                   'timestamp': '2023-12-20T12:05:00Z',
                   'userID': 'user002'}},
     {'event': {'id': 2, 'url': 'https://httpbin.org/stream/3'},
      'metadata': {'location': 'New York, USA',
                   'source': 'WebApplication',
                   'tags': ['streaming', 'data'],
                   'timestamp': '2023-12-20T12:10:00Z',
                   'userID': 'user003'}}]

Extract values to page_content

  1. Extract the values from the url field within the event key of the JSON data. This includes the url values in the page_content of each Document object.

    • Python

    • Result

    from langchain_community.document_loaders import JSONLoader
    import json
    from pathlib import Path
    from pprint import pprint
    
    loader = JSONLoader(
        file_path='./response.json',
        jq_schema='.[].event.url',
        text_content=False)
    
    data = loader.load()
    
    pprint(data)
    [Document(page_content='https://httpbin.org/stream/3', metadata={'source': '/response.json', 'seq_num': 1}),
     Document(page_content='https://httpbin.org/stream/3', metadata={'source': '/response.json', 'seq_num': 2}),
     Document(page_content='https://httpbin.org/stream/3', metadata={'source': '/response.json', 'seq_num': 3})]
  2. You can see that the page_content of each Document object now contains the url values.
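If you want to confirm what a jq schema like .[].event.url selects before handing it to JSONLoader, the equivalent extraction in plain Python (over a trimmed copy of the sample response.json contents shown above) is:

```python
import json

# The sample response.json contents from above, trimmed to the relevant fields.
raw = """
[
  {"event": {"id": 0, "url": "https://httpbin.org/stream/3"}},
  {"event": {"id": 1, "url": "https://httpbin.org/stream/3"}},
  {"event": {"id": 2, "url": "https://httpbin.org/stream/3"}}
]
"""

# Plain-Python equivalent of the jq schema '.[].event.url'.
urls = [record["event"]["url"] for record in json.loads(raw)]
print(urls)
# ['https://httpbin.org/stream/3', 'https://httpbin.org/stream/3', 'https://httpbin.org/stream/3']
```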

Extract metadata

metadata_func is responsible for identifying which values should be included in the Document object’s metadata. Modifying its behavior allows you to enrich your Document objects with additional information from the source data.

  1. Instruct metadata_func to extract the id and url values from each event record and add them to your Document metadata as event_id and event_url.

    • Python

    • Result

    from langchain_community.document_loaders import JSONLoader
    import json
    from pathlib import Path
    from pprint import pprint
    
    def metadata_func(record: dict, metadata: dict) -> dict:
    
        metadata["event_id"] = record.get("id")
        metadata["event_url"] = record.get("url")
    
        return metadata
    
    loader = JSONLoader(
        file_path='./response.json',
        jq_schema='.[].event',
        metadata_func=metadata_func,
        text_content=False
    )
    
    data = loader.load()
    
    pprint(data)
    [Document(page_content='{"url": "https://httpbin.org/stream/3", "id": 0}', metadata={'source': 'response.json', 'seq_num': 1, 'event_id': 0, 'event_url': 'https://httpbin.org/stream/3'}),
     Document(page_content='{"url": "https://httpbin.org/stream/3", "id": 1}', metadata={'source': 'response.json', 'seq_num': 2, 'event_id': 1, 'event_url': 'https://httpbin.org/stream/3'}),
     Document(page_content='{"url": "https://httpbin.org/stream/3", "id": 2}', metadata={'source': 'response.json', 'seq_num': 3, 'event_id': 2, 'event_url': 'https://httpbin.org/stream/3'})]
  2. You can see that both event_id and event_url are now included in the metadata of each Document object.
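Because metadata_func is an ordinary function, you can exercise it directly on a single record before wiring it into JSONLoader. A quick check, using one event record from the sample data and a base metadata dict shaped like the one the loader passes in (source and seq_num):

```python
def metadata_func(record: dict, metadata: dict) -> dict:
    # Copy the event's id and url into the Document metadata.
    metadata["event_id"] = record.get("id")
    metadata["event_url"] = record.get("url")
    return metadata

# One event record from the sample data, plus loader-style base metadata.
record = {"url": "https://httpbin.org/stream/3", "id": 0}
base = {"source": "response.json", "seq_num": 1}

print(metadata_func(record, base))
# {'source': 'response.json', 'seq_num': 1, 'event_id': 0, 'event_url': 'https://httpbin.org/stream/3'}
```

Note that the function mutates and returns the metadata dict it is given, which matches how JSONLoader invokes it per record.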

Alternatives

Load HTML

See the LangChain HTMLLoader for loading HTML documents.

Load data from S3

See the LangChain S3 loader for loading Document objects from an AWS S3 bucket.

Load data from Google Cloud Storage

See the LangChain Google Cloud Storage file loader for loading Document objects from a Google Cloud Storage file object.


© 2024 DataStax | Privacy policy | Terms of use
