Parsing SharePoint Pages

Parsing SharePoint pages with SPFx is straightforward and opens up a variety of exciting use cases. In this post, I’ll show how you can access SharePoint page content and parse it to extract various information.

Introduction
- Static content
- Dynamic content
SharePoint Online Page Structure - Generalized
- Horizontal-/ Vertical Section-Container
  - Web Part Controls
  - Rich Text Controls (OOTB Text Web Part control)
SPFx and the SharePoint Canvas
Microsoft Graph API for SharePoint Pages
Conclusion

Introduction

SharePoint Online is a modern web application built with React that behaves as a Single Page Application (SPA). Before parsing SharePoint Online page content, it’s important to understand how SharePoint Online works.

The SharePoint web application splits content into two main categories:

Static content

Static content includes text from Web Parts (including custom SPFx Web Parts), property pane settings, and content marked as indexable. It is saved in the page’s Canvas Content, which is accessible via the SharePoint REST API. This post focuses on Canvas Content.

Dynamic content

Dynamic content has two defining traits:

- It is produced by execution logic, such as retrieving data from a list based on certain criteria.
- It executes only once the content becomes visible to the user.

SharePoint takes care of all of this for us, enabling fast page loading. This type of content is not part of Canvas Content.

SharePoint Online Page Structure - Generalized

Let’s take a look at how the Canvas Control in SharePoint is structured.

Imagine a SharePoint page with the following content:

- A banner containing the title
- Various sections, also called horizontal sections
- Each section is divided into one or more vertical sections
- Each vertical section contains either SharePoint’s standard text Web Parts or custom Web Parts

Horizontal-/ Vertical Section-Container

<div data-sp-canvascontrol="" data-sp-canvasdataversion="1.0" data-sp-controldata="JSON">
- Holds each vertical/horizontal section.
- Attributes:
  - data-sp-canvascontrol: Always seems to be an empty string.
  - data-sp-canvasdataversion: Version number, generally “1.0”.
  - data-sp-controldata: JSON data that defines positioning and layout.
    - layoutIndex: Layout index, not relevant for this context.
    - zoneIndex: Index of the zone within the layout, representing a horizontal section.
    - zoneId: Unique GUID for each horizontal section.
    - sectionIndex: Defines which vertical section the content belongs to; if a section has a two-column layout, there will be two sectionIndex entries.
    - sectionFactor: Width factor. SharePoint divides layouts into 12 parts, so if you see 12, it represents a full-width layout, while a 4 would be a one-third layout.
    - controlIndex: Position index within the vertical section.
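
To make these fields concrete, here is a minimal sketch that parses the value of data-sp-controldata and turns sectionFactor into a width share. The names ControlPosition, parseControlPosition and widthShare are my own. I am also assuming that the position fields may sit under a "position" key in the JSON, which is what I have seen in practice, so the helper accepts both the nested and the flat shape.

```typescript
// Position fields as described in the list above. Optional fields may be
// missing on some controls, so they are not required here.
interface ControlPosition {
  layoutIndex?: number;
  zoneIndex: number;
  zoneId?: string;
  sectionIndex: number;
  sectionFactor: number;
  controlIndex?: number;
}

// Parses the raw attribute value. Returns undefined if the JSON is malformed
// or does not carry usable position data.
function parseControlPosition(controlDataJson: string): ControlPosition | undefined {
  let data: any;
  try {
    data = JSON.parse(controlDataJson);
  } catch {
    return undefined;
  }
  if (data === null || typeof data !== "object") {
    return undefined;
  }
  // Assumption: the fields are either nested under "position" or sit at the top level.
  const pos = data.position ?? data;
  if (
    typeof pos.zoneIndex !== "number" ||
    typeof pos.sectionIndex !== "number" ||
    typeof pos.sectionFactor !== "number"
  ) {
    return undefined;
  }
  return pos as ControlPosition;
}

// The layout grid has 12 parts, so sectionFactor / 12 is the column's share of the width.
function widthShare(sectionFactor: number): number {
  return sectionFactor / 12;
}
```

With a parsed document, the input would typically come from element.getAttribute('data-sp-controldata') on each canvas control.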

Web Part Controls

<div data-sp-webpart="" data-sp-webpartdataversion="X.X" data-sp-webpartdata="JSON">
- Attributes:
  - data-sp-webpart: Empty string, indicates a web part.
  - data-sp-webpartdataversion: Web part data version.
  - data-sp-webpartdata (JSON): Encodes web part properties.
    - Common JSON fields:
      - id: Unique web part ID.
      - title: Display name of the web part.
      - description: Description text.

Rich Text Controls (OOTB Text Web Part control)

<div data-sp-rte="">
- Attributes:
  - data-sp-rte: Flags the container as a Rich Text Editor (RTE) control.
- Content:
  - Contains the text content.
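
As a small sketch of how the text controls can be read, the helper below collects the plain text of every element carrying data-sp-rte. The interface and function names are hypothetical. I typed the input structurally so that a document produced by the browser's HTML parser fits, but so does any object offering querySelectorAll.

```typescript
// Minimal structural types: a parsed HTML document satisfies them,
// and so does a simple stand-in object in a unit test.
interface TextHolder {
  textContent: string | null;
}
interface RteQueryRoot {
  querySelectorAll(selector: string): ArrayLike<TextHolder>;
}

// Returns the text of each Rich Text control, whitespace-normalized,
// skipping controls that contain no text at all.
function getRichTextBlocks(root: RteQueryRoot): string[] {
  return Array.from(root.querySelectorAll("[data-sp-rte]"))
    .map((el) => (el.textContent ?? "").replace(/\s+/g, " ").trim())
    .filter((text) => text.length > 0);
}
```

This only sees static text stored in the canvas; anything a web part renders at runtime is not included.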

Important notes on the SharePoint Canvas:

- There are only two hierarchy levels: the first level defines the Vertical / Horizontal Section-Container, and the second contains the Web Parts placed inside it.
- Because it’s a flat structure, SharePoint uses the properties zoneIndex, sectionIndex, and controlIndex to build the visual layout during rendering.
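
The reconstruction from the flat structure can be sketched as a sort followed by grouping. This is only my illustration of the idea, not SharePoint's actual rendering code. PlacedControl and buildLayout are hypothetical names, and the label field is just a stand-in for whatever you extract from each control.

```typescript
interface PlacedControl {
  zoneIndex: number;
  sectionIndex: number;
  controlIndex: number;
  label: string; // hypothetical payload, e.g. a web part title
}

// Returns zones -> sections -> controls, each level ordered by its index.
function buildLayout(controls: PlacedControl[]): PlacedControl[][][] {
  // Sort a copy by zone, then section, then position inside the section.
  const sorted = controls.slice().sort(
    (a, b) =>
      a.zoneIndex - b.zoneIndex ||
      a.sectionIndex - b.sectionIndex ||
      a.controlIndex - b.controlIndex
  );

  const zones: PlacedControl[][][] = [];
  let currentZone: PlacedControl[][] = [];
  let currentSection: PlacedControl[] = [];
  let previous: PlacedControl | undefined;

  for (const control of sorted) {
    if (!previous || previous.zoneIndex !== control.zoneIndex) {
      // A new zoneIndex starts a new horizontal section.
      currentSection = [control];
      currentZone = [currentSection];
      zones.push(currentZone);
    } else if (previous.sectionIndex !== control.sectionIndex) {
      // Same zone, new sectionIndex: a new column.
      currentSection = [control];
      currentZone.push(currentSection);
    } else {
      currentSection.push(control);
    }
    previous = control;
  }
  return zones;
}
```

The order of the input does not matter, which matches the point above: the layout lives entirely in the three index properties.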

Now that we have a basic understanding of the SharePoint Canvas, let’s see how to access it and what we can do with it.

SPFx and the SharePoint Canvas

Accessing the SharePoint Canvas

I enjoy using PnP JS, so the following example is based on PnP JS:

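// this.sp is assumed to be an already configured PnP JS instance (SPFI)
const page = await this.sp.web.lists.getById('listid_of_your_sitepages_library').items.getById('listitemid_of_your_page')
    .select('CanvasContent1', 'FileRef')();
// CanvasContent1 holds the page canvas as an HTML string
const content: string = page.CanvasContent1;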

That’s it; now you have all the content in your content variable.
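
If you would rather not pull in a library, the same item can be requested through the SharePoint REST API directly. The sketch below is one way to do it, assuming the code runs in a browser session that is already signed in to the site. The function names are mine, and the site URL, list ID and item ID are placeholders you need to supply.

```typescript
// Builds the REST URL for one item of the Site Pages library,
// selecting the same two fields as the PnP JS example.
function buildCanvasItemUrl(siteUrl: string, listId: string, itemId: number): string {
  const base = siteUrl.replace(/\/+$/, ""); // drop trailing slashes
  return `${base}/_api/web/lists(guid'${listId}')/items(${itemId})?$select=CanvasContent1,FileRef`;
}

// Fetches the item and returns the canvas HTML string.
// Falls back to an empty string if the field comes back empty.
async function fetchCanvasContent(siteUrl: string, listId: string, itemId: number): Promise<string> {
  const response = await fetch(buildCanvasItemUrl(siteUrl, listId, itemId), {
    headers: { Accept: "application/json;odata=nometadata" },
  });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  const item = (await response.json()) as { CanvasContent1?: string | null };
  return item.CanvasContent1 ?? "";
}
```

Inside an SPFx web part, the SPHttpClient available on the web part context is usually the better fit than a bare fetch, since it handles the SharePoint-specific request setup for you.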

Preparing the SharePoint Canvas for Processing

The best way to read the SharePoint Canvas is to use the built-in JavaScript Web API class DOMParser, which is widely supported in browsers.

const parser = new DOMParser();
const doc = parser.parseFromString(content, 'text/html');

Now, we’re ready to start parsing.

Extracting Information from the SharePoint Page

Use Case 1: Accessing all Headings (h2/h3/h4) in a SharePoint Page

Suppose we want to access and list all headings on a SharePoint page. The code is quite simple at this point:

const headings = Array.from(doc.querySelectorAll('h2, h3, h4'));

We have full support in our browser for the DOM API, so we can leverage standardized and optimized methods in our code.

Reminder: If your hX tags are dynamic content (for example, generated by a custom SPFx web part), they won’t appear in the Canvas content.
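
Building on the heading query, here is a small sketch that turns the matched elements into an outline, for example as the basis for a table of contents. OutlineEntry, toOutline and formatOutline are names I made up for this illustration. The helper relies on tagName being something like "H2", which is what an HTML-parsed document reports.

```typescript
// Structural type: real heading elements fit, and so do plain test objects.
interface HeadingLike {
  tagName: string;
  textContent: string | null;
}

interface OutlineEntry {
  level: number; // 2 for h2, 3 for h3, ...
  text: string;
}

// Converts heading elements into outline entries, skipping empty headings.
function toOutline(headings: ArrayLike<HeadingLike>): OutlineEntry[] {
  return Array.from(headings)
    .map((h) => ({
      level: Number(h.tagName.slice(1)), // "H2" -> 2
      text: (h.textContent ?? "").trim(),
    }))
    .filter((entry) => entry.text.length > 0);
}

// Renders the outline as an indented plain-text list, with h2 as the top level.
function formatOutline(entries: OutlineEntry[]): string {
  return entries
    .map((e) => `${"  ".repeat(Math.max(0, e.level - 2))}- ${e.text}`)
    .join("\n");
}
```

With the parsed document from above, toOutline(doc.querySelectorAll('h2, h3, h4')) gives you the entries in page order.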

Use Case 2: Checking if a Web Part Exists

Suppose we want to check if a specific Web Part is available on the page:

// "mywebpartguid" is a placeholder for your web part's ID; *= matches it anywhere in the attribute value
const webPartElement = doc.querySelector('[data-sp-webpartdata*="mywebpartguid"]');
const myWebPartFound = webPartElement !== null;

We can go further and also access all the properties available for the Web Part.
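
One way to get at those properties is to parse the data-sp-webpartdata attribute and match on the exact id instead of a substring. The sketch below does that. The names are hypothetical, and the properties key is an assumption on my side about where the web part's own settings are stored in the JSON, so check it against your own canvas content.

```typescript
// Structural type: real elements fit, and so do plain test objects.
interface AttributeHost {
  getAttribute(name: string): string | null;
}

interface WebPartData {
  id: string;
  title?: string;
  description?: string;
  properties?: Record<string, unknown>; // assumption: web part settings live here
}

// Reads and parses data-sp-webpartdata from one element.
// Returns undefined if the attribute is missing or not valid JSON with an id.
function readWebPartData(el: AttributeHost): WebPartData | undefined {
  const raw = el.getAttribute("data-sp-webpartdata");
  if (!raw) {
    return undefined;
  }
  try {
    const data = JSON.parse(raw);
    return data !== null && typeof data === "object" && typeof data.id === "string"
      ? (data as WebPartData)
      : undefined;
  } catch {
    return undefined;
  }
}

// Finds the first web part whose id matches exactly (ignoring case).
function findWebPartById(elements: ArrayLike<AttributeHost>, webPartId: string): WebPartData | undefined {
  const wanted = webPartId.toLowerCase();
  for (const el of Array.from(elements)) {
    const data = readWebPartData(el);
    if (data && data.id.toLowerCase() === wanted) {
      return data;
    }
  }
  return undefined;
}
```

A call could look like findWebPartById(doc.querySelectorAll('[data-sp-webpartdata]'), 'mywebpartguid'). The HTML parser already decodes escaped quotes in attribute values, so getAttribute hands back plain JSON.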

For a comprehensive implementation of how to read the SharePoint Canvas and allow SPFx solutions to share information, check out the In Page Navigation solutions from PuntoBello.

Microsoft Graph API for SharePoint Pages

Microsoft began rolling out the Microsoft Graph API for SharePoint Pages in April 2024, enabling programmatic page manipulation.
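
As a rough sketch of what reading a page through Graph can look like: the request below asks for a single page and expands its canvas layout. The path and the $expand=canvasLayout option follow the Graph v1.0 sitePage documentation as I understand it, so please verify them against the current reference. The function names are mine, and how you obtain the access token is out of scope here.

```typescript
// Builds the Graph URL for one site page including its canvas layout.
// siteId and pageId are placeholders for the IDs of your site and page.
function buildPageCanvasUrl(siteId: string, pageId: string): string {
  return `https://graph.microsoft.com/v1.0/sites/${siteId}/pages/${pageId}/microsoft.graph.sitePage?$expand=canvasLayout`;
}

// Fetches the page with a bearer token and returns the parsed JSON as-is.
async function fetchPageCanvas(siteId: string, pageId: string, accessToken: string): Promise<unknown> {
  const response = await fetch(buildPageCanvasUrl(siteId, pageId), {
    headers: { Authorization: `Bearer ${accessToken}` },
  });
  if (!response.ok) {
    throw new Error(`Graph request failed with status ${response.status}`);
  }
  return (await response.json()) as unknown;
}
```

Unlike CanvasContent1, the response is structured JSON rather than an HTML string, so no DOM parsing is needed on this route.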

Conclusion

The SharePoint Canvas is a valuable source of information that can be easily leveraged to build solutions. Whenever possible, rely on standard browser APIs such as DOMParser and querySelectorAll to process it.