waives.io

The CloudHub360 Developer Hub

Welcome to the CloudHub360 developer hub. You'll find comprehensive guides and documentation to help you start working with CloudHub360 as quickly as possible, as well as support if you get stuck. Let's jump right in!

Get access token

 
POST https://api.waives.io/oauth/token

Body Params

client_id
string
required

The Client ID of an API Client created using the dashboard

client_secret
string
required

The Client Secret of an API Client created using the dashboard

The access_token property of the response is your access token. The expires_in property specifies the number of seconds until this access token expires. You should request a new access token when it expires, or a little before.

With every request to the API you should then specify the Authorization header as follows:

Authorization: Bearer <MY_ACCESS_TOKEN>

POST /oauth/token HTTP/1.1
Content-Type: application/x-www-form-urlencoded

client_id=<MY_CLIENT_ID>&client_secret=<MY_CLIENT_SECRET>

curl https://api.waives.io/oauth/token \
-d "client_id=<MY_CLIENT_ID>" \
-d "client_secret=<MY_CLIENT_SECRET>" \
-X POST

{
  "access_token": "<MY_ACCESS_TOKEN>",
  "token_type": "Bearer",
  "expires_in": 86400
}
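
The token exchange and the renew-before-expiry advice can be sketched as follows. This is a minimal illustration rather than an official client: the names build_token_request and CachedToken, and the 60-second safety margin, are assumptions made for this example. The request is only built here, not sent.

```python
import urllib.parse
import urllib.request

TOKEN_URL = "https://api.waives.io/oauth/token"


def build_token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) the form-encoded POST that requests a token."""
    body = urllib.parse.urlencode(
        {"client_id": client_id, "client_secret": client_secret}
    ).encode("ascii")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


class CachedToken:
    """Holds a token response and reports when it is due for renewal."""

    def __init__(self, token_response: dict, issued_at: float, margin_seconds: int = 60):
        self.access_token = token_response["access_token"]
        # expires_in is a lifetime in seconds; renew slightly early (margin is an assumption).
        self.renew_at = issued_at + token_response["expires_in"] - margin_seconds

    def needs_renewal(self, now: float) -> bool:
        return now >= self.renew_at

    def authorization_header(self) -> dict:
        return {"Authorization": f"Bearer {self.access_token}"}
```

A caller would record time.time() when the token response arrives, pass it as issued_at, and check needs_renewal(time.time()) before each API request.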
 

Create document (upload)

Create a new document and add a file supplied in the request body to it. The document can then be read, classified or have data extracted from it.

 
POST https://api.waives.io/documents

The request body should contain the binary contents of the document's file.

The newly created document resource is returned, along with a 201 Created status. The document resource includes the document's ID, which can then be used with the Get, Read, Classify, Extract Document Data, Get Redacted PDF and Delete endpoints.

The Supported File Types article contains details of all file types supported by Waives, and the maximum file size.

Files embedded resource

The document resource contains an embedded files resource which includes details of the file that the document was created from.

"files": [
  {
    "id": "p3g-T4kf4EeNQ8baNLA8Uw",
    "file_type": "PDF:ImagePlusText",
    "size": 73136,
    "sha256": "f3ee28bbc30e789202e0f84bcbb187c5abc88d54e081bb3fa8abfa8f1a4603ea"
  }
]

The properties are as follows:

  • id: A unique identifier for this file.
  • file_type: The type of the file as determined by the API by examining the contents of the file. This will have one of the values listed in the table below.
  • size: The size of the file in bytes.
  • sha256: The SHA-256 hash of the file contents.

It is best practice to calculate your own values for size, sha256 and file_type (which in most cases will be a static value) of the file you are submitting and compare these to the values in the response in order to ensure that the file was not corrupted during transmission.
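
The integrity check described above might look like the sketch below. The function name check_file_resource and the idea of returning a list of mismatched property names are assumptions of this example; only the size, sha256 and file_type properties come from the files resource.

```python
import hashlib


def check_file_resource(content: bytes, file_resource: dict, expected_file_type: str) -> list:
    """Return the names of properties that differ from locally computed values."""
    problems = []
    if file_resource["size"] != len(content):
        problems.append("size")
    # hexdigest() is lowercase, so normalise the reported hash before comparing.
    if file_resource["sha256"].lower() != hashlib.sha256(content).hexdigest():
        problems.append("sha256")
    if file_resource["file_type"] != expected_file_type:
        problems.append("file_type")
    return problems
```

An empty list means the values reported by the API agree with the local file; any other result suggests the upload should be retried or investigated.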

Value of file_type | Description
------------------ | -----------
PDF:ImageOnly | PDF format file comprised only of full-page images, typically indicating a scanned document
PDF:ImagePlusText | PDF format file that has full-page images with 'hidden' text, typically indicating a scanned document that has had OCR used on it
PDF:Misc | PDF format file that has content other than full-page images, typically indicating a PDF generated from electronic content
Image:TIFF | An image in TIFF format
Image:JPEG | An image in JPEG format
Image:JPEG2000 | An image in JPEG-2000 format
OpenXML:Word | Microsoft Office Word (.docx) documents
OpenXML:Spreadsheet | Microsoft Office Excel (.xlsx) documents
OpenXML:Presentation | Microsoft Office PowerPoint (.pptx) documents
Text:ANSI | Plain text file with text in 8-bit ANSI format
Text:UTF8 | Plain text file with text in UTF-8 format
Text:UTF16, Text:UTF16_BigEndian | Plain text file with text in UTF-16 format or UTF-16 (big-endian) format
Email:MIME | An email in MIME (.eml) format
Email:MSG | An email in Microsoft Outlook (.msg) format
HTML:ANSI | HTML file encoded in 8-bit ANSI format
HTML:UTF8 | HTML file encoded in UTF-8 format
HTML:UTF16, HTML:UTF16_BigEndian | HTML file with text in UTF-16 format or UTF-16 (big-endian) format

If you know the type of the file and want to validate that the API concurs, you can set the Content-Type header to the MIME-type of the file as shown in the table below. If the file type does not match then the request will be rejected with a 415 response.

File Type | Content-Type header
--------- | -------------------
PDFs | application/pdf
Microsoft Office Word (.docx) documents | application/vnd.openxmlformats-officedocument.wordprocessingml.document
Microsoft Office Excel (.xlsx) documents | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Microsoft Office PowerPoint (.pptx) documents | application/vnd.openxmlformats-officedocument.presentationml.presentation
TIFF image | image/tiff
JPEG image | image/jpeg
JPEG 2000 image | image/jp2
Text document | text/plain
Email message (.eml) | message/rfc822
Outlook email message (.msg) | application/vnd.ms-outlook
HTML document | text/html
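
As an illustration of opting in to this validation, the sketch below chooses a Content-Type header from a filename. The extension keys (for example .tif or .htm) and the function name upload_headers are assumptions of this example; the MIME types are the ones in the table above. When the extension is unknown the header is left out, so no type validation is requested.

```python
import os

# MIME types from the table above; the file extensions are assumed for illustration.
CONTENT_TYPES = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".jp2": "image/jp2",
    ".txt": "text/plain",
    ".eml": "message/rfc822",
    ".msg": "application/vnd.ms-outlook",
    ".htm": "text/html",
    ".html": "text/html",
}


def upload_headers(filename: str, access_token: str) -> dict:
    """Headers for an upload request, with Content-Type only when it is known."""
    headers = {"Authorization": f"Bearer {access_token}"}
    extension = os.path.splitext(filename)[1].lower()
    if extension in CONTENT_TYPES:
        headers["Content-Type"] = CONTENT_TYPES[extension]
    return headers
```

Note that some HTTP clients add a default Content-Type to requests with a body, so it is worth checking what your client actually sends when the header is omitted.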

RESPONSES

201 The Document was created
400 There is no file supplied in the body
401 There is no Authorization header or the access token is invalid
403 You have reached your maximum number of simultaneous documents
413 The file supplied in the body is too large
415 The Content-Type contains an unsupported type or does not match the actual contents of the file

POST /documents HTTP/1.1
Authorization: Bearer ACCESS_TOKEN
Content-Type: application/pdf

Raw file content

curl https://api.waives.io/documents \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/pdf" -X POST \
-T "myfile.pdf"

{
  "id": "DOCUMENT_ID",
  "_links": {
    "document:classify": {
      "href": "/documents/vXSgaG1K9ke5LN2ufngydQ/classify/{classifier_name}",
      "templated": true
    },
    "self": {
      "href": "/documents/vXSgaG1K9ke5LN2ufngydQ"
    }
  },
  "_embedded": {
    "files": [
      {
        "id": "d4dWqO-hzUaV4oa8-1qY3w",
        "file_type": "PDF:ImagePlusText",
        "size": 93434,
        "sha256": "89feb305ae3f304abdc926e165b68c2e373306765a68c2adda243eac1b1c0f52"
      }
    ]
  }
}
 

Create document (import)

Create a new document and add a file available at a specified URL to it. The document can then be read, classified or have data extracted from it.

 
POST https://api.waives.io/documents

Headers

Content-Type
string

application/json

The request body should specify the URL from where Waives can download the contents of the document's file. The Content-Type header must be set to application/json; if it is excluded, the request will be treated as an upload request rather than an import.

Only HTTP and HTTPS schemes are allowed (HTTPS is strongly recommended).

The download of the file must succeed within 10 seconds, otherwise a 422 Unprocessable Entity is returned. The 422 response is returned in a few cases, such as when the download fails or the JSON body does not match the required schema. The reason for the 422 response is provided in the response body.
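
A small sketch of preparing an import request body, assuming a hypothetical helper named build_import_body. It checks the URL scheme locally before building the JSON body shown in the request example for this endpoint; the local check is a convenience of this example and does not replace the API's own validation.

```python
import json
from urllib.parse import urlsplit

# The Content-Type must be application/json, otherwise the request is treated as an upload.
IMPORT_HEADERS = {"Content-Type": "application/json"}


def build_import_body(file_url: str) -> bytes:
    """Return the JSON body for an import request, rejecting non-HTTP(S) URLs."""
    scheme = urlsplit(file_url).scheme.lower()
    if scheme not in ("http", "https"):
        raise ValueError(f"Only HTTP and HTTPS URLs are allowed, got {scheme!r}")
    return json.dumps({"url": file_url}).encode("utf-8")
```

Because the download must complete within 10 seconds, it may be worth confirming that the file host responds quickly before submitting large files this way.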

The newly created document resource is returned, along with a 201 Created status. The document resource includes the document's ID, which can then be used with the Get, Read, Classify, Extract Document Data, Get Redacted PDF and Delete endpoints.

The Supported File Types article contains details of all file types supported by Waives, and the maximum file size.

Files embedded resource

The document resource contains an embedded files resource which includes details of the file that the document was created from.

"files": [
  {
    "id": "p3g-T4kf4EeNQ8baNLA8Uw",
    "file_type": "PDF:ImagePlusText",
    "size": 73136,
    "sha256": "f3ee28bbc30e789202e0f84bcbb187c5abc88d54e081bb3fa8abfa8f1a4603ea"
  }
]

The properties are as follows:

  • id: A unique identifier for this file.
  • file_type: The type of the file as determined by the API by examining the contents of the file. This will have one of the values listed in the table below.
  • size: The size of the file in bytes.
  • sha256: The SHA-256 hash of the file contents.

It is best practice to calculate your own values for size, sha256 and file_type (which in most cases will be a static value) of the file you are submitting and compare these to the values in the response in order to ensure that the file was not corrupted during transmission.

Value of file_type | Description
------------------ | -----------
PDF:ImageOnly | PDF format file comprised only of full-page images, typically indicating a scanned document
PDF:ImagePlusText | PDF format file that has full-page images with 'hidden' text, typically indicating a scanned document that has had OCR used on it
PDF:Misc | PDF format file that has content other than full-page images, typically indicating a PDF generated from electronic content
Image:TIFF | An image in TIFF format
Image:JPEG | An image in JPEG format
Image:JPEG2000 | An image in JPEG-2000 format
OpenXML:Word | Microsoft Office Word (.docx) documents
OpenXML:Spreadsheet | Microsoft Office Excel (.xlsx) documents
OpenXML:Presentation | Microsoft Office PowerPoint (.pptx) documents
Text:ANSI | Plain text file with text in 8-bit ANSI format
Text:UTF8 | Plain text file with text in UTF-8 format
Text:UTF16, Text:UTF16_BigEndian | Plain text file with text in UTF-16 format or UTF-16 (big-endian) format
Email:MIME | An email in MIME (.eml) format
Email:MSG | An email in Microsoft Outlook (.msg) format
HTML:ANSI | HTML file encoded in 8-bit ANSI format
HTML:UTF8 | HTML file encoded in UTF-8 format
HTML:UTF16, HTML:UTF16_BigEndian | HTML file with text in UTF-16 format or UTF-16 (big-endian) format

If a Content-Type header is returned in the response when downloading the specified file, Waives will analyse the contents of the file and validate that the file type matches the MIME-type in the header. The valid Content-Type and file type combinations are specified in the table below.

If the file type does not match the header value then the request will be rejected with a 415 response.

File Type | Content-Type header
--------- | -------------------
PDFs | application/pdf
Microsoft Office Word (.docx) documents | application/vnd.openxmlformats-officedocument.wordprocessingml.document
Microsoft Office Excel (.xlsx) documents | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Microsoft Office PowerPoint (.pptx) documents | application/vnd.openxmlformats-officedocument.presentationml.presentation
TIFF image | image/tiff
JPEG image | image/jpeg
JPEG 2000 image | image/jp2
Text document | text/plain
Email message (.eml) | message/rfc822
Outlook email message (.msg) | application/vnd.ms-outlook
HTML document | text/html

RESPONSES

201 The Document was created
400 The request is badly formed or invalid
401 There is no Authorization header or the access token is invalid
403 You have reached your maximum number of simultaneous documents
415 The Content-Type is specified and not set to application/json
422 There was a problem downloading the specified file (see the error in the response for details of the specific error).

POST /documents HTTP/1.1
Authorization: Bearer ACCESS_TOKEN
Content-Type: application/json

{
  "url": "https://my.filestore.com/path/to/document.pdf"
}

curl https://api.waives.io/documents \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json" -X POST \
-d '{"url": "https://my.filestore.com/path/to/document.pdf"}'

{
  "id": "DOCUMENT_ID",
  "_links": {
    "document:classify": {
      "href": "/documents/vXSgaG1K9ke5LN2ufngydQ/classify/{classifier_name}",
      "templated": true
    },
    "self": {
      "href": "/documents/vXSgaG1K9ke5LN2ufngydQ"
    }
  },
  "_embedded": {
    "files": [
      {
        "id": "d4dWqO-hzUaV4oa8-1qY3w",
        "file_type": "PDF:ImagePlusText",
        "size": 93434,
        "sha256": "89feb305ae3f304abdc926e165b68c2e373306765a68c2adda243eac1b1c0f52"
      }
    ]
  }
}
 

Get document

Get the details of an existing document

 
GET https://api.waives.io/documents/{document_id}

Path Params

document_id
string
required

The ID of the document, as returned by a request to Create document

Details of the document resource and its files are returned.

RESPONSES

200 The document exists, and its details are included in the response
401 There is no Authorization header or the access token is invalid
404 There is no document with the specified ID

GET /documents/DOCUMENT_ID HTTP/1.1
Authorization: Bearer ACCESS_TOKEN

curl https://api.waives.io/documents/DOCUMENT_ID \
-H "Authorization: Bearer ACCESS_TOKEN" \
-X GET

{
  "id": "DOCUMENT_ID",
  "_links": {
    "document:classify": {
      "href": "/documents/DOCUMENT_ID/classify/{classifier_name}",
      "templated": true
    },
    "self": {
      "href": "/documents/DOCUMENT_ID"
    }
  },
  "_embedded": {
    "files": [
      {
        "id": "d4dWqO-hzUaV4oa8-1qY3w",
        "file_type": "PDF:ImagePlusText",
        "size": 93434,
        "sha256": "89feb305ae3f304abdc926e165b68c2e373306765a68c2adda243eac1b1c0f52"
      }
    ]
  }
}
 

Get all documents

Get the details of all the existing documents

 
GET https://api.waives.io/documents

Details of all the document resources and their files are returned. If no documents exist in the account, the documents property is an empty array.

RESPONSES

200 The details of all documents are contained in the response
401 There is no Authorization header or the access token is invalid

GET /documents HTTP/1.1
Authorization: Bearer ACCESS_TOKEN

curl https://api.waives.io/documents \
-H "Authorization: Bearer ACCESS_TOKEN" \
-X GET

{
  "documents": [
  {
    "id": "vXSgaG1K9ke5LN2ufngydQ",
    "_links": {
      "document:classify": {
        "href": "/documents/vXSgaG1K9ke5LN2ufngydQ/classify/{classifier_name}",
        "templated": true
      },
      "self": {
        "href": "/documents/vXSgaG1K9ke5LN2ufngydQ"
      }
    },
    "_embedded": {
      "files": [
        {
          "id": "d4dWqO-hzUaV4oa8-1qY3w",
          "file_type": "PDF:ImagePlusText",
          "size": 93434,
          "sha256": "89feb305ae3f304abdc926e165b68c2e373306765a68c2adda243eac1b1c0f52"
        }
      ]
    }
  },
  {
    "id": "UBBEQVuIo0a0xJmlBLBpJA",
    "_links": {
      "document:classify": {
        "href": "/documents/UBBEQVuIo0a0xJmlBLBpJA/classify/{classifier_name}",
        "templated": true
      },
      "self": {
        "href": "/documents/UBBEQVuIo0a0xJmlBLBpJA"
      }
    },
    "_embedded": {
      "files": [
        {
          "id": "_6pThnqrnEez4k5jHx68pw",
          "file_type": "PDF:ImagePlusText",
          "size": 82637,
          "sha256": "f3ee28bbc30e789202e0f84bcbb187c5abc88d54e081bb3fa8abfa8f1a4603ea"
        }
      ]
    }
  }    
 ]
}

{
  "documents": []
}
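
To show how a response like the ones above might be consumed, here is a minimal sketch that reduces the documents array to a short summary. The function name summarise_documents and the shape of the summary are assumptions of this example; the property names are those in the sample response.

```python
def summarise_documents(response: dict) -> list:
    """Return one small dict per document: its ID, self link and file types."""
    summary = []
    # An account with no documents returns an empty "documents" array.
    for document in response.get("documents", []):
        files = document.get("_embedded", {}).get("files", [])
        summary.append(
            {
                "id": document["id"],
                "self": document["_links"]["self"]["href"],
                "file_types": [f["file_type"] for f in files],
            }
        )
    return summary
```

Calling it on the empty response simply yields an empty list, so callers do not need a special case for new accounts.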
 

Read (OCR) document

OCR the specified document.

 
PUT https://api.waives.io/documents/{document_id}/reads

Path Params

document_id
string
required

The ID of the document, as returned by a request to Create document

Headers

Authorization
string

The OAuth 2.0 Bearer Token provided during token exchange

Note that only documents created from PDFs containing images (i.e. scan to PDF) and digitally-created PDFs are supported for reading.

Every page in a multi-page PDF is processed and included in read results.

For small documents reading will usually be very quick, but for very large documents you should expect response time to be up to tens of seconds.

If you read a document created from a digitally-created PDF, the PDF is rendered as an image (or images) and then OCR is performed on the resulting image(s).

If you try to read a document for which reading is not supported, such as a Microsoft Office or text document, you will receive a 422 Unprocessable Entity response.

Once this request is complete, you can obtain the OCR results as either a searchable PDF, with OCR text embedded, or as raw text by making a GET request to the same URL.
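
The two-step flow (PUT to perform the read, then GET on the same URL for the results) can be sketched as below. To keep the example independent of any HTTP library, the transport is passed in as a send callable taking (method, url, headers) and returning (status, body); that callable, the function name read_then_get and the choice of exceptions are all assumptions of this sketch.

```python
from typing import Callable, Dict, Tuple

# Hypothetical transport signature: (method, url, headers) -> (status code, response body)
Send = Callable[[str, str, Dict[str, str]], Tuple[int, bytes]]


def read_then_get(document_id: str, access_token: str, send: Send,
                  accept: str = "text/plain") -> bytes:
    """Request a read (OCR) of a document, then fetch the results."""
    url = f"https://api.waives.io/documents/{document_id}/reads"
    auth = {"Authorization": f"Bearer {access_token}"}

    status, body = send("PUT", url, auth)
    if status == 422:
        # The document's file type is not supported for reading.
        raise ValueError(body.decode("utf-8", errors="replace"))
    if status != 201:
        raise RuntimeError(f"Read request failed with status {status}")

    # The results are requested from the same URL, in the format named by Accept.
    status, body = send("GET", url, {**auth, "Accept": accept})
    if status != 200:
        raise RuntimeError(f"Fetching read results failed with status {status}")
    return body
```

Passing accept="application/pdf" would request the searchable PDF instead of raw text. Large documents can take tens of seconds to read, so the transport should use a generous timeout.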

RESPONSES

201 The Document was read. The results are available from the Get Read Results endpoint.
400 No Document ID is specified
401 There is no Authorization header or the access token is invalid
404 The specified Document does not exist
422 The content type of the specified document is not supported for this operation

PUT /documents/{document_id}/reads HTTP/1.1
Authorization: Bearer {token}
Host: api.waives.io

curl https://api.waives.io/documents/DOCUMENT_ID/reads \
-H "Authorization: Bearer ACCESS_TOKEN" \
-X PUT

{
    "_links": {
        "self": {
            "href": "/documents/{document_id}/reads"
        },
        "parent": {
            "href": "/documents/{document_id}"
        }
    }
}

{
    "message": "The content type application/vnd.openxmlformats-officedocument.wordprocessingml.document is not currently supported for reading."
}
 

Get read (OCR) results

Get the results of a read request, as a searchable PDF or raw OCR text

 
GET https://api.waives.io/documents/{document_id}/reads

Path Params

document_id
string
required

The ID of the document, as returned by a request to Create document

Headers

Accept
string

The file format in which you would like the document's read results. Supported types are described below.

Authorization
string

The OAuth 2.0 Bearer Token provided during token exchange

Before you make a request to this endpoint you should make a request to Read (OCR) document, otherwise you will receive a 404 Not Found response.

You must set an Accept header with a value specifying the format in which the OCR results should be returned, as follows:

Format | Accept header value
------ | -------------------
Searchable PDF, with OCR text embedded | application/pdf
Raw OCR text | text/plain
Waives document format (use this only in conjunction with Waives support) | application/vnd.waives.resultformats.read+zip

The results are returned in the body of the response in the format requested.

Creating Searchable PDFs from TIFFs or JPEGs

Creation of searchable PDFs from TIFF, JPEG and JPEG2000 files will be available very soon. If you need this, please get in touch with us via support@waives.io.

RESPONSES

200 The results are available in the format requested and returned in the response body
400 No Document ID is specified
401 There is no Authorization header or the access token is invalid
404 The specified Document does not exist or a Read (OCR) document request has not been made for this Document.

GET /documents/{document_id}/reads HTTP/1.1
Accept: application/pdf
Authorization: Bearer {token}
Host: api.waives.io

curl https://api.waives.io/documents/DOCUMENT_ID/reads \
-H "Accept: CONTENT_TYPE" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-X GET

Results in response body
 

Classify document

Classify the specified document using a classifier and return its document type.

 
POST https://api.waives.io/documents/{document_id}/classify/{classifier_name}

Path Params

document_id
string
required

The ID of the document, as returned by a request to Create document

classifier_name
string
required

The name of the classifier to use, as specified when calling Create classifier.

Note that documents created from image files (TIFF, JPEG, JPEG2000) and PDFs that contain only images are automatically read (OCRed) before classification is performed. For small documents this will usually be very quick, but for very large documents you should expect response time to be up to tens of seconds.

The classification result contains several properties with different purposes. You should take care to understand these. The Classification results article explains all the properties and how to interpret them.

If you haven't added samples to the classifier

If you use a classifier before you have added samples to it you will get classification results where the document type and document type scores are null.

RESPONSES

200 The Document was classified
400 No Document ID is specified or no Classifier name was specified
401 There is no Authorization header or the access token is invalid
404 The specified Document does not exist or the specified Classifier does not exist

POST /documents/DOCUMENT_ID/classify/CLASSIFIER_NAME HTTP/1.1
Authorization: Bearer ACCESS_TOKEN

curl https://api.waives.io/documents/DOCUMENT_ID/classify/CLASSIFIER_NAME \
-H "Authorization: Bearer ACCESS_TOKEN" \
-X POST

{
  "document_id": "9oQjQON2hUCLGRYfplDkgA",
  "classification_results": {
    "document_type": "Agreements",
    "is_confident": true,
    "relative_confidence": 1.4373222,
    "document_type_scores": [
      {
        "document_type": "Agreements",
        "score": 70.0982361
      },
      {
        "document_type": "Expenses",
        "score": 29.9017639
      }
    ]
  }
}
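
One possible way to consume a classification response is sketched below. Treating is_confident as the accept or reject signal is an assumption of this example (the Classification results article is the authority on how to interpret each property), as are the function names. The null handling covers the case of a classifier with no samples.

```python
from typing import List, Optional, Tuple


def confident_document_type(response: dict) -> Optional[str]:
    """Return the document type only when one is present and flagged as confident."""
    results = response["classification_results"]
    # A classifier without samples yields a null document type.
    if results.get("document_type") is None:
        return None
    return results["document_type"] if results.get("is_confident") else None


def ranked_scores(response: dict) -> List[Tuple[str, float]]:
    """Return (document_type, score) pairs, highest score first."""
    scores = response["classification_results"].get("document_type_scores") or []
    return sorted(
        ((s["document_type"], s["score"]) for s in scores),
        key=lambda pair: pair[1],
        reverse=True,
    )
```

A result of None from confident_document_type is a natural point to route the document to manual review.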
 

Extract document data

Extract data from the specified document using an extractor.

 
POST https://api.waives.io/documents/{document_id}/extract/{extractor_name}

Path Params

document_id
string
required

The ID of the document, as returned by a request to Create document

extractor_name
string
required

The name of the extractor to use, as specified when calling Create extractor.

Headers

Accept
string

The type of response to return (extraction results or a redaction request)

Overview

This endpoint extracts data from the specified document using an extractor. By default it returns details of the extracted data, but it can also be used to obtain a response that can be passed directly to the Get redacted PDF endpoint to get a PDF with all extracted data redacted.

On-demand reading

Note that documents created from image files (TIFF, JPEG, JPEG2000) and PDFs that contain only images are automatically read (OCRed) before extraction is performed. For small documents this will usually be very quick, but for very large documents you should expect response time to be longer.

Extracting invoice data

To extract invoice data from UK invoices you can use the built-in extractor named waives.invoices.gb. For more information, see Extracting invoice data.

Extracted data results

If the Accept header is not set or is application/vnd.waives.resultformats.extractdata+json, then the response contains details of the data extracted from the document.

The field_results section of the response contains the data extracted from the document. This is an array containing one element for each field in the extractor configuration. Each field looks like this:

{
  "field_name": "Amount",
  "result": {
    "text": "$5.50",
    "value": null,
    "rejected": false,
    "reject_reason": "None",
    "areas": [
      {
        "top": 558.7115,
        "left": 276.48,
        "bottom": 571.1989,
        "right": 298.58,
        "page_number": 1
      }
    ],
    "proximity_score": 100,
    "match_score": 100,
    "text_score": 100
  },
  "alternatives": null,
  "tabular_results": null,
  "rejected": false,
  "reject_reason": "None"
}

The properties of the field are:

  • field_name: The name of the field
  • result: The primary result for the field (null for a table field)
  • rejected: A flag indicating whether the field results should be considered potentially invalid
  • reject_reason: The reason for rejection of the field
  • alternatives: Secondary (alternative) results for the field
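
As a sketch of how these field properties might be used, the function below collects the text of every field that has a primary result and is not rejected. The name accepted_field_texts and the decision to skip rejected fields entirely are assumptions of this example; a real integration may prefer to send rejected fields for review instead.

```python
def accepted_field_texts(response: dict) -> dict:
    """Map field_name to result text for fields that were not rejected."""
    accepted = {}
    for field in response.get("field_results", []):
        result = field.get("result")
        # Table fields have a null primary result; rejected fields are skipped.
        if result is None or field.get("rejected"):
            continue
        accepted[field["field_name"]] = result["text"]
    return accepted
```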

The primary result, and any alternative results are structured like this:

{
  "text": "$5.50",
  "value": null,
  "rejected": false,
  "reject_reason": "None",
  "areas": [
    {
      "top": 558.7115,
      "left": 276.48,
      "bottom": 571.1989,
      "right": 298.58,
      "page_number": 1
    }
  ],
  "proximity_score": 100,
  "match_score": 100,
  "text_score": 100
}

The properties of a result are:

  • text: The text of the result
  • value: The value as a non-text type (e.g. Decimal or DateTime), if available
  • rejected: A flag indicating whether the result should be considered potentially invalid
  • reject_reason: The reason for rejection of the result
  • areas: A list of areas from which the result originated
  • proximity_score: A score indicating how well any proximity rules in the configuration for this field have been met (how close this result is, or isn't, to particular content nearby)
  • match_score: A score indicating how well the text matched the search criteria
  • text_score: A score indicating the OCR confidence assigned to the actual text that was extracted

The area co-ordinates are relative to the top left of the page and are in points (1/72 inch). The page number is one-based (i.e. the first page of a document is page 1).

Score property values range from 0 to 100, where 100 is a perfect score.
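
Because areas are expressed in points with a one-based page number, drawing them on a rendered page image needs a unit conversion. The sketch below assumes a hypothetical helper named area_to_pixels and a page image rendered at a known DPI; rounding to whole pixels and returning a zero-based page index are choices made for this example.

```python
def area_to_pixels(area: dict, dpi: float) -> dict:
    """Convert an area in points (1/72 inch) to pixel coordinates at the given DPI."""
    scale = dpi / 72.0  # pixels per point
    return {
        "page_index": area["page_number"] - 1,  # page_number is one-based
        "top": round(area["top"] * scale),
        "left": round(area["left"] * scale),
        "bottom": round(area["bottom"] * scale),
        "right": round(area["right"] * scale),
    }
```

For example, an area whose top is 72 points from the top of the page sits 300 pixels down on a 300 DPI rendering.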

Getting a response in redaction request format

This endpoint can also be used to obtain a response that can be passed directly to the Get redacted PDF endpoint to get a PDF with all extracted data redacted.

If the Accept header is application/vnd.waives.requestformats.redact+json then the response you receive will be a redaction request that will redact all data extracted from the document. You can either send this directly in a request to the Get redacted PDF endpoint or edit it first.

One redaction mark is created for every non-empty result and alternative result for every field.

Each redaction mark is labelled with the extraction field it came from to help you if you want to edit it, for example by removing marks for specific fields.
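
Editing a redaction request before sending it on could look like the sketch below, which removes the marks that came from chosen extraction fields. It assumes the label is carried in each mark's name property, as in the sample redaction request for this endpoint; the function name without_fields is hypothetical. Bookmarks are left untouched.

```python
import copy


def without_fields(redaction_request: dict, field_names: set) -> dict:
    """Return a copy of the request without marks labelled with the given field names."""
    edited = copy.deepcopy(redaction_request)  # leave the caller's object unmodified
    edited["marks"] = [
        mark for mark in edited.get("marks", []) if mark.get("name") not in field_names
    ]
    return edited
```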

RESPONSES

200 Data was extracted from the document and results are in the response
400 No document ID is specified or no extractor name was specified
401 There is no Authorization header or the access token is invalid
404 The specified document does not exist or the specified extractor does not exist

POST /documents/DOCUMENT_ID/extract/EXTRACTOR_NAME HTTP/1.1
Authorization: Bearer ACCESS_TOKEN

curl https://api.waives.io/documents/DOCUMENT_ID/extract/EXTRACTOR_NAME \
-H "Authorization: Bearer ACCESS_TOKEN" \
-X POST

{
  "document": {
    "page_count": 1,
    "pages": [
      {
        "page_number": 1,
        "width": 611,
        "height": 1008
      }
    ]
  },
  "field_results": [
    {
      "field_name": "Amount",
      "result": {
        "text": "$5.50",
        "value": null,
        "rejected": false,
        "reject_reason": "None",
        "areas": [
          {
            "top": 558.7115,
            "left": 276.48,
            "bottom": 571.1989,
            "right": 298.58,
            "page_number": 1
          }
        ],
        "proximity_score": 100,
        "match_score": 100,
        "text_score": 100
      },
      "alternatives": null,
      "tabular_results": null,
      "rejected": false,
      "reject_reason": "None"
    }
  ]
}

{
  "marks": [
    {
      "name": "Amount",
      "area": {
        "top": 558.7115,
        "left": 276.48,
        "bottom": 571.1989,
        "right": 298.58,
        "page_number": 1
      }
    }
  ],
  "apply_marks": true,
  "bookmarks": [
    {
      "text": "Amount",
      "page_number": 1
    }
  ]
}
 

Delete document

Delete an existing document

 
DELETE https://api.waives.io/documents/{document_id}

Path Params

document_id
string
required

The ID of the document, as returned by a request to Create document

RESPONSES

204 The document was deleted, or did not exist
401 There is no Authorization header or the access token is invalid

DELETE /documents/DOCUMENT_ID HTTP/1.1
Authorization: Bearer ACCESS_TOKEN

curl https://api.waives.io/documents/DOCUMENT_ID \
-H "Authorization: Bearer ACCESS_TOKEN" \
-X DELETE
 

Get redacted PDF (beta)

Get a PDF of the document with specific areas redacted

 
POST https://api.waives.io/documents/{document_id}/redact

Path Params

document_id
string
required

The ID of the document, as returned by a request to Create document

Body Params

marks
array

An array of areas of the document to redact

apply_marks
boolean

Whether to make the redactions permanent and remove associated text from the PDF. (Default: true)

bookmarks
array

An array of bookmarks to add to the document

Beta endpoint

This endpoint is currently in beta. It is functionally complete, but performance is not yet optimised. You should expect response times in the order of 2000ms. Redacted PDFs do not maintain the compression of the file the document was created from and thus will increase in size.

Supported file types

Redaction is supported for documents created from PDFs or TIFFs.

Redaction request

Adding Marks

The marks property is an array containing one element for each redaction to be made to the document. Each mark object looks like this:

{
  "area": {
    "top": 98,
    "left": 157,
    "bottom": 104,
    "right": 187,
    "page_number": 1
   }
}

The area co-ordinates are relative to the top left of the page and are in points (1/72 inch). The page number is one-based (i.e. the first page of a document is page 1).

Applying redactions

The apply_marks property controls how redactions are made in the PDF.

If apply_marks is true (the default) then as well as a redaction object being added to the PDF, the image underlying each field area is replaced with a black rectangle and any text in that area is removed. The redaction is permanent and cannot be undone if the PDF is loaded into a PDF editor such as Adobe Acrobat.

If apply_marks is false then a redaction object is added to the PDF but the image and any text in the PDF are left unaltered. The redaction can be reviewed and accepted or deleted in a PDF editor such as Adobe Acrobat. Accepting the redaction in that tool will alter the image and remove the text.

Adding bookmarks

The bookmarks property is an array containing one element for each bookmark to add to the PDF. Each bookmark object looks like this:

{
  "text": "Address",
  "page_number": 2
}

The text property specifies the text of the bookmark that will be added. The page_number specifies the page in the document that the bookmark will link to.
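
Putting marks, apply_marks and bookmarks together, a redaction request body might be assembled as in the sketch below. The function name build_redaction_request, the (text, page_number) tuple form for bookmarks and the local sanity checks are assumptions of this example, not requirements stated by the API.

```python
import json


def build_redaction_request(areas, bookmarks=(), apply_marks=True) -> str:
    """Build the JSON body for a redaction request from areas and bookmarks."""
    for area in areas:
        # Co-ordinates are measured from the top left, so top < bottom and left < right.
        if area["page_number"] < 1:
            raise ValueError("page_number is one-based")
        if not (area["top"] < area["bottom"] and area["left"] < area["right"]):
            raise ValueError(f"area has no extent: {area}")
    return json.dumps(
        {
            "marks": [{"area": dict(area)} for area in areas],
            "apply_marks": apply_marks,
            "bookmarks": [
                {"text": text, "page_number": page} for text, page in bookmarks
            ],
        }
    )
```

Setting apply_marks to False produces redactions that can still be reviewed in a PDF editor, which can be useful while testing mark positions.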

Creating a redaction request based on extraction results

In most cases you will want to redact areas corresponding to the locations of data extracted using the Extract document data endpoint. Rather than building a redaction request manually you can request a response from that endpoint that you can pass straight to this endpoint.

Make a request to the Extract document data endpoint, specifying an Accept header with the value application/vnd.waives.requestformats.redact+json. The response you receive will be a redaction request that will redact all data extracted from the document. You can either send this directly in a request to this endpoint or edit it first. Each redaction mark is labelled with the extraction field it came from to help you if you want to edit it, for example by removing the marks for some fields.

PDF Text

The PDF returned in the response will contain any text generated by a read (OCR) operation due to any of the Read, Classify or Extract operations being requested for this document.

RESPONSES

200 The document was redacted and the PDF is in the response body
400 One or more properties in the request was invalid. See the response contents for details.
401 There is no Authorization header or the access token is invalid
404 The specified document does not exist
415 Redaction is not supported for documents created from this document's file type

POST /documents/DOCUMENT_ID/redact HTTP/1.1
Authorization: Bearer ACCESS_TOKEN

{
  "marks": [
    {
      "area": {
        "top": 98,
        "left": 157,
        "bottom": 104,
        "right": 187,
        "page_number": 1
      }
    }
  ],
  "apply_marks": false,
  "bookmarks": [
    {
      "text": "Credit card number",
      "page_number": 1
    }
  ]
}
curl https://api.cloudhub360.com/documents/DOCUMENT_ID/redact \
-H "Authorization: Bearer ACCESS_TOKEN" \
-X POST

PDF in response body
 

Create classifier

Create a new classifier from an existing classifier or from scratch. Samples must be added to an empty classifier before it can be used to classify documents.

 
posthttps://api.waives.io/classifiers/classifier_name

Path Params

classifier_name
string
required

The desired name for the classifier

Body Params

file

If you have trained a classifier using Document Studio, include the saved .clf file in the request

Headers

Content-Type
string

The MIME type of the request body. Supported values are application/vnd.waives.classifier+zip and application/octet-stream.

If you do not include a classifier file in your Create Classifier request, an empty classifier will be created with the name you specified. You must add samples to the empty classifier before you can use it for classification. If you try to classify a document using a classifier with no samples added you will get classification results where the document type and document type scores are null. For more information see About Classifiers.
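The two ways of calling this endpoint (with and without a classifier file) might be sketched in Python as below. The function name `build_create_classifier_request` is hypothetical, and the request is only built, not sent. The base URL and Content-Type value are the ones given for this endpoint.

```python
import urllib.parse
import urllib.request

API_BASE = "https://api.waives.io"  # base URL as shown for this endpoint


def build_create_classifier_request(name, access_token, clf_bytes=None):
    """Build (but do not send) a Create classifier request.

    With clf_bytes=None the request has no body, which corresponds to
    creating an empty classifier. Otherwise clf_bytes should hold the
    contents of a saved .clf file.
    """
    headers = {"Authorization": f"Bearer {access_token}"}
    if clf_bytes is not None:
        headers["Content-Type"] = "application/vnd.waives.classifier+zip"
    url = f"{API_BASE}/classifiers/{urllib.parse.quote(name, safe='')}"
    return urllib.request.Request(
        url, data=clf_bytes, headers=headers, method="POST"
    )


if __name__ == "__main__":
    request = build_create_classifier_request("hr-documents", "ACCESS_TOKEN")
    print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it; a 201 response indicates the classifier was created.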

RESPONSES

201 The classifier was created
400 There is already a classifier with the specified name
401 There is no Authorization header or the access token is invalid

POST /classifiers/CLASSIFIER_NAME HTTP/1.1
Authorization: Bearer ACCESS_TOKEN
curl https://api.cloudhub360.com/classifiers/CLASSIFIER_NAME \
-H "Authorization: Bearer ACCESS_TOKEN" \
-X POST

{
  "name": "CLASSIFIER_NAME",
  "_links": {
    "self": {
      "href": "/classifiers/CLASSIFIER_NAME"
    },
    "classifier:add_sample": {
      "href": "/classifiers/CLASSIFIER_NAME/sample/{document_type}",
      "templated": true
    },
    "classifier:add_samples_from_zip": {
      "href": "/classifiers/CLASSIFIER_NAME/samples"
    },
    "classifier:get": {
      "href": "/classifiers/CLASSIFIER_NAME"
    }
  }
}
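The `classifier:add_sample` link in the response above is marked as templated, so the `{document_type}` placeholder has to be filled in before the link can be followed. A minimal Python sketch of that substitution follows; `expand_link` is a hypothetical helper that handles only simple `{name}` placeholders, not full URI templates.

```python
from urllib.parse import quote


def expand_link(link, **variables):
    """Return the href of a link object, filling in {name} placeholders.

    Values are percent-encoded so they are safe to use as a path segment.
    Links without "templated": true are returned unchanged.
    """
    href = link["href"]
    if link.get("templated"):
        for name, value in variables.items():
            href = href.replace("{" + name + "}", quote(str(value), safe=""))
    return href


if __name__ == "__main__":
    add_sample = {
        "href": "/classifiers/hr-documents/sample/{document_type}",
        "templated": True,
    }
    print(expand_link(add_sample, document_type="Agreements"))
```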
 

Get classifier

Get the details of an existing classifier

 
gethttps://api.waives.io/classifiers/classifier_name

Path Params

classifier_name
string
required

The name of the classifier

RESPONSES

200 The classifier exists, and its details are included in the response
401 There is no Authorization header or the access token is invalid
404 There is no classifier with the specified name

GET /classifiers/CLASSIFIER_NAME HTTP/1.1
Authorization: Bearer ACCESS_TOKEN
curl https://api.cloudhub360.com/classifiers/CLASSIFIER_NAME \
-H "Authorization: Bearer ACCESS_TOKEN" \
-X GET

{
  "name": "CLASSIFIER_NAME",
  "_links": {
    "self": {
      "href": "/classifiers/CLASSIFIER_NAME"
    },
    "classifier:add_sample": {
      "href": "/classifiers/CLASSIFIER_NAME/sample/{document_type}",
      "templated": true
    },
    "classifier:add_samples_from_zip": {
      "href": "/classifiers/CLASSIFIER_NAME/samples"
    },
    "classifier:get": {
      "href": "/classifiers/CLASSIFIER_NAME"
    }
  }
}
 

Get all classifiers

Get the details of all classifiers

 
gethttps://api.waives.io/classifiers

RESPONSES

200 The details of all classifiers are contained in the response
401 There is no Authorization header or the access token is invalid

GET /classifiers HTTP/1.1
Authorization: Bearer ACCESS_TOKEN
curl https://api.cloudhub360.com/classifiers \
-H "Authorization: Bearer ACCESS_TOKEN" \
-X GET

{
  "classifiers": [
    {
      "name": "hr-documents",
      "_links": {
        "self": {
          "href": "/classifiers/hr-documents"
        },
        "classifier:add_sample": {
          "href": "/classifiers/hr-documents/sample/{document_type}",
          "templated": true
        },
        "classifier:add_samples_from_zip": {
          "href": "/classifiers/hr-documents/samples"
        },
        "classifier:get": {
          "href": "/classifiers/hr-documents"
        }
      }
    },
    {
      "name": "finance-documents",
      "_links": {
        "self": {
          "href": "/classifiers/finance-documents"
        },
        "classifier:add_sample": {
          "href": "/classifiers/finance-documents/sample/{document_type}",
          "templated": true
        },
        "classifier:add_samples_from_zip": {
          "href": "/classifiers/finance-documents/samples"
        },
        "classifier:get": {
          "href": "/classifiers/finance-documents"
        }
      }
    }
  ]
}
 

Add samples from ZIP file

Add a set of sample documents, labelled with their document types and saved in a ZIP file, to a classifier.

 
posthttps://api.waives.io/classifiers/classifier_name/samples

Path Params

classifier_name
string
required

The name of the Classifier to add the samples to

The request body should contain the binary contents of the samples ZIP file.

The Preparing sample documents article explains how to create a set of sample documents and a samples ZIP file.
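As an illustration only, a samples ZIP file could be built in memory with Python's standard library. The layout used here (one folder per document type, matching the `path` values in the response example for this endpoint) is an assumption; the Preparing sample documents article is the authoritative description. `build_samples_zip` is a hypothetical helper.

```python
import io
import zipfile


def build_samples_zip(samples):
    """Return the bytes of a ZIP file containing the given samples.

    samples is an iterable of (document_type, filename, content_bytes).
    Each file is stored as "<document_type>/<filename>" (assumed layout).
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for document_type, filename, content in samples:
            archive.writestr(f"{document_type}/{filename}", content)
    return buffer.getvalue()


if __name__ == "__main__":
    zip_bytes = build_samples_zip(
        [
            ("Agreements", "office-rental-agreement.pdf", b"%PDF-..."),
            ("Expenses", "emerson-tunt-nov-2016.pdf", b"%PDF-..."),
        ]
    )
    print(len(zip_bytes), "bytes")
```

The resulting bytes would form the request body, sent with `Content-Type: application/zip`.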

RESPONSES

200 The samples were added to the Classifier
400 There is no file supplied in the body, the file supplied is not a ZIP file or the contents of the ZIP file are invalid (details of the exact problem are included in the error response).
401 There is no Authorization header or the access token is invalid
404 The specified Classifier does not exist
415 The Content-Type header is missing or invalid. Currently only application/zip is supported.

POST /classifiers/CLASSIFIER_NAME/samples HTTP/1.1
Authorization: Bearer ACCESS_TOKEN
Content-Type: application/zip

Raw ZIP file content
curl https://api.cloudhub360.com/classifiers/CLASSIFIER_NAME/samples \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/zip" -X POST \
-T "samples.zip"

{
  "samples": [
    {
      "path": "Agreements/office-rental-agreement.pdf",
      "document_type": "Agreements"
    },
    {
      "path": "Agreements/leighton-acquisition.pdf",
      "document_type": "Agreements"
    },
    {
      "path": "Expenses/emerson-tunt-nov-2016.pdf",
      "document_type": "Expenses"
    },
    {
      "path": "Expenses/jessie-palmer-jun-2017.pdf",
      "document_type": "Expenses"
    }
  ],
    "_embedded": {
        "classifier": {
          "name": "CLASSIFIER_NAME",
          "_links": {
            "self": {
              "href": "/classifiers/CLASSIFIER_NAME"
            },
            "classifier:add_sample": {
              "href": "/classifiers/CLASSIFIER_NAME/sample/{document_type}",
              "templated": true
            },
            "classifier:add_samples_from_zip": {
              "href": "/classifiers/CLASSIFIER_NAME/samples"
            },
            "classifier:get": {
              "href": "/classifiers/CLASSIFIER_NAME"
            }
          }
       }
    }
}
 

Add single sample file

Add a single sample file to a classifier

 
posthttps://api.waives.io/classifiers/classifier_name/sample/document_type

Path Params

classifier_name
string
required

The name of the Classifier to add the sample to

document_type
string
required

The document type of the sample

Query Params

retrain
boolean
required

Whether the classifier should be retrained after adding the sample

The request body should contain the binary contents of the sample file and the Content-Type header should be set to the MIME-type of the file.

Correct use of the "retrain" query parameter

Once samples have been added to a classifier, the classifier must be "trained". During this process the classifier analyses the samples and determines the defining characteristics of each document type. Training can only be done when there are samples (that are not empty) of at least two document types.

For optimal performance of requests to this endpoint you should only train once, when all the samples you intend to add have been added. Training multiple times won't hurt but will make requests slower.

The retrain query parameter can be used to control whether training happens after the sample is added.

When starting from a new (empty) classifier you must always set retrain=false for the first samples until you have added samples for at least two document types.

Ideally you should set retrain=false for all except the very last sample you want to add, so the training is performed only once.
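The retrain guidance above could be applied as in the following Python sketch, which plans the request URL for each sample so that only the final request triggers training. `plan_sample_uploads` is a hypothetical helper, and the `true`/`false` spelling of the query value follows the `retrain=false` form used above.

```python
from urllib.parse import quote, urlencode

API_BASE = "https://api.waives.io"  # base URL as shown for this endpoint


def plan_sample_uploads(classifier_name, samples):
    """Return a list of (url, file_path) pairs, one per sample.

    samples is a list of (document_type, file_path). retrain is false for
    every request except the last, so training runs only once.
    """
    if len({document_type for document_type, _ in samples}) < 2:
        raise ValueError("training needs samples of at least two document types")
    plan = []
    last = len(samples) - 1
    for index, (document_type, file_path) in enumerate(samples):
        query = urlencode({"retrain": "true" if index == last else "false"})
        url = (
            f"{API_BASE}/classifiers/{quote(classifier_name, safe='')}"
            f"/sample/{quote(document_type, safe='')}?{query}"
        )
        plan.append((url, file_path))
    return plan


if __name__ == "__main__":
    for url, file_path in plan_sample_uploads(
        "hr-documents",
        [("Agreements", "a.pdf"), ("Expenses", "b.pdf"), ("Expenses", "c.pdf")],
    ):
        print(file_path, "->", url)
```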

Supported file types

Files of the following file types can be used as samples:

  • PDFs that contain electronic content
  • Microsoft Office Word, Excel or PowerPoint documents
  • Text files

Image files and PDFs without electronic content cannot be used as samples. You should OCR these first and use the resulting documents as samples instead.

File Type
Content-Type header

PDFs (with electronic content)

application/pdf

Microsoft Office Word (.docx) documents

application/vnd.openxmlformats-officedocument.wordprocessingml.document

Microsoft Office Excel (.xlsx) documents

application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

Microsoft Office PowerPoint (.pptx) documents

application/vnd.openxmlformats-officedocument.presentationml.presentation

Text files

text/plain
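A small lookup keyed on file extension is one way to pick the Content-Type header from the table above. In this Python sketch, `sample_content_type` is a hypothetical helper, and mapping text files to the `.txt` extension is an assumption, since the table does not name an extension for them.

```python
import os

# Content-Type values copied from the table above.
SAMPLE_CONTENT_TYPES = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".txt": "text/plain",  # assumed extension for text files
}


def sample_content_type(filename):
    """Return the Content-Type header value for a sample file name."""
    extension = os.path.splitext(filename)[1].lower()
    try:
        return SAMPLE_CONTENT_TYPES[extension]
    except KeyError:
        raise ValueError(f"unsupported sample file type: {extension!r}") from None


if __name__ == "__main__":
    print(sample_content_type("office-rental-agreement.pdf"))
```

Note that the extension alone does not guarantee acceptance: a PDF without electronic content is still rejected.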

Note that you will need to make multiple requests to this endpoint to sufficiently train a classifier. Generally it is easier to use the Add samples from ZIP file endpoint, and you should only use this endpoint if you are tightly integrating training into another system, have very large samples, have a very large number of samples or it is inconvenient to build a ZIP file. The Add samples from ZIP file endpoint is also substantially faster when adding multiple samples.

RESPONSES

200 The sample was added to the Classifier
400 There is no file supplied in the body
401 There is no Authorization header or the access token is invalid
404 The specified Classifier does not exist
415 The Content-Type header is missing, contains an unsupported type, does not match the actual contents of the file, or the file is a PDF that does not include content (details of the exact problem are included in the error response).

POST /classifiers/CLASSIFIER_NAME/sample/DOCUMENT_TYPE HTTP/1.1
Authorization: Bearer ACCESS_TOKEN
Content-Type: SAMPLE_CONTENT_TYPE

Raw sample file content
curl https://api.cloudhub360.com/classifiers/CLASSIFIER_NAME/sample/DOCUMENT_TYPE \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: SAMPLE_CONTENT_TYPE" -X POST \
-T "sample.pdf"

{
  "samples": [
    {
      "path": null,
      "document_type": "DOCUMENT_TYPE"
    }
  ],
    "_embedded": {
        "classifier": {
          "name": "CLASSIFIER_NAME",
          "_links": {
            "self": {
              "href": "/classifiers/CLASSIFIER_NAME"
            },
            "classifier:add_sample": {
              "href": "/classifiers/CLASSIFIER_NAME/sample/{document_type}",
              "templated": true
            },
            "classifier:add_samples_from_zip": {
              "href": "/classifiers/CLASSIFIER_NAME/samples"
            },
            "classifier:get": {
              "href": "/classifiers/CLASSIFIER_NAME"
            }
          }
       }
    }
}
 

Delete classifier

Delete an existing classifier

 
deletehttps://api.waives.io/classifiers/classifier_name

Path Params

classifier_name
string
required

The name of the classifier

RESPONSES

204 The classifier was deleted, or did not exist
401 There is no Authorization header or the access token is invalid

DELETE /classifiers/CLASSIFIER_NAME HTTP/1.1
Authorization: Bearer ACCESS_TOKEN
curl https://api.cloudhub360.com/classifiers/CLASSIFIER_NAME \
-H "Authorization: Bearer ACCESS_TOKEN" \
-X DELETE
 

Create extractor

Create a new extractor from an extractor configuration file.

 
posthttps://api.waives.io/extractors/extractor_name

Path Params

extractor_name
string
required

The desired name for the extractor

The request body should contain the binary contents of an extractor configuration file.

If you have documents you wish to extract data from, please talk to us and we can either create a configuration for you or help you to install the offline extraction configuration tool and train you to use it.

A number of off-the-shelf configurations for extracting header and item data from invoices are available on request from the CloudHub360 team. Versions tuned for various different countries are available.

RESPONSES

201 The extractor was created
400 There is already an extractor with the specified name
401 There is no Authorization header or the access token is invalid

POST /extractors/EXTRACTOR_NAME HTTP/1.1
Authorization: Bearer ACCESS_TOKEN
curl https://api.cloudhub360.com/extractors/EXTRACTOR_NAME \
-H "Authorization: Bearer ACCESS_TOKEN" \
-X POST \
-T "my_extractor.fpxlc"

{
  "name": "EXTRACTOR_NAME",
  "_links": {
    "self": {
      "href": "/extractors/EXTRACTOR_NAME"
    },
    "extracter:get": {
      "href": "/extractors/EXTRACTOR_NAME"
    }
  }
}
 

Get extractor

Get the details of an existing extractor

 
gethttps://api.waives.io/extractors/extractor_name

RESPONSES

200 The extractor exists, and its details are included in the response
401 There is no Authorization header or the access token is invalid
404 There is no extractor with the specified name

GET /extractors/EXTRACTOR_NAME HTTP/1.1
Authorization: Bearer ACCESS_TOKEN
curl https://api.cloudhub360.com/extractors/EXTRACTOR_NAME \
-H "Authorization: Bearer ACCESS_TOKEN" \
-X GET

{
  "name": "EXTRACTOR_NAME",
  "_links": {
    "self": {
      "href": "/extractors/EXTRACTOR_NAME"
    },
    "extracter:get": {
      "href": "/extractors/EXTRACTOR_NAME"
    }
  }
}
 

Get all extractors

Get the details of all extractors

 
gethttps://api.waives.io/extractors

RESPONSES

200 The details of all extractors are contained in the response
401 There is no Authorization header or the access token is invalid

GET /extractors HTTP/1.1
Authorization: Bearer ACCESS_TOKEN
curl https://api.cloudhub360.com/extractors \
-H "Authorization: Bearer ACCESS_TOKEN" \
-X GET

{
  "extractors": [
    {
      "name": "invoice-us",
      "_links": {
        "self": {
          "href": "/extractors/invoice-us"
        },
        "extracter:get": {
          "href": "/extractors/invoice-us"
        }
      }
    },
    {
      "name": "invoice-uk",
      "_links": {
        "self": {
          "href": "/extractors/invoice-uk"
        },
        "extracter:get": {
          "href": "/extractors/invoice-uk"
        }
      }
    }
  ]
}
 

Delete extractor

Delete an existing extractor

 
deletehttps://api.waives.io/extractors/extractor_name

Path Params

extractor_name
string
required

The name of the extractor

RESPONSES

204 The extractor was deleted, or did not exist
401 There is no Authorization header or the access token is invalid

DELETE /extractors/EXTRACTOR_NAME HTTP/1.1
Authorization: Bearer ACCESS_TOKEN
curl https://api.cloudhub360.com/extractors/EXTRACTOR_NAME \
-H "Authorization: Bearer ACCESS_TOKEN" \
-X DELETE