PDF Server

[NEW] Python client-side library available here

Structure of This Document

Introduction
Getting Started
Admin Site
RESTful API

Introduction

Background

PDF ebooks usually have a table of contents (TOC), so the text in a PDF ebook is naturally organized in a hierarchy. The text and hierarchy information extracted from PDF ebooks can be used for research in machine learning (especially hierarchical text classification) and natural language processing.

The goal of this project is to provide a tool to manage and maintain a large database of texts extracted from PDF ebooks, and to let users access and upload versioned data through RESTful API endpoints. This makes it easy to share text data and to collaborate within a team of researchers or across teams.

Project Scope

The purpose of this project is to provide a RESTful backend and an admin site to

Organize and manage PDF ebooks
Extract TOC hierarchy and section texts from PDF ebooks
Store all data extracted from PDF ebooks in relational databases
Access the TOC, text and more through a handful of RESTful APIs
Post your own version of processed texts of a book/chapter and share it with other researchers
(*Optional) Clean up and lemmatize the extracted texts using the built-in cleaner

Skills & Tools Required

Adobe Acrobat (2015 DC or later) is required to pre-process the PDF files, delete useless pages/chapters, update/correct bookmarks, and convert them into HTML files split by bookmarks. The PDF and HTML files will be then fed into the system.
Adequate knowledge of Django is required to be able to maintain and further expand this project. Please make sure you understand Django well enough before reading the rest of this document.
Due to copyright issues, we cannot distribute PDF ebooks or any processed text extracted from them. This project contains only the source code, without any database or files. You need to set up your own database and connect Django to it before running the system.

Text Cleaning Techniques

The built-in cleaner performs the following operations sequentially on the extracted texts:

Perform known replacements (e.g. "ﬁ" -> "fi"; if you don't see any difference, try to copy the first "ﬁ" as TWO letters, "f" and "i". You will realize you can't, because "ﬁ" is actually ONE unicode character).
Fix broken line-joins (e.g. "fric- tion" -> "friction").
Remove redundant newlines and spaces.
Remove duplicate paragraphs.
Remove other obvious noise generated by Acrobat Pro.
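The first four operations above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in the extractor package: the function name `clean`, the `LIGATURES` table, and the exact regular expressions are assumptions made for this example.

```python
import re

# Illustrative, partial table of single-character ligatures (an assumption,
# not the project's actual replacement list).
LIGATURES = {"\ufb01": "fi", "\ufb02": "fl", "\ufb00": "ff"}


def clean(text: str) -> str:
    # 1. Known replacements: expand each ligature into plain letters.
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    # 2. Re-join words broken by a hyphen and one space/newline:
    #    "fric- tion" -> "friction".
    text = re.sub(r"(\w)-[ \n](\w)", r"\1\2", text)
    # 3. Collapse runs of spaces/tabs and runs of three or more newlines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    # 4. Drop duplicate paragraphs, keeping the first occurrence.
    seen, kept = set(), []
    for paragraph in text.split("\n\n"):
        key = paragraph.strip()
        if key and key not in seen:
            seen.add(key)
            kept.append(key)
    return "\n\n".join(kept)


raw = "The coef\ufb01cient of fric- tion   is low.\n\n\n\nRepeated.\n\nRepeated."
print(clean(raw))
```

Note that the hyphen rule in step 2 is deliberately naive; it would also join legitimately hyphenated words that happen to be followed by a space.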

The following built-in features in Cleaner are NOT enabled by default. You may manually customize the cleaning procedure by modifying the source code in the extractor package.

Replace non-ASCII characters with underscore ("_")
Remove all URIs
Remove all emails
Remove stop words according to the MySQL stop word list
Remove all punctuation marks
Remove all digits
Remove one-letter and two-letter words
Lemmatize
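To give a feel for what these optional steps do to a sentence, here is a rough stand-alone sketch. It is not the extractor's implementation: the function name `aggressive_clean`, the regular expressions, the tiny `STOP_WORDS` placeholder (not the MySQL list), and the order of the steps are all assumptions. Lemmatization is left out because it needs NLTK and its WordNet data.

```python
import re
import string

# Tiny placeholder set; the real cleaner uses the MySQL stop word list.
STOP_WORDS = {"the", "and", "are", "for"}


def aggressive_clean(text, stop_words=STOP_WORDS):
    text = re.sub(r"[^\x00-\x7f]", "_", text)            # non-ASCII -> "_"
    text = re.sub(r"\b(?:https?|ftp)://\S+", " ", text)  # drop URIs
    text = re.sub(r"\S+@\S+\.\S+", " ", text)            # drop emails
    # Drop punctuation (this also removes the "_" placeholders added above).
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d", "", text)                       # drop digits
    # Drop stop words and one- or two-letter words.
    words = [w for w in text.split()
             if len(w) > 2 and w.lower() not in stop_words]
    return " ".join(words)


sample = ("Visit http://example.com or mail me@example.com: "
          "the 3 caf\u00e9 tables are OK in 2016!")
print(aggressive_clean(sample))
```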

Getting Started

Install Python 3
Clone the project
Install python dependencies:
$ pip install -r requirements.txt
Set up your own database and update the connection configuration in settings.py, then migrate.
Run the development server
$ python3 manage.py runserver
(*Optional) Download WordNet data for NLTK
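For the database step, the connection is configured through Django's standard `DATABASES` setting. The fragment below is only a sketch: the engine, the database name, and the `PDF_SERVER_DB_*` environment variable names are placeholders, so substitute whatever matches your own database.

```python
import os

# Fragment for settings.py. Every value here is a placeholder chosen for
# illustration; use the engine and credentials of your own database.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("PDF_SERVER_DB_NAME", "pdf_server"),
        "USER": os.environ.get("PDF_SERVER_DB_USER", ""),
        "PASSWORD": os.environ.get("PDF_SERVER_DB_PASSWORD", ""),
        "HOST": os.environ.get("PDF_SERVER_DB_HOST", "127.0.0.1"),
        "PORT": os.environ.get("PDF_SERVER_DB_PORT", "5432"),
    }
}
```

After editing settings.py, the tables are created with the usual `python3 manage.py migrate`.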

Admin Site

Create Superuser

This is a standard admin site of Django. It can be accessed by the URL <your-domain>/admin/. If you are running the default Django development server, the complete URL is http://127.0.0.1:8000/admin/.

Before logging in to the admin site for the first time, you may need to create a superuser. This can be done by

$ python3 manage.py createsuperuser

If this is unfamiliar, please learn Django before trying the following steps. See Introduction.

User Groups and Object-level Permission Control

Before creating any normal user, you must create a user group in Django's Group model with ONLY the following permissions for each model:

book | book | Can change book
section | section | Can change section
content | content | Can add content
content | content | Can change content
content | content | Can delete content
version | version | Can add version
version | version | Can change version
version | version | Can delete version

When creating a normal user, tick "Staff status" and select the user group above. The new user can then access a special admin site with object-level permission control. Such a user ("normal user") can only view the models, and update/delete the entries they own (e.g. the version/content they created).
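The group can also be created from `python3 manage.py shell` instead of the admin UI. The sketch below assumes that each app label equals its model name (as the "book | book" entries in the list suggest) and uses a hypothetical group name, `researchers`. Permission codenames follow Django's `<action>_<model>` convention.

```python
# Permissions from the list above, as {model name: [actions]}.
GROUP_PERMISSIONS = {
    "book": ["change"],
    "section": ["change"],
    "content": ["add", "change", "delete"],
    "version": ["add", "change", "delete"],
}


def required_codenames(spec=GROUP_PERMISSIONS):
    """Return (app_label, codename) pairs, e.g. ("content", "add_content")."""
    return [(model, f"{action}_{model}")
            for model, actions in spec.items()
            for action in actions]


def create_group(name="researchers"):
    """Create the group; only works inside a configured Django project."""
    # Imported lazily so the helpers above can be used without Django.
    from django.contrib.auth.models import Group, Permission

    group, _ = Group.objects.get_or_create(name=name)
    permissions = [
        Permission.objects.get(content_type__app_label=app, codename=codename)
        for app, codename in required_codenames()
    ]
    group.permissions.set(permissions)  # replaces any existing permissions
    return group
```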

Create an Entry for the `Book` Model

The following fields are required to create a new entry.

Field	Explanation
Title	The title of the PDF book
Toc html path	The full path of the HTML file generated by Adobe Acrobat that contains the table of contents*.

*NOTE: There must be a directory with the same name next to the HTML file. This is the default output layout of Adobe Acrobat when converting PDF to HTML with the "Split by Bookmarks" option turned on.
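A quick way to check this layout before creating the Book entry is sketched below. It assumes that "the same name" means the HTML file name without its extension (e.g. `handbook.html` next to `handbook/`); the helper name `has_sibling_directory` is made up for this example.

```python
from pathlib import Path


def has_sibling_directory(toc_html_path):
    """True if the HTML file exists and a same-named directory sits next to it."""
    path = Path(toc_html_path)
    # with_suffix("") strips the extension: handbook.html -> handbook
    return path.is_file() and path.with_suffix("").is_dir()
```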

Process a Book

After creating a Book entry,

Go back to the model page /admin/book/book
Tick the book that was newly created
Select "Process book" in the "Action" dropdown menu
Click "GO"

This will take from a few seconds to a few minutes for most cases, depending on the structure and the length of the book, and the time complexity of the text cleaning techniques.

You may browse other pages, or add another book, once you have clicked "GO". You can always go back to the book list page and check the "is processed" flag to see whether processing of a newly added book has completed. You can also follow the progress in the terminal where you started the Django server.

RESTful API

Overview

The docs and an emulated client are available at http://<your-domain>/docs/

Only authenticated users can access any of the APIs. By default, only Basic Authentication and Session Authentication are enabled in settings.py. However, you may customize the authentication process to include more authentication methods.

The root URL of all RESTful APIs is /api/v1 (e.g. the book-list API is at http://<your-domain>/api/v1/book/list/). There are four sub-groups of API endpoints: Book, Section, Version and Content.

Group	URL
Book	`/book`
Section	`/section`
Version	`/version`
Content	`/content`
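As a sketch of how a client might put these pieces together, the helpers below build an endpoint URL from the root, group, and endpoint name, and a Basic Authentication header using only the standard library. The helper names and the example credentials are made up for illustration.

```python
import base64

API_ROOT = "/api/v1"


def api_url(domain, group, endpoint, *args):
    """Build e.g. http://<domain>/api/v1/book/list/ (with trailing slash)."""
    parts = [group, endpoint, *map(str, args)]
    return f"http://{domain}{API_ROOT}/" + "/".join(parts) + "/"


def basic_auth_header(username, password):
    """Return the HTTP Basic Authentication header for the given credentials."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}


print(api_url("127.0.0.1:8000", "book", "list"))
print(api_url("127.0.0.1:8000", "section", "toc", 26))
# To send a request (needs a running server):
#   import urllib.request
#   req = urllib.request.Request(api_url("127.0.0.1:8000", "book", "list"),
#                                headers=basic_auth_header("admin", "secret"))
#   urllib.request.urlopen(req)
```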

Book

Endpoint	URL	Method	Permission
List	`/list/`	GET	Any user
Detail	`/detail/{pk}/`	GET	Any user
TOC	`/toc/{pk}/`	GET	Any user

List

HTTP responses:

Case	HTTP Response
Successful	`200 OK`
Login credentials not accepted	`401 Unauthorized`

Example of response data if successful:

[
	{
		"id": 2,
		"title": "My Sample Handbook",
		"root_section": 25
	},
	{
		"id": 3,
		"title": "My Sample Textbook",
		"root_section": 72
	}
]

More details on the fields:

Field	Type	Explanation
`id`	`int`	The id of the book
`title`	`string`	The title of the book
`root_section`	`int`	The id of the root section that represents the entire book

Detail

Parameters in URL:

Parameter	Type	Explanation
`pk`	`int`	The id of the book

HTTP responses:

Case	HTTP Response
Successful	`200 OK`
Login credentials not accepted	`401 Unauthorized`
No such book available	`404 Not Found`

Example of response data if successful:

{
	"id": 7,
	"title": "Digital Signal Processing System Analysis and Design",
	"root_section": 1011
}

This is a single element of the list returned by the List API above.

TOC

Example of response data if successful:

{
	"title": "My Sample Handbook",
	"id": 25,
	"children": [
		{
			"title": "Chapter 1",
			"id": 26,
			"children": [
				{
					"title": "Section 1.1",
					"id": 27,
					"children": []
				},
				{
					"title": "Section 1.2",
					"id": 28,
					"children": []
				}
			]
		},
		{
			"title": "Chapter 2",
			"id": 29,
			"children": []
		}
	]
}

This is a nested, recursive JSON object that represents the table of contents tree. Each node in the tree has the following fields:

Field	Type	Explanation
`title`	`string`	The title of the section
`id`	`int`	The id of the section
`children`	`array`	An array of immediate children nodes of the current node
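Because every node has the same three fields, the tree can be processed with a short recursive function. The sketch below flattens a TOC response (already parsed from JSON into Python dicts) into an indented outline; the function name `walk_toc` is made up for this example.

```python
def walk_toc(node, depth=0):
    """Yield (depth, id, title) for the node and all descendants, pre-order."""
    yield depth, node["id"], node["title"]
    for child in node["children"]:
        yield from walk_toc(child, depth + 1)


# The sample TOC response from this document, as a Python dict.
toc = {
    "title": "My Sample Handbook", "id": 25, "children": [
        {"title": "Chapter 1", "id": 26, "children": [
            {"title": "Section 1.1", "id": 27, "children": []},
            {"title": "Section 1.2", "id": 28, "children": []},
        ]},
        {"title": "Chapter 2", "id": 29, "children": []},
    ],
}

for depth, section_id, title in walk_toc(toc):
    print("  " * depth + f"{title} (id={section_id})")
```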

Section

Endpoint	URL	Method	Permission
Detail	`/detail/{pk}/`	GET	Any user
Children	`/children/{pk}/`	GET	Any user
Versions	`/versions/{pk}/`	GET	Any user
Partial TOC	`/toc/{pk}`	GET	Any user

Detail

Parameters in URL:

Parameter	Type	Explanation
`pk`	`int`	The id of the section

HTTP responses:

Case	HTTP Response
Successful	`200 OK`
Login credentials not accepted	`401 Unauthorized`
No such section available	`404 Not Found`

Example of response data if successful:

{
	"title": "Section 1.1",
	"id": 27,
	"has_children": false
}

Children

Parameters in URL:

Parameter	Type	Explanation
`pk`	`int`	The id of the section

HTTP responses:

Case	HTTP Response
Successful	`200 OK`
Login credentials not accepted	`401 Unauthorized`
No such section available	`404 Not Found`

The response is an array of objects in the same format as the Detail (/detail/{pk}/) response, one for each child of the queried section.

Versions

Parameters in URL:

Parameter	Type	Explanation
`pk`	`int`	The id of the section

HTTP responses:

Case	HTTP Response
Successful	`200 OK`
Login credentials not accepted	`401 Unauthorized`
No such section available	`404 Not Found`

Example of response data if successful:

[
	{
		"id": 1,
		"name": "Raw",
		"created_by": null,
		"timestamp": "2016-09-02 20:00:00"
	},
	{
		"id": 2,
		"name": "Cleaned text for machine",
		"created_by": "admin",
		"timestamp": "2016-09-02 21:00:00"
	}
]

The response is an array of all the versions associated with the section. See the Version API below for more details.

Partial TOC from a Section

Parameters in URL:

Parameter	Type	Explanation
`pk`	`int`	The id of the section

HTTP responses:

Case	HTTP Response
Successful	`200 OK`
Login credentials not accepted	`401 Unauthorized`
No such section available	`404 Not Found`

Example of response data if successful:

{
	"title": "Chapter 1",
	"id": 26,
	"children": [
		{
			"title": "Section 1.1",
			"id": 27,
			"children": []
		},
		{
			"title": "Section 1.2",
			"id": 28,
			"children": []
		}
	]
}

The response is the partial TOC starting from the queried section. For more details on TOC, see the TOC API (under Book) above.

Version

Endpoint	URL	Method	Permission
List	`/list`	GET	Any user
Detail	`/detail/{pk}/`	GET	Any user
Create	`/create/`	PUT	Any user
Update	`/update/{pk}/`	POST	Version creator
Delete	`/delete/{pk}/`	DELETE	Version creator

List

HTTP responses:

Case	HTTP Response
Successful	`200 OK`
Login credentials not accepted	`401 Unauthorized`

Example of response data if successful:

[
	{
		"id": 1,
		"name": "Raw",
		"created_by": null,
		"timestamp": "2016-09-02 20:00:00"
	},
	{
		"id": 2,
		"name": "Cleaned text for machine",
		"created_by": "admin",
		"timestamp": "2016-09-02 21:00:00"
	}
]

The response is an array of "Version"s. The default version, "Raw", is included.

More details on the fields:

Field	Type	Explanation
`id`	`int`	The id of the version
`name`	`string`	The name of the version
`created_by`	`string`	The user name of the version's creator (`null` for the default "Raw" version, as in the example above)
`timestamp`	`string`	The timestamp for version creation
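The timestamps in the examples look like `YYYY-MM-DD HH:MM:SS`. Assuming that format holds for all responses (the time zone is not stated in this document), a client could pick the most recently created version as follows. The helper name `latest_version` is made up for this example.

```python
from datetime import datetime

# Format inferred from the example responses; treat it as an assumption.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def latest_version(versions):
    """Return the version dict with the most recent timestamp."""
    return max(
        versions,
        key=lambda v: datetime.strptime(v["timestamp"], TIMESTAMP_FORMAT),
    )


# The example list response, parsed from JSON (null becomes None).
versions = [
    {"id": 1, "name": "Raw", "created_by": None,
     "timestamp": "2016-09-02 20:00:00"},
    {"id": 2, "name": "Cleaned text for machine", "created_by": "admin",
     "timestamp": "2016-09-02 21:00:00"},
]
print(latest_version(versions)["name"])
```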

Detail

Parameters in URL:

Parameter	Type	Explanation
`pk`	`int`	The id of the version

HTTP responses:

Case	HTTP Response
Successful	`200 OK`
Login credentials not accepted	`401 Unauthorized`
No such version available	`404 Not Found`

Response data if successful:

{
	"id": 1,
	"name": "Raw",
	"created_by": null,
	"timestamp": "2016-09-02 20:00:00"
}

The response is a single element of the array returned by the List API above.

Create

Request example:

{
	"name": "Cleaned text for coref"
}

HTTP responses:

Case	HTTP Response
Successful	`201 Created`
Login credentials not accepted	`401 Unauthorized`

Example of response data if successful:

{
	"id": 3,
	"name": "Cleaned text for coref",
	"created_by": "admin",
	"timestamp": "2016-09-02 22:00:00"
}
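A Create call could be prepared with the standard library as sketched below. The PUT method and the /version/create/ path come from the endpoint table above; sending the body as JSON with a `Content-Type: application/json` header is an assumption, as are the function name and the base URL.

```python
import json
import urllib.request


def build_create_version_request(base_url, name, headers=None):
    """Prepare (but do not send) a PUT request that creates a version."""
    body = json.dumps({"name": name}).encode("utf-8")
    # JSON content type is an assumption; extra headers (e.g. auth) are merged in.
    all_headers = {"Content-Type": "application/json", **(headers or {})}
    return urllib.request.Request(
        f"{base_url}/api/v1/version/create/",
        data=body,
        headers=all_headers,
        method="PUT",
    )


request = build_create_version_request(
    "http://127.0.0.1:8000", "Cleaned text for coref"
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it to a running server.
```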

Update

Parameters in URL:

Parameter	Type	Explanation
`pk`	`int`	The id of the version

Request example:

{
	"name": "Another name"
}

HTTP responses:

Case	HTTP Response
Successful	`200 OK`
Login credentials not accepted	`401 Unauthorized`
User is not the version creator	`403 Forbidden`
No such version available	`404 Not Found`

Example of response data if successful:

{
	"id": 3,
	"name": "Another name",
	"created_by": "admin",
	"timestamp": "2016-09-02 22:00:00"
}

Delete

Parameters in URL:

Parameter	Type	Explanation
`pk`	`int`	The id of the version

HTTP responses:

Case	HTTP Response
Successful	`204 No Content`
Login credentials not accepted	`401 Unauthorized`
User is not the version creator	`403 Forbidden`
No such version available	`404 Not Found`

When a version is deleted, all contents associated with that version will be deleted as well. For more details on contents, see the Content API below.

Content

Endpoint	URL	Method	Permission
Immediate Text	`/immediate/{section}/{version}/`	GET	Any user
Aggregate Text	`/aggregate/{section}/{version}/`	GET	Any user
Post	`/post/{section}/{version}/`	POST	Version creator

Immediate Text

Parameters in URL:

Parameter	Type	Explanation
`section`	`int`	The id of the section
(*optional) `version`	`int`	The id of the version. If no version is specified, the raw version will be chosen by default.

HTTP responses:

Case	HTTP Response
Successful	`200 OK`
Login credentials not accepted	`401 Unauthorized`
No such section/version available	`404 Not Found`

Example of response data if successful:

The response body contains the clear text of a certain version of the section.

Aggregate Text

Parameters in URL:

Parameter	Type	Explanation
`section`	`int`	The id of the section
(*optional) `version`	`int`	The id of the version. If no version is specified, the raw version will be chosen by default.

HTTP responses:

Case	HTTP Response
Successful	`200 OK`
Login credentials not accepted	`401 Unauthorized`
No such section/version available	`404 Not Found`

Example of response data if successful:

The response body contains the clear text of a certain version of the section, and ALL ITS DESCENDANTS, in the original page order.
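The difference between the two endpoints can be illustrated locally. In the sketch below, `immediate_text` is a made-up mapping from section id to that section's own text, and the aggregate is built by visiting the section and then its children in TOC order. This is only a model of the behaviour described above, not the server's code; joining with newlines and using TOC order as a stand-in for page order are assumptions.

```python
def aggregate_text(node, immediate_text):
    """Concatenate a section's own text with the text of all its descendants."""
    parts = [immediate_text.get(node["id"], "")]
    for child in node["children"]:
        parts.append(aggregate_text(child, immediate_text))
    return "\n".join(part for part in parts if part)


chapter = {
    "title": "Chapter 1", "id": 26, "children": [
        {"title": "Section 1.1", "id": 27, "children": []},
        {"title": "Section 1.2", "id": 28, "children": []},
    ],
}
# Hypothetical per-section texts, keyed by section id.
immediate_text = {26: "Chapter intro.", 27: "Text of 1.1.", 28: "Text of 1.2."}

print(immediate_text[26])                       # like Immediate Text
print(aggregate_text(chapter, immediate_text))  # like Aggregate Text
```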

Post

Parameters in URL:

Parameter	Type	Explanation
`section`	`int`	The id of the section
`version`	`int`	The id of the version

The body of the request contains the location of the text file that holds the contents of a specific version of a section, e.g. output.txt

HTTP responses:

Case	HTTP Response
Successful	`200 OK`
Login credentials not accepted	`401 Unauthorized`
User is not the version creator	`403 Forbidden`
No such section/version available	`404 Not Found`
