thousandlemons/pdf-server

PDF Server

[NEW] Python client-side library available here

Background

PDF ebooks usually have a table of contents (TOC), so the text in a PDF ebook is naturally organized in a hierarchy. The text and hierarchy information extracted from PDF ebooks can be used for research in machine learning (especially hierarchical text classification) and natural language processing.

The goal of this project is to provide a tool to manage and maintain a large database of texts extracted from PDF ebooks, and to let users access and upload versioned data through RESTful API endpoints. This makes text data easy to share and supports collaboration among researchers within a team or across teams.

Project Scope

The purpose of this project is to provide a RESTful backend and an admin site to

  • Organize and manage PDF ebooks
  • Extract TOC hierarchy and section texts from PDF ebooks
  • Store all data extracted from PDF ebooks in relational databases
  • Access the TOC, text and more through a handful of RESTful APIs
  • Post your own version of processed texts of a book/chapter and share it with other researchers
  • (*Optional) Clean up and lemmatize the extracted texts using the built-in cleaner

Skills & Tools Required

  • Adobe Acrobat (2015 DC or later) is required to pre-process the PDF files: delete useless pages/chapters, update or correct bookmarks, and convert the files into HTML files split by bookmarks. The PDF and HTML files are then fed into the system.
  • Adequate knowledge of Django is required to maintain and further expand this project. Please make sure you understand Django well enough before reading the rest of this document.
  • Due to copyright issues, we cannot distribute PDF ebooks or any processed text extracted from them. This project contains only the source code, without any database or files. You are expected to set up your own database and connect Django to it before running the system.

Text Cleaning Techniques

The built-in cleaner performs the following operations sequentially on the extracted texts:

  1. Perform known replacements (e.g. "ﬁ" -> "fi"; if you don't see any difference, try to copy the first "ﬁ" as TWO letters, "f" and "i". You will find that you can't, because "ﬁ" is actually ONE Unicode character).
  2. Fix broken line-joins (e.g. "fric- tion" -> "friction").
  3. Remove redundant newlines and spaces.
  4. Remove duplicate paragraphs.
  5. Remove other obvious noise generated by Acrobat Pro.
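
The first four default steps can be approximated in a few lines of Python. This is a minimal sketch, not the project's actual cleaner: the function name `clean_text`, the `LIGATURES` table and the regular expressions are assumptions made for illustration, and step 5 (Acrobat-specific noise) is left out.

```python
import re

# Hypothetical replacement table; the real cleaner's list is not shown in this document.
LIGATURES = {"\ufb01": "fi", "\ufb02": "fl", "\ufb00": "ff"}


def clean_text(text):
    """Apply steps 1-4 of the default cleaning procedure (illustrative only)."""
    # 1. Known replacements, e.g. the single character U+FB01 becomes "fi".
    for old, new in LIGATURES.items():
        text = text.replace(old, new)
    # 2. Re-join words hyphenated across a line break: "fric- tion" -> "friction".
    text = re.sub(r"(\w)-\s+(\w)", r"\1\2", text)
    # 3. Collapse runs of spaces/tabs, trim spaces around newlines, squeeze blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    # 4. Drop paragraphs that repeat an earlier paragraph verbatim.
    seen, kept = set(), []
    for paragraph in text.strip().split("\n\n"):
        if paragraph not in seen:
            seen.add(paragraph)
            kept.append(paragraph)
    return "\n\n".join(kept)
```

The naive pattern in step 2 would also join a legitimate hyphen followed by whitespace, so a real cleaner would likely need a dictionary check or a stricter rule.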

The following built-in features in Cleaner are NOT enabled by default. You may manually customize the cleaning procedure by modifying the source code in the extractor package.

  1. Replace non-ascii characters with underscore ("_")
  2. Remove all URIs
  3. Remove all emails
  4. Remove stop words according to the MySQL stop word list
  5. Remove all punctuation marks
  6. Remove all digits
  7. Remove one-letter and two-letter words
  8. Lemmatize
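
As a rough illustration of what the optional steps do to a sentence, the sketch below applies steps 1 to 7 in the listed order. The function name `aggressive_clean`, the regular expressions and the tiny `STOP_WORDS` set are assumptions; the actual extractor package may implement each step differently. Lemmatization (step 8) is omitted because it depends on NLTK and the WordNet data.

```python
import re
import string

# Tiny stand-in for the MySQL stop word list mentioned above.
STOP_WORDS = {"the", "and", "for", "with"}


def aggressive_clean(text):
    """Apply optional steps 1-7 in order (lemmatization is left out)."""
    text = re.sub(r"[^\x00-\x7f]", "_", text)    # 1. non-ASCII -> "_"
    text = re.sub(r"\b\w+://\S+", " ", text)     # 2. URIs of the form scheme://...
    text = re.sub(r"\S+@\S+\.\S+", " ", text)    # 3. email addresses
    # 4. Stop words (matched case-insensitively, before punctuation is stripped).
    words = [w for w in text.split() if w.lower() not in STOP_WORDS]
    # 5. and 6. Strip punctuation marks and digits from each remaining word.
    table = str.maketrans("", "", string.punctuation + string.digits)
    words = [w.translate(table) for w in words]
    # 7. Drop one-letter and two-letter words (and words emptied by steps 5-6).
    return " ".join(w for w in words if len(w) > 2)


print(aggressive_clean("See http://example.com or mail bob@example.com for the 42 big cats, ok"))
```

Because the steps run sequentially, their order matters: in this sketch the underscores introduced by step 1 are removed again by step 5, since "_" counts as punctuation.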

Getting Started

  1. Install Python 3
  2. Clone the project
  3. Install Python dependencies:
    $ pip install -r requirements.txt
  4. Set up your own database and update the connection configuration in settings.py, then migrate.
  5. Run the development server
    $ python3 manage.py runserver
  6. (*Optional) Download WordNet data for NLTK
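
For step 4, the connection configuration is Django's standard DATABASES setting. The fragment below is a sketch that assumes a PostgreSQL database; the database name, user and environment variable names are placeholders, and any backend supported by Django can be substituted.

```python
import os

# Placeholder values throughout; replace them with your own database details.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("PDF_SERVER_DB_NAME", "pdf_server"),
        "USER": os.environ.get("PDF_SERVER_DB_USER", "pdf_server"),
        "PASSWORD": os.environ.get("PDF_SERVER_DB_PASSWORD", ""),
        "HOST": os.environ.get("PDF_SERVER_DB_HOST", "127.0.0.1"),
        "PORT": os.environ.get("PDF_SERVER_DB_PORT", "5432"),
    }
}
```

After editing settings.py, `python3 manage.py migrate` creates the tables. Older Django releases name the PostgreSQL engine `django.db.backends.postgresql_psycopg2`, so check which Django version requirements.txt pins.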

Admin Site

Create Superuser

This is a standard Django admin site, available at <your-domain>/admin/. If you are running the default Django development server, the complete URL is http://127.0.0.1:8000/admin/.

Before logging in to the admin site for the first time, you may need to create a superuser:

$ python3 manage.py createsuperuser

If this step is unfamiliar, please learn Django before trying the following steps. See Introduction.

User Groups and Object-level Permission Control

Before creating any normal user, you must create a user group in Django's Group model with ONLY the following permissions:

  • book | book | Can change book
  • section | section | Can change section
  • content | content | Can add content
  • content | content | Can change content
  • content | content | Can delete content
  • version | version | Can add version
  • version | version | Can change version
  • version | version | Can delete version

When creating a normal user, tick "Staff status" and select the user group above. The new user can then access a special admin site with object-level permission control. Such a user ("normal user") can only view the models and update or delete the entries they own (e.g. the versions and contents they created).
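
The group can also be created from `python3 manage.py shell` instead of clicking through the admin site. The sketch below is one possible way to do it, assuming the app labels match the first column of the permission list above; the group name `researchers` and both function names are hypothetical.

```python
# Permissions listed above, keyed by model name (assumed to equal the app label).
REQUIRED = {
    "book": ["change"],
    "section": ["change"],
    "content": ["add", "change", "delete"],
    "version": ["add", "change", "delete"],
}


def required_codenames():
    """Return Django permission codenames such as 'change_book'."""
    return [f"{action}_{model}" for model, actions in REQUIRED.items() for action in actions]


def create_researcher_group(name="researchers"):
    """Create the group and give it exactly the permissions in REQUIRED."""
    # Imported lazily so required_codenames() works without Django configured.
    from django.contrib.auth.models import Group, Permission

    group, _ = Group.objects.get_or_create(name=name)
    permissions = Permission.objects.filter(
        content_type__app_label__in=list(REQUIRED),
        codename__in=required_codenames(),
    )
    group.permissions.set(permissions)  # replaces any permissions set earlier
    return group
```

`group.permissions.set(...)` is available in recent Django versions; on very old releases, direct assignment to `group.permissions` was the usual form.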

Create an Entry for the Book Model

The following fields are required to create a new entry.

Field | Explanation
Title | The title of the PDF book
Toc html path | The full path of the HTML file generated by Adobe Acrobat that contains the table of contents*

*NOTE: There must be a directory of the same name next to the HTML file. This is the default output format of Adobe Acrobat when converting PDF to HTML with the "Split by Bookmarks" option turned on.
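
The layout requirement in the note can be checked before creating the entry. The helper below is a small sketch; its name is hypothetical, and it assumes that "the same name" means the HTML file name without its extension (for example, a hypothetical handbook.html next to a directory called handbook).

```python
from pathlib import Path


def has_acrobat_layout(toc_html_path):
    """Check that `<name>.html` exists and has a sibling directory `<name>`.

    Assumes "same name" means the file name without its extension.
    """
    html = Path(toc_html_path)
    return html.is_file() and html.with_suffix("").is_dir()
```

Running this on the value intended for "Toc html path" gives an early warning if the Acrobat export was moved or only partially copied.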

Process a Book

After creating a Book entry,

  1. Go back to the model page /admin/book/book
  2. Tick the book that was newly created
  3. Select "Process book" in the "Action" dropdown menu
  4. Click "GO"

Processing usually takes from a few seconds to a few minutes, depending on the structure and length of the book and on the time complexity of the text cleaning techniques.

Once you have clicked "GO", you may browse other pages or add another book. You can always go back to the book list page and check the "is processed" flag to see whether processing of a newly added book has completed. You can also follow the progress in the terminal where you started the Django server.

RESTful API

Overview

The docs and an emulated client are available at http://<your-domain>/docs/

Only authenticated users can access any of the APIs. By default, only Basic Authentication and Session Authentication are enabled in settings.py. However, you may customize the authentication process to include more authentication methods.

The root URL of all RESTful APIs is /api/v1 (e.g. the book-list API is at http://<your-domain>/api/v1/book/list/). There are four sub-groups of API endpoints: Book, Section, Version and Content.

Group URL
Book /book
Section /section
Version /version
Content /content
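
A client only needs the root URL, the group prefix and Basic Authentication to call any endpoint. The following sketch uses only the Python standard library; `BASE_URL` assumes the Django development server, and the helper names are made up for this example. Session Authentication is not covered.

```python
import base64
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumption: the Django development server
API_ROOT = "/api/v1"


def api_url(group, *parts):
    """Build e.g. http://127.0.0.1:8000/api/v1/book/list/ from ('book', 'list')."""
    path = "/".join(str(p) for p in (group, *parts))
    return f"{BASE_URL}{API_ROOT}/{path}/"


def basic_auth_header(username, password):
    """Return the Authorization header value for HTTP Basic Authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def get_json(url, username, password):
    """GET a URL with Basic Authentication and decode the JSON response body."""
    request = urllib.request.Request(
        url, headers={"Authorization": basic_auth_header(username, password)}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, `get_json(api_url("book", "list"), "admin", "secret")` would fetch the book list, given a running server and valid credentials.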

Book

Endpoint | URL | Method | Permission
List | /list/ | GET | Any user
Detail | /detail/{pk}/ | GET | Any user
TOC | /toc/{pk}/ | GET | Any user

List

HTTP responses:

Case HTTP Response
Successful 200 OK
Login credentials not accepted 401 Unauthorized

Example of response data if successful:

[
	{
		"id": 2,
		"title": "My Sample Handbook",
		"root_section": 25
	},
	{
		"id": 3,
		"title": "My Sample Textbook",
		"root_section": 72
	}
]

More details on the fields:

Field Type Explanation
id int The id of the book
title string The title of the book
root_section int The id of the root section that represents the entire book
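
The `root_section` field is the entry point into the Section API for a given book. As a small sketch, the snippet below decodes the sample response shown above and builds a lookup from title to root section id; the variable names are illustrative.

```python
import json

# Sample body copied from the example response above.
body = """
[
    {"id": 2, "title": "My Sample Handbook", "root_section": 25},
    {"id": 3, "title": "My Sample Textbook", "root_section": 72}
]
"""

books = json.loads(body)
root_section_by_title = {book["title"]: book["root_section"] for book in books}
print(root_section_by_title["My Sample Handbook"])  # 25
```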

Detail

Parameters in URL:

Parameter Type Explanation
pk int The id of the book

HTTP responses:

Case HTTP Response
Successful 200 OK
Login credentials not accepted 401 Unauthorized
No such book available 404 Not Found

Example of response data if successful:

{
	"id": 7,
	"title": "Digital Signal Processing System Analysis and Design",
	"root_section": 1011
}

This is a single element of the list returned by the List API above.

TOC

Parameters in URL:

Parameter Type Explanation
pk int The id of the book

HTTP responses:

Case HTTP Response
Successful 200 OK
Login credentials not accepted 401 Unauthorized
No such book available 404 Not Found

Example of response data if successful:

{
	"title": "My Sample Handbook",
	"id": 25,
	"children": [
		{
			"title": "Chapter 1",
			"id": 26,
			"children": [
				{
					"title": "Section 1.1",
					"id": 27,
					"children": []
				},
				{
					"title": "Section 1.2",
					"id": 28,
					"children": []
				}
			]
		},
		{
			"title": "Chapter 2",
			"id": 29,
			"children": []
		}
	]
}

This is a nested, recursive JSON object that represents the table of contents tree. Each node in the tree has the following fields:

Field Type Explanation
title string The title of the section
id int The id of the section
children array An array of immediate children nodes of the current node
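
Because every node has the same three fields, the tree can be traversed with a short recursive function. The sketch below flattens the sample TOC into (depth, id, title) rows, parents before children; `walk_toc` is a hypothetical helper, not part of the project.

```python
def walk_toc(node, depth=0):
    """Yield (depth, id, title) for a TOC node and all its descendants, parents first."""
    yield depth, node["id"], node["title"]
    for child in node["children"]:
        yield from walk_toc(child, depth + 1)


# The example response above, as a Python dict.
toc = {
    "title": "My Sample Handbook", "id": 25, "children": [
        {"title": "Chapter 1", "id": 26, "children": [
            {"title": "Section 1.1", "id": 27, "children": []},
            {"title": "Section 1.2", "id": 28, "children": []},
        ]},
        {"title": "Chapter 2", "id": 29, "children": []},
    ],
}

for depth, section_id, title in walk_toc(toc):
    print("  " * depth + f"{title} (id={section_id})")
```

The (depth, id, title) rows are a convenient starting point for building labels in hierarchical text classification.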

Section

Endpoint | URL | Method | Permission
Detail | /detail/{pk}/ | GET | Any user
Children | /children/{pk}/ | GET | Any user
Versions | /versions/{pk}/ | GET | Any user
Partial TOC | /toc/{pk} | GET | Any user

Detail

Parameters in URL:

Parameter Type Explanation
pk int The id of the section

HTTP responses:

Case HTTP Response
Successful 200 OK
Login credentials not accepted 401 Unauthorized
No such section available 404 Not Found

Example of response data if successful:

{
	"title": "Section 1.1",
	"id": 27,
	"has_children": false
}

Children

Parameters in URL:

Parameter Type Explanation
pk int The id of the section

HTTP responses:

Case HTTP Response
Successful 200 OK
Login credentials not accepted 401 Unauthorized
No such section available 404 Not Found

The response is an array of objects in the same format as the Detail API (/detail/{pk}/) response, one for each child of the queried section.
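
Together with the `has_children` flag from the Detail API, the Children endpoint allows a client to walk the hierarchy one level at a time instead of downloading a whole TOC. The sketch below collects the leaf sections under a section; `fetch_children` is a hypothetical caller-supplied function that wraps a GET to /section/children/{pk}/.

```python
def leaf_sections(section, fetch_children):
    """Return the leaf sections under `section`, in depth-first order.

    `section` is a Detail object; `fetch_children(pk)` is a caller-supplied
    function (hypothetical) returning the Children response for that id.
    """
    if not section["has_children"]:
        return [section]
    leaves = []
    for child in fetch_children(section["id"]):
        leaves.extend(leaf_sections(child, fetch_children))
    return leaves


# Stand-in for the server: Children responses keyed by section id.
fake_children = {
    26: [
        {"title": "Section 1.1", "id": 27, "has_children": False},
        {"title": "Section 1.2", "id": 28, "has_children": False},
    ]
}
chapter = {"title": "Chapter 1", "id": 26, "has_children": True}
print([s["title"] for s in leaf_sections(chapter, fake_children.__getitem__)])
```

Passing the fetch function as a parameter keeps the traversal testable without a running server.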

Versions

Parameters in URL:

Parameter Type Explanation
pk int The id of the section

HTTP responses:

Case HTTP Response
Successful 200 OK
Login credentials not accepted 401 Unauthorized
No such section available 404 Not Found

Example of response data if successful:

[
	{
		"id": 1,
		"name": "Raw",
		"created_by": null,
		"timestamp": "2016-09-02 20:00:00"
	},
	{
		"id": 2,
		"name": "Cleaned text for machine",
		"created_by": "admin",
		"timestamp": "2016-09-02 21:00:00"
	}
]

The response is an array of all the versions associated with the section. See the Version API below for more details.
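
The Content API addresses versions by id, so a client usually needs to translate a version name into its id first. A minimal sketch, using the sample data above and a hypothetical helper name:

```python
def version_id_by_name(versions, name):
    """Return the id of the first version called `name`, or None if absent."""
    return next((v["id"] for v in versions if v["name"] == name), None)


# The example response above (JSON null becomes None in Python).
versions = [
    {"id": 1, "name": "Raw", "created_by": None, "timestamp": "2016-09-02 20:00:00"},
    {"id": 2, "name": "Cleaned text for machine", "created_by": "admin",
     "timestamp": "2016-09-02 21:00:00"},
]
print(version_id_by_name(versions, "Cleaned text for machine"))  # 2
```

Whether version names are unique is not stated here, so the helper simply returns the first match.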

Partial TOC from a Section

Parameters in URL:

Parameter Type Explanation
pk int The id of the section

HTTP responses:

Case HTTP Response
Successful 200 OK
Login credentials not accepted 401 Unauthorized
No such section available 404 Not Found

Example of response data if successful:

{
	"title": "Chapter 1",
	"id": 26,
	"children": [
		{
			"title": "Section 1.1",
			"id": 27,
			"children": []
		},
		{
			"title": "Section 1.2",
			"id": 28,
			"children": []
		}
	]
}

The response is the partial TOC starting from the queried section. For more details on TOC, see the TOC API (under Book) above.

Version

Endpoint | URL | Method | Permission
List | /list | GET | Any user
Detail | /detail/{pk}/ | GET | Any user
Create | /create/ | PUT | Any user
Update | /update/{pk}/ | POST | Version creator
Delete | /delete/{pk}/ | DELETE | Version creator

List

HTTP responses:

Case HTTP Response
Successful 200 OK
Login credentials not accepted 401 Unauthorized

Example of response data if successful:

[
	{
		"id": 1,
		"name": "Raw",
		"created_by": null,
		"timestamp": "2016-09-02 20:00:00"
	},
	{
		"id": 2,
		"name": "Cleaned text for machine",
		"created_by": "admin",
		"timestamp": "2016-09-02 21:00:00"
	}
]

The response is an array of "Version"s. The default version, "Raw", is included.

More details on the fields:

Field Type Explanation
id int The id of the version
name string The name of the version
created_by string The user id for the creator of the version
timestamp string The timestamp for version creation
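
Since `timestamp` arrives as a string, it has to be parsed before versions can be compared by age. The sketch below infers the format from the sample responses; the time zone is not stated in this document, so the result is a naive datetime. The helper names are illustrative.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # inferred from the sample responses


def created_at(version):
    """Parse a version's timestamp string into a naive datetime."""
    return datetime.strptime(version["timestamp"], TIMESTAMP_FORMAT)


def newest_first(versions):
    """Sort versions by creation time, most recent first."""
    return sorted(versions, key=created_at, reverse=True)


versions = [
    {"id": 1, "name": "Raw", "created_by": None, "timestamp": "2016-09-02 20:00:00"},
    {"id": 2, "name": "Cleaned text for machine", "created_by": "admin",
     "timestamp": "2016-09-02 21:00:00"},
]
print([v["name"] for v in newest_first(versions)])
```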

Detail

Parameters in URL:

Parameter Type Explanation
pk int The id of the version

HTTP responses:

Case HTTP Response
Successful 200 OK
Login credentials not accepted 401 Unauthorized
No such version available 404 Not Found

Example of response data if successful:

{
	"id": 1,
	"name": "Raw",
	"created_by": null,
	"timestamp": "2016-09-02 20:00:00"
}

The response is a single element of the array returned by the List API above.

Create

Request example:

{
	"name": "Cleaned text for coref"
}

HTTP responses:

Case HTTP Response
Successful 201 Created
Login credentials not accepted 401 Unauthorized

Example of response data if successful:

{
	"id": 3,
	"name": "Cleaned text for coref",
	"created_by": "admin",
	"timestamp": "2016-09-02 22:00:00"
}
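
Note that, according to the endpoint table, Create uses the PUT method. The sketch below builds such a request with the standard library without sending it; it assumes the server accepts a JSON body with a `Content-Type: application/json` header, and the function name and `base_url` parameter are hypothetical.

```python
import json
import urllib.request


def build_create_version_request(base_url, name, auth_header):
    """Build (but do not send) the PUT request that creates a version."""
    body = json.dumps({"name": name}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/version/create/",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json", "Authorization": auth_header},
    )


request = build_create_version_request(
    "http://127.0.0.1:8000", "Cleaned text for coref", "Basic <token>"
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; a 201 Created response then carries the new version, including its id.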

Update

Parameters in URL:

Parameter Type Explanation
pk int The id of the version

Request example:

{
	"name": "Another name"
}

HTTP responses:

Case HTTP Response
Successful 200 OK
Login credentials not accepted 401 Unauthorized
User is not the version creator 403 Forbidden
No such version available 404 Not Found

Example of response data if successful:

{
	"id": 3,
	"name": "Another name",
	"created_by": "admin",
	"timestamp": "2016-09-02 22:00:00"
}

Delete

Parameters in URL:

Parameter Type Explanation
pk int The id of the version

HTTP responses:

Case HTTP Response
Successful 204 No Content
Login credentials not accepted 401 Unauthorized
User is not the version creator 403 Forbidden
No such version available 404 Not Found

When a version is deleted, all contents associated with that version will be deleted as well. For more details on contents, see the Content API below.
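
Because deleting a version also removes its contents, a client may want to report the outcome explicitly. The sketch below maps the status codes from the table above to messages. `urllib.request.urlopen` raises `HTTPError` for 4xx responses, so both paths are handled. The function names and message texts are illustrative.

```python
import urllib.error
import urllib.request

DELETE_OUTCOMES = {
    204: "deleted",
    401: "login credentials not accepted",
    403: "user is not the version creator",
    404: "no such version available",
}


def describe_delete_status(status):
    """Translate an HTTP status code from the Delete API into a message."""
    return DELETE_OUTCOMES.get(status, f"unexpected status {status}")


def delete_version(url, auth_header):
    """Send the DELETE request and return a description of the outcome."""
    request = urllib.request.Request(
        url, method="DELETE", headers={"Authorization": auth_header}
    )
    try:
        with urllib.request.urlopen(request) as response:
            return describe_delete_status(response.status)
    except urllib.error.HTTPError as error:
        return describe_delete_status(error.code)
```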

Content

Endpoint | URL | Method | Permission
Immediate Text | /immediate/{section}/{version}/ | GET | Any user
Aggregate Text | /aggregate/{section}/{version}/ | GET | Any user
Post | /post/{section}/{version}/ | POST | Version creator

Immediate Text

Parameters in URL:

Parameter Type Explanation
section int The id of the section
(*optional) version int The id of the version. If no version is specified, the raw version will be chosen by default.

HTTP responses:

Case HTTP Response
Successful 200 OK
Login credentials not accepted 401 Unauthorized
No such section/version available 404 Not Found

Example of response data if successful:

The response body contains the plain text of the requested version of the section.
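
The optional `version` parameter changes the shape of the URL. The sketch below builds the path for both text endpoints; it assumes that omitting the version simply drops the last path segment, which this document does not state explicitly, and the function name is hypothetical.

```python
def content_path(kind, section, version=None, root="/api/v1/content"):
    """Build the path for the Immediate ('immediate') or Aggregate ('aggregate') Text API.

    Assumption: when no version is given, the {version} segment is left out.
    """
    if kind not in ("immediate", "aggregate"):
        raise ValueError(f"unknown content endpoint: {kind}")
    parts = [root, kind, str(section)]
    if version is not None:
        parts.append(str(version))
    return "/".join(parts) + "/"


print(content_path("immediate", 27))     # raw version by default
print(content_path("aggregate", 26, 2))  # explicit version id
```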

Aggregate Text

Parameters in URL:

Parameter Type Explanation
section int The id of the section
(*optional) version int The id of the version. If no version is specified, the raw version will be chosen by default.

HTTP responses:

Case HTTP Response
Successful 200 OK
Login credentials not accepted 401 Unauthorized
No such section/version available 404 Not Found

Example of response data if successful:

The response body contains the plain text of the requested version of the section and of all its descendants, in the original page order.
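
Conceptually, the aggregate text relates to the immediate text as a depth-first concatenation over the TOC. The sketch below illustrates that relationship on local data. It assumes that children are listed in page order and that texts are joined with a newline; neither detail is confirmed here, and the server may assemble the text differently.

```python
def aggregate_text(node, immediate_text):
    """Concatenate a section's own text with that of all its descendants.

    `node` is a TOC node; `immediate_text` maps section id -> immediate text.
    """
    parts = [immediate_text.get(node["id"], "")]
    for child in node["children"]:
        parts.append(aggregate_text(child, immediate_text))
    return "\n".join(part for part in parts if part)


chapter = {
    "title": "Chapter 1", "id": 26, "children": [
        {"title": "Section 1.1", "id": 27, "children": []},
        {"title": "Section 1.2", "id": 28, "children": []},
    ],
}
texts = {26: "Intro to chapter 1.", 27: "Text of 1.1.", 28: "Text of 1.2."}
print(aggregate_text(chapter, texts))
```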

Post

Parameters in URL:

Parameter Type Explanation
section int The id of the section
version int The id of the version

The body of the request contains the location of the text file holding the contents of a specific version of a section, e.g. output.txt.

HTTP responses:

Case HTTP Response
Successful 200 OK
Login credentials not accepted 401 Unauthorized
User is not the version creator 403 Forbidden
No such section/version available 404 Not Found
