[NEW] Python client-side library available here
PDF ebooks usually have a table of contents (TOC), so the text in a PDF ebook is naturally organized in a hierarchy. The text and hierarchy information extracted from PDF ebooks can be used for research in machine learning (especially hierarchical text classification) and natural language processing.
The goal of this project is to provide a tool to manage and maintain a huge database of texts extracted from PDF ebooks and allow users to access and upload versioned data through RESTful API endpoints. This framework enables easy sharing of text data and boosts collaborations among researchers in a team, or across different teams.
The purpose of this project is to provide a RESTful backend and an admin site to
- Organize and manage PDF ebooks
- Extract TOC hierarchy and section texts from PDF ebooks
- Store all data extracted from PDF ebooks in relational databases
- Access the TOC, text and more through a handful of RESTful APIs
- Post your own version of processed texts of a book/chapter and share it with other researchers
- (*Optional) clean up and lemmatize the extracted texts using the built-in cleaner.
- Adobe Acrobat (2015 DC or later) is required to pre-process the PDF files: delete useless pages/chapters, update/correct bookmarks, and convert the files into HTML files split by bookmarks. The PDF and HTML files will then be fed into the system.
- Adequate knowledge of Django is required to be able to maintain and further expand this project. Please make sure you understand Django well enough before reading the rest of this document.
- Due to copyright issues, we cannot distribute PDF ebooks or any processed text extracted from them. This project contains only the source code, without any database or files. You are expected to set up your own database and connect Django to it before running the system.
The built-in cleaner performs the following operations sequentially on the extracted texts:
- Perform known replacements (e.g. "ﬁ" -> "fi"; if you don't see any difference, try to copy the first "ﬁ" as TWO letters, "f" and "i". You will realize you can't, because "ﬁ" is actually ONE Unicode character).
- Fix broken line-joins (e.g. "fric- tion" -> "friction").
- Remove redundant newlines and spaces.
- Remove duplicate paragraphs.
- Remove other obvious noise generated by Acrobat Pro.
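The default steps above can be illustrated with a minimal sketch. This is not the project's cleaner: the replacement table, the regular expressions and the function name `clean_text` are all assumptions made for illustration, and the Acrobat-specific noise removal is left out because its rules are not described here.

```python
import re

# Hypothetical replacement table (ligature code points -> plain letters).
# The table used by the real cleaner is not shown in this document.
KNOWN_REPLACEMENTS = {"\ufb01": "fi", "\ufb02": "fl"}

# Heuristic: a lowercase letter, a hyphen, whitespace or a single line break,
# then another lowercase letter is treated as a broken line-join.
LINE_JOIN = re.compile(r"([a-z])-(?:[ \t]+|[ \t]*\n[ \t]*)([a-z])")


def clean_text(text: str) -> str:
    # 1. Known replacements, e.g. the one-character "fi" ligature.
    for old, new in KNOWN_REPLACEMENTS.items():
        text = text.replace(old, new)
    # 2. Fix broken line-joins: "fric- tion" -> "friction".
    text = LINE_JOIN.sub(r"\1\2", text)
    # 3. Collapse redundant spaces and newlines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    # 4. Drop duplicate paragraphs, keeping the first occurrence.
    seen, paragraphs = set(), []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if paragraph and paragraph not in seen:
            seen.add(paragraph)
            paragraphs.append(paragraph)
    return "\n\n".join(paragraphs)


sample = (
    "Static \ufb01eld and fric- tion.\n\n\n\n"
    "Static \ufb01eld and fric- tion.\n\n"
    "Second   paragraph.\n"
)
cleaned = clean_text(sample)
```

The order matters in this sketch: line-joins are repaired before whitespace is collapsed, so that a hyphen at the end of a line is still recognizable.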
The following built-in features in Cleaner are NOT enabled by default. You may manually customize the cleaning procedure by modifying the source code in the extractor package.
- Replace non-ascii characters with underscore ("_")
- Remove all URIs
- Remove all emails
- Remove stop words according to the MySQL stop word list
- Remove all punctuation marks
- Remove all digits
- Remove one-letter and two-letter words
- Lemmatize
- Install Python 3
- Clone the project
- Install Python dependencies:
  $ pip install -r requirements.txt
- Set up your own database and update the connection configuration in settings.py, then migrate.
- Run the development server:
  $ python3 manage.py runserver
- (*Optional) Download WordNet data for NLTK
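As a rough illustration of the database step, the connection block in settings.py could look like the sketch below. It assumes PostgreSQL, and the environment variable names and default values are hypothetical; any relational backend supported by Django can be substituted.

```python
import os

# Sketch of the DATABASES setting in settings.py.
# Assumptions: PostgreSQL backend; EBOOK_DB_* variable names are made up here.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("EBOOK_DB_NAME", "ebooks"),
        "USER": os.environ.get("EBOOK_DB_USER", "ebooks"),
        "PASSWORD": os.environ.get("EBOOK_DB_PASSWORD", ""),
        "HOST": os.environ.get("EBOOK_DB_HOST", "127.0.0.1"),
        "PORT": os.environ.get("EBOOK_DB_PORT", "5432"),
    }
}
```

Once the connection is configured, `python3 manage.py migrate` creates the tables.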
This is a standard Django admin site. It can be accessed at the URL <your-domain>/admin/. If you are running the default Django development server, the complete URL is http://127.0.0.1:8000/admin/.
You may need to create a superuser before you can log in to the admin site for the first time. This can be done with
$ python3 manage.py createsuperuser
If you don't know about this yet, please learn Django before trying the following steps. See Introduction.
Before creating any normal user, you must create a user group in Django Group model with ONLY the following permissions for each model:
- book | book | Can change book
- section | section | Can change section
- content | content | Can add content
- content | content | Can change content
- content | content | Can delete content
- version | version | Can add version
- version | version | Can change version
- version | version | Can delete version
When creating a normal user, tick "Staff status" and select the user group above. The new user can then access a special admin site with object-level permission control. Such a user (a "normal user") can only view the models and update/delete the entries they own (e.g. the versions/contents they created).
The following fields are required to create a new entry.
| Field | Explanation |
|---|---|
| Title | The title of the PDF book |
| Toc html path | The full path of the HTML file generated by Adobe Acrobat that contains the table of contents*. |
*NOTE: There must be a directory of the same name next to the HTML file. This is the default output format of Adobe Acrobat when converting PDF to HTML with the "Split by Bookmarks" option turned on.
After creating a Book entry,
- Go back to the model page /admin/book/book
- Tick the book that was newly created
- Select "Process book" in the "Action" dropdown menu
- Click "GO"
In most cases this takes from a few seconds to a few minutes, depending on the structure and length of the book and on the time complexity of the text cleaning techniques.
You may browse other pages, or add another book, once you have clicked "GO". You can always go back to the book list page and check the "is processed" flag to see whether the processing of a newly added book has completed. You can also follow the progress in the terminal where you started the Django server.
The docs and an emulated client are available at http://<your-domain>/docs/
Only authenticated users can access any of the APIs. By default, only Basic Authentication and Session Authentication are enabled in settings.py. However, you may customize the authentication process to include more authentication methods.
The root URL of all RESTful APIs is /api/v1 (e.g. the book-list API is at http://<your-domain>/api/v1/book/list/). There are four sub-groups of API endpoints: Book, Section, Version and Content.
| Group | URL |
|---|---|
| Book | /book |
| Section | /section |
| Version | /version |
| Content | /content |
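A small client-side sketch of how these URLs and Basic Authentication fit together, using only the Python standard library. The host is the development-server address, and the helper name `build_request` and the credentials are placeholders, not part of the project.

```python
import base64
import urllib.request

# Assumption: the API is served by the local development server.
API_ROOT = "http://127.0.0.1:8000/api/v1"


def build_request(group, endpoint, username, password, method="GET"):
    """Build (but do not send) an authenticated request for one endpoint."""
    url = f"{API_ROOT}/{group.strip('/')}/{endpoint.strip('/')}/"
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url, method=method)
    # HTTP Basic Authentication: "Basic " + base64("user:password").
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Accept", "application/json")
    return request


# Placeholder credentials for illustration only.
req = build_request("book", "list", "alice", "secret")
# Sending it would be: urllib.request.urlopen(req).read()
```

Building the request separately from sending it keeps the URL and header logic easy to check without a running server.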
| Endpoint | URL | Method | Permission |
|---|---|---|---|
| List | `/list/` | GET | Any user |
| Detail | `/detail/{pk}/` | GET | Any user |
| TOC | `/toc/{pk}/` | GET | Any user |
HTTP responses:
| Case | HTTP Response |
|---|---|
| Successful | 200 OK |
| Login credentials not accepted | 401 Unauthorized |
Example of response data if successful:
[
{
"id": 2,
"title": "My Sample Handbook",
"root_section": 25
},
{
"id": 3,
"title": "My Sample Textbook",
"root_section": 72
}
]
More details on the fields:
| Field | Type | Explanation |
|---|---|---|
| `id` | int | The id of the book |
| `title` | string | The title of the book |
| `root_section` | int | The id of the root section that represents the entire book |
Parameters in URL:
| Parameter | Type | Explanation |
|---|---|---|
| `pk` | int | The id of the book |
HTTP responses:
| Case | HTTP Response |
|---|---|
| Successful | 200 OK |
| Login credentials not accepted | 401 Unauthorized |
| No such book available | 404 Not Found |
Example of response data if successful:
{
"id": 7,
"title": "Digital Signal Processing System Analysis and Design",
"root_section": 1011
}
This is a single element of the list returned by the List API above.
Parameters in URL:
| Parameter | Type | Explanation |
|---|---|---|
| `pk` | int | The id of the book |
HTTP responses:
| Case | HTTP Response |
|---|---|
| Successful | 200 OK |
| Login credentials not accepted | 401 Unauthorized |
| No such book available | 404 Not Found |
Example of response data if successful:
{
"title": "My Sample Handbook",
"id": 25,
"children": [
{
"title": "Chapter 1",
"id": 26,
"children": [
{
"title": "Section 1.1",
"id": 27,
"children": []
},
{
"title": "Section 1.2",
"id": 28,
"children": []
}
]
},
{
"title": "Chapter 2",
"id": 29,
"children": []
}
]
}
This is a nested, recursive JSON object that represents the table of contents tree. Each node in the tree has the following fields:
| Field | Type | Explanation |
|---|---|---|
| `title` | string | The title of the section |
| `id` | int | The id of the section |
| `children` | array | An array of the immediate child nodes of the current node |
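Because every node has the same three fields, the tree can be walked with a short recursive generator. The sketch below flattens the example response into `(depth, id, title)` tuples; the function name `flatten_toc` is illustrative only.

```python
# The example TOC response from above, as a Python dict.
toc = {
    "title": "My Sample Handbook",
    "id": 25,
    "children": [
        {
            "title": "Chapter 1",
            "id": 26,
            "children": [
                {"title": "Section 1.1", "id": 27, "children": []},
                {"title": "Section 1.2", "id": 28, "children": []},
            ],
        },
        {"title": "Chapter 2", "id": 29, "children": []},
    ],
}


def flatten_toc(node, depth=0):
    """Yield (depth, id, title) for a node and all its descendants, depth-first."""
    yield depth, node["id"], node["title"]
    for child in node["children"]:
        yield from flatten_toc(child, depth + 1)


for depth, section_id, title in flatten_toc(toc):
    print("  " * depth + f"{title} (id={section_id})")
```

The depth value is handy as a class label level in hierarchical text classification, and the ids can be passed to the Section and Content endpoints.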
| Endpoint | URL | Method | Permission |
|---|---|---|---|
| Detail | `/detail/{pk}/` | GET | Any user |
| Children | `/children/{pk}/` | GET | Any user |
| Versions | `/versions/{pk}/` | GET | Any user |
| Partial TOC | `/toc/{pk}` | GET | Any user |
Parameters in URL:
| Parameter | Type | Explanation |
|---|---|---|
| `pk` | int | The id of the section |
HTTP responses:
| Case | HTTP Response |
|---|---|
| Successful | 200 OK |
| Login credentials not accepted | 401 Unauthorized |
| No such section available | 404 Not Found |
Example of response data if successful:
{
"title": "Section 1.1",
"id": 27,
"has_children": false
}
Parameters in URL:
| Parameter | Type | Explanation |
|---|---|---|
| `pk` | int | The id of the section |
HTTP responses:
| Case | HTTP Response |
|---|---|
| Successful | 200 OK |
| Login credentials not accepted | 401 Unauthorized |
| No such section available | 404 Not Found |
The response is an array of "Detail" objects, in the same format as returned by the /detail/{pk}/ API, one for each child of the queried section.
Parameters in URL:
| Parameter | Type | Explanation |
|---|---|---|
| `pk` | int | The id of the section |
HTTP responses:
| Case | HTTP Response |
|---|---|
| Successful | 200 OK |
| Login credentials not accepted | 401 Unauthorized |
| No such section available | 404 Not Found |
Example of response data if successful:
[
{
"id": 1,
"name": "Raw",
"created_by": null,
"timestamp": "2016-09-02 20:00:00"
},
{
"id": 2,
"name": "Cleaned text for machine",
"created_by": "admin",
"timestamp": "2016-09-02 21:00:00"
}
]
The response is an array of all the versions associated with the section. See the Version API below for more details.
Parameters in URL:
| Parameter | Type | Explanation |
|---|---|---|
| `pk` | int | The id of the section |
HTTP responses:
| Case | HTTP Response |
|---|---|
| Successful | 200 OK |
| Login credentials not accepted | 401 Unauthorized |
| No such section available | 404 Not Found |
Example of response data if successful:
{
"title": "Chapter 1",
"id": 26,
"children": [
{
"title": "Section 1.1",
"id": 27,
"children": []
},
{
"title": "Section 1.2",
"id": 28,
"children": []
}
]
}
The response is the partial TOC starting from the queried section. For more details on TOC, see the TOC API (under Book) above.
| Endpoint | URL | Method | Permission |
|---|---|---|---|
| List | `/list` | GET | Any user |
| Detail | `/detail/{pk}/` | GET | Any user |
| Create | `/create/` | PUT | Any user |
| Update | `/update/{pk}/` | POST | Version creator |
| Delete | `/delete/{pk}/` | DELETE | Version creator |
HTTP responses:
| Case | HTTP Response |
|---|---|
| Successful | 200 OK |
| Login credentials not accepted | 401 Unauthorized |
Example of response data if successful:
[
{
"id": 1,
"name": "Raw",
"created_by": null,
"timestamp": "2016-09-02 20:00:00"
},
{
"id": 2,
"name": "Cleaned text for machine",
"created_by": "admin",
"timestamp": "2016-09-02 21:00:00"
}
]
The response is an array of "Version"s. The default version, "Raw", is included.
More details on the fields:
| Field | Type | Explanation |
|---|---|---|
| `id` | int | The id of the version |
| `name` | string | The name of the version |
| `created_by` | string | The user id for the creator of the version |
| `timestamp` | string | The timestamp for version creation |
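A short sketch of working with these fields on the client side. It assumes the timestamp always follows the `YYYY-MM-DD HH:MM:SS` layout seen in the example, and that a `null` creator (as on "Raw") marks a version not created by a user; both are inferences from the example, and the function names are illustrative.

```python
from datetime import datetime

# The example response from the List API, with JSON null mapped to None.
versions = [
    {"id": 1, "name": "Raw", "created_by": None,
     "timestamp": "2016-09-02 20:00:00"},
    {"id": 2, "name": "Cleaned text for machine", "created_by": "admin",
     "timestamp": "2016-09-02 21:00:00"},
]

# Assumed timestamp layout, taken from the example values.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def latest_version(items):
    """Return the most recently created version."""
    return max(items, key=lambda v: datetime.strptime(v["timestamp"], TIMESTAMP_FORMAT))


def user_versions(items):
    """Return only the versions that have a creator."""
    return [v for v in items if v["created_by"] is not None]
```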
Parameters in URL:
| Parameter | Type | Explanation |
|---|---|---|
| `pk` | int | The id of the version |
HTTP responses:
| Case | HTTP Response |
|---|---|
| Successful | 200 OK |
| Login credentials not accepted | 401 Unauthorized |
| No such version available | 404 Not Found |
Response data if successful:
{
"id": 1,
"name": "Raw",
"created_by": null,
"timestamp": "2016-09-02 20:00:00"
}
The response is a single element of the array returned by the List API above.
Request example:
{
"name": "Cleaned text for coref"
}
HTTP responses:
| Case | HTTP Response |
|---|---|
| Successful | 201 Created |
| Login credentials not accepted | 401 Unauthorized |
Example of response data if successful:
{
"id": 3,
"name": "Cleaned text for coref",
"created_by": "admin",
"timestamp": "2016-09-02 22:00:00"
}
Parameters in URL:
| Parameter | Type | Explanation |
|---|---|---|
| `pk` | int | The id of the version |
Request example:
{
"name": "Another name"
}
HTTP responses:
| Case | HTTP Response |
|---|---|
| Successful | 200 OK |
| Login credentials not accepted | 401 Unauthorized |
| User is not the version creator | 403 Forbidden |
| No such version available | 404 Not Found |
Example of response data if successful:
{
"id": 3,
"name": "Another name",
"created_by": "admin",
"timestamp": "2016-09-02 22:00:00"
}
Parameters in URL:
| Parameter | Type | Explanation |
|---|---|---|
| `pk` | int | The id of the version |
HTTP responses:
| Case | HTTP Response |
|---|---|
| Successful | 204 No Content |
| Login credentials not accepted | 401 Unauthorized |
| User is not the version creator | 403 Forbidden |
| No such version available | 404 Not Found |
When a version is deleted, all contents associated with that version will be deleted as well. For more details on contents, see the Content API below.
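The create, update and delete calls differ only in HTTP method, path and body, which the following sketch makes explicit. The host, the helper name `version_request` and the JSON content type are assumptions; authentication headers, which every call needs, are left out to keep the example short.

```python
import json
import urllib.request

# Assumption: the API is served by the local development server.
VERSION_ROOT = "http://127.0.0.1:8000/api/v1/version"


def version_request(action, pk=None, name=None):
    """Build (but do not send) a request for one Version endpoint."""
    routes = {
        "create": ("PUT", "/create/"),            # expects 201 Created
        "update": ("POST", f"/update/{pk}/"),     # expects 200 OK
        "delete": ("DELETE", f"/delete/{pk}/"),   # expects 204 No Content
    }
    method, path = routes[action]
    body = None if name is None else json.dumps({"name": name}).encode("utf-8")
    request = urllib.request.Request(VERSION_ROOT + path, data=body, method=method)
    if body is not None:
        # Assumed content type; the request examples above are JSON.
        request.add_header("Content-Type", "application/json")
    return request


create = version_request("create", name="Cleaned text for coref")
update = version_request("update", pk=3, name="Another name")
delete = version_request("delete", pk=3)
```

Note that, per the endpoint table, creation uses PUT and update uses POST, which is the reverse of what many REST APIs do.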
| Endpoint | URL | Method | Permission |
|---|---|---|---|
| Immediate Text | `/immediate/{section}/{version}/` | GET | Any user |
| Aggregate Text | `/aggregate/{section}/{version}/` | GET | Any user |
| Post | `/post/{section}/{version}/` | POST | Version creator |
Parameters in URL:
| Parameter | Type | Explanation |
|---|---|---|
| `section` | int | The id of the section |
| (*optional) `version` | int | The id of the version. If no version is specified, the raw version will be chosen by default. |
HTTP responses:
| Case | HTTP Response |
|---|---|
| Successful | 200 OK |
| Login credentials not accepted | 401 Unauthorized |
| No such section/version available | 404 Not Found |
Example of response data if successful:
The response body contains the clear text of a certain version of the section.
Parameters in URL:
| Parameter | Type | Explanation |
|---|---|---|
| `section` | int | The id of the section |
| (*optional) `version` | int | The id of the version. If no version is specified, the raw version will be chosen by default. |
HTTP responses:
| Case | HTTP Response |
|---|---|
| Successful | 200 OK |
| Login credentials not accepted | 401 Unauthorized |
| No such section/version available | 404 Not Found |
Example of response data if successful:
The response body contains the clear text of a certain version of the section,
and ALL ITS DESCENDANTS, in the original page order.
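The relationship between the immediate and the aggregate text can be pictured as a depth-first concatenation over the TOC. The sketch below is only a client-side model of that idea, not the server's implementation: it assumes that the order of `children` in the TOC matches the page order, uses a newline as separator, and works on made-up texts.

```python
# Partial TOC for "Chapter 1", as in the Partial TOC example above.
chapter = {
    "title": "Chapter 1",
    "id": 26,
    "children": [
        {"title": "Section 1.1", "id": 27, "children": []},
        {"title": "Section 1.2", "id": 28, "children": []},
    ],
}

# Hypothetical immediate texts, keyed by section id.
immediate_text = {
    26: "Chapter intro.",
    27: "Text of 1.1.",
    28: "Text of 1.2.",
}


def aggregate_text(node, texts):
    """Join a section's own text with the text of all its descendants."""
    parts = [texts.get(node["id"], "")]
    for child in node["children"]:
        parts.append(aggregate_text(child, texts))
    # Skip empty pieces so sections without text add no blank lines.
    return "\n".join(part for part in parts if part)


combined = aggregate_text(chapter, immediate_text)
```

In practice a single call to the Aggregate Text endpoint replaces this loop; the model is mainly useful for deciding which of the two endpoints a task needs.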
Parameters in URL:
| Parameter | Type | Explanation |
|---|---|---|
| `section` | int | The id of the section |
| `version` | int | The id of the version |
The body of the request contains the location of the text file that holds the content of a specific version of a section, e.g. output.txt
HTTP responses:
| Case | HTTP Response |
|---|---|
| Successful | 200 OK |
| Login credentials not accepted | 401 Unauthorized |
| User is not the version creator | 403 Forbidden |
| No such section/version available | 404 Not Found |