
capstone

A set of tools to generate and label datasets from academic papers

Capstone Dataset Tools

Status

Building: CircleCI

Coverage: codebeat

What does it do?

Prerequisites

Install LaTeXML; instructions can be found here

Install the Python dependencies: pip3 install -r requirements.txt

How to convert tex files

Create a file named meta.json inside the tex folder to indicate the entry point. It has the following format:

{
  "tex_filename": "main.tex"
}

Put the unzipped tex folder under data/tex_files

Then go to the root directory of the project and run python3 convert.py

Once the conversion completes, the output HTML files are placed in data/html_files
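The meta.json step above can also be scripted when there are many papers to prepare. The following is a minimal sketch, not part of the project: the helper name write_meta is hypothetical, and it assumes each paper sits in its own folder containing the entry-point file.

```python
import json
from pathlib import Path


def write_meta(tex_folder, entry_point="main.tex"):
    """Write meta.json naming the entry-point .tex file of one paper folder.

    Hypothetical helper, not part of the project. It only produces the
    {"tex_filename": ...} format shown above.
    """
    folder = Path(tex_folder)
    # Fail early if the named entry point is not actually in the folder.
    if not (folder / entry_point).is_file():
        raise FileNotFoundError(f"{entry_point} not found in {folder}")
    meta_path = folder / "meta.json"
    meta_path.write_text(json.dumps({"tex_filename": entry_point}, indent=2) + "\n")
    return meta_path
```

For a paper unzipped to data/tex_files/my_paper (an example folder name), calling write_meta("data/tex_files/my_paper") would create the meta.json file before running python3 convert.py.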

How to label data

Go to the root directory of the project and run ./run_tools.sh

The labeling tool then opens automatically in your default browser (Chrome recommended).

Search finds all sentences that contain the chosen symbol within the chosen document.

In the search results, you can edit the labels as strings. Three options save the changes to a JSON file. “Save to overall” saves them to the overall JSON file, which holds many documents. “Save to separate” saves them to a separate file named data/outputs/[document_name]_[symbol_expression].json. “Save to both” does both at the same time.
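To locate a "Save to separate" file from a script, the naming pattern above can be reproduced as follows. This is only a sketch under assumptions: the function names are hypothetical, the shape of the labels object is not documented here, and the tool may treat special characters in the symbol expression differently.

```python
import json
from pathlib import Path


def separate_output_path(document_name, symbol_expression, out_dir="data/outputs"):
    """Build data/outputs/[document_name]_[symbol_expression].json (hypothetical helper)."""
    return Path(out_dir) / f"{document_name}_{symbol_expression}.json"


def load_separate_labels(document_name, symbol_expression, out_dir="data/outputs"):
    """Load one separate label file as parsed JSON.

    The structure of the returned object depends on the labeling tool
    and is not assumed here.
    """
    path = separate_output_path(document_name, symbol_expression, out_dir)
    return json.loads(path.read_text())
```

For a document named paper1 and the symbol x (both example values), separate_output_path("paper1", "x") points at data/outputs/paper1_x.json.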