All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Make crowsetta.SimpleSeq.from_file work with an "empty" csv file, one that has no annotated segments (e.g. because no audio was above threshold for segmenting) #280. Fixes #264.
- Add default_label argument to crowsetta.SimpleSeq.from_file that will add labels to segments in a csv file if there are none #280. Fixes #271.
- Add example csv file from Jourjine et al. 2023 dataset #280. Fixes #274.
- Add how-to showing how to work with unannotated segmentation in a csv file, using the csv from the Jourjine et al. 2023 dataset #280. Fixes #275.
- Rename crowsetta.data to crowsetta.examples, simplify how crowsetta.example works (to be more like vocalpy.example) and give all the example annotation files more specific names so that we can have multiple examples per annotation format #280. Fixes #278.
- Import format classes at package level, so we can just type e.g. crowsetta.SimpleSeq instead of crowsetta.formats.seq.SimpleSeq ("flat is better than nested") #280. Fixes #273.
- Make crowsetta.SimpleSeq.from_file use columns_map arg to rename only columns whose names are keys in the supplied dict, and ignore other columns in the csv file #280. Fixes #272.
- Replace deprecated pandera.SchemaModel with DataFrameModel, fixes AttributeError on import after new install #266. Fixes #265.
- Change range of Python version supported from 3.9-3.11 to 3.10-3.12 #268. Fixes #267.
- Vendor code from evfuncs and birdsong-recognition-dataset packages, to reduce the number of dependencies and make sure the code is maintained #263. Fixes #262.
- Fix bug in "generic-seq" format; use validated dataframe returned by pandera schema, so that "label" column is coerced to strings #258. Fixes #257.
- Add information on contributing and setting up a development environment #212. Fixes #30.
- Add method to convert generic sequence format to a pandas DataFrame #216.
- Add additional vignettes to docs: on removing "silent" labels from TextGrid annotations, on converting to the simple sequence and generic sequence formats #216. Fixes #152 and #197.
- Add format class for Audacity extended label track format #226. Fixes #222 and #213.
- Add the ability for a crowsetta.Annotation to have multiple sequences #243. Fixes #42.
- Rewrite TextGrid class to better handle file formats: parse both "short" and default format in either UTF-8 or UTF-16 encoding; remove empty intervals from interval tiers by default; can convert multiple interval tiers to a single crowsetta.Annotation with multiple crowsetta.Sequences #243. Fixes #241
- Revise landing page of docs, and some vignettes. Make other changes to clean up the docs build process #216.
- Coerce path-like attributes of GenericSeq dataframe schema to be strings. This helps ensure these columns are always native Pandas types #237.
- Fix how the crowsetta.Segment class converts onset sample and offset sample to int; correctly handle multiple numpy integer subtypes #238.
- c6ba100 Fix description and uri in pyproject.toml and crowsetta/about.py
- f70828f Make README images link to raw GitHub files so they render on PyPI
- add Raven format #164. Fixes #84.
- add example data #180. Fixes #90.
- add examples to docstrings, using example data #180. Fixes #158.
- import register_format at top level of package, to be able to just write @crowsetta.register_format #181. Fixes #177.
- add 'aud-txt' format, for Audacity standard LabelTracks exported to .txt files #183. Fixes #96.
- add ability to extract example data to local file system; avoids need to use context manager returned by importlib.resources to access the example data files. #185. Fixes #184.
- add logo #198. Fixes #17.
- change Annotation class to represent both sequence-like annotation formats and bounding box-like annotation formats #164. Resolves #149 and #150.
- re-design API, and rewrite annotation formats as classes #161.
  - Re-writing as classes fixes #99.
  - API re-design fixes #120.
  - Adds an interface sub-package that specifies an interface for two types of annotations: sequence-like and bounding-box like. Fixes #105.
  - All existing annotation formats were sequence-like, and they now adhere to that interface; the classes are registered as sub-classes.
  - Formats themselves are now in a formats sub-package, fixes #109.
  - Add better functions to list the formats in this sub-package (fixes #92); can call crowsetta.formats.as_list to get a list of shorthand string names, and crowsetta.formats.by_name with the shorthand string name to get back the corresponding class.
  - Transcriber.from_file now returns an instance of an annotation format class. Methods like to_annot can be called on this instance. This refactor greatly simplifies the Transcriber class while maintaining mostly the same API (now need to chain calls like Transcriber.from_file().to_annot(), or capture the returned annotation instance in a variable and use it instead). Fixes #144.
- convert docs to markdown and use myst-parser #153. Fixes #151.
- require Python >= 3.8 to adhere to NEP-29 #168. Fixes #166.
- rename Annotation.audio_path attribute to notated_path to be more general, e.g., because annotations can also annotate a spectrogram #169. Fixes #148.
- rename onset_ind and offset_ind to onset_sample and offset_sample for clarity #174. Fixes #156.
- rename first parameter of from_file method for all format classes to annot_path for consistency. #182. Fixes #178.
- Revise documentation #191. Fixes #152 as well as #21, #35, #138, and #157.
- have formats.as_list return list sorted (i.e., alphabetically) #194. Fixes #187.
- fix crowsetta.formats.register_format function added in #161 and rewrite example custom annotation formats to use it #176. Fixes #119.
- remove Stack class -- was not being used #172. Fixes #170.
- remove deprecated 'csv' format that was replaced by 'generic-seq' #173. Fixes #171.
- remove Meta class -- no longer used #193. Fixes #190.
- change format names 'simple-csv' and 'csv' to 'simple-seq' and 'generic-seq'. The goal is for 'simple-seq' to eventually work on other file formats, e.g. .txt, and for 'csv' to be the "generic" sequence format that allows for converting between others. #140. Fixes #133.
- deprecate the name 'csv' for the 'generic-seq' format; a FutureWarning is raised when creating a Transcriber with format='csv'. #143. Fixes #141.
- switch to using nox for development, instead of make #137. Fixes #132.
- change dependency / format name koumura to birdsong-recognition-dataset because package was renamed #126. Fixes #124.
- switch to using flit to build / publish. Remove poetry. #127. Fixes #125.
- move textgrid package into sub-package _vendor, since flit only works with a single top-level package. #127. This is the approach pip takes, as discussed on pypa/flit#497.
- rename attributes / variables onsets_Hz and offsets_Hz to onset_inds and offset_inds #128. Fixes #87.
- rename function crowsetta.validation._parse_file to validate_ext #129. Fixes #123.
- add a CITATION.cff file #103.
- add 'yarden' format, that parses the .mat files saved by SongAnnotationGUI, and is used with the canary song dataset that accompanies the tweetynet paper. #122. Fixes #121.
- rewrite tests to use pytest #106. Fixes #89.
- change compatible Python versions to >3.6 and <3.10 #111.
- switch from using Make to using nox for development tasks #137. As suggested by Scikit-HEP. Fixes #132.
- Fix .TextGrid and .phn docstrings that referred to ".not.mat files" #118.
- add missing packages to pyproject.toml so that textgrid is included in build 857ba09
- add metadata to pyproject.toml so that README is used as "long description" and appears on PyPI e8b8209
- switch to using poetry for development #79
- raise minimum version of evfuncs to 0.3.1 #79
- raise minimum version of koumura to 0.2.0 #79
- change to using GitHub Actions for continuous integration #83
- fix dependencies and Python so they are not pinned to major version #83
- fix phn2annot function so it works with .PHN and .WAV files found in some versions of TIMIT dataset #75
  - needed to make extension checking case-insensitive, see issue #68
  - and also switch to soundfile library to be able to parse the specific NIST format of .WAV files
- add missing comma in ENTRY_POINTS in setup.py so that built-in formats are properly installed 599149f
- change name of Transcriber parameter annot_format to just format #64
- change name of Annotation attributes annot_file and audio_file to annot_path and audio_path, for clarity and to match what's used in the vak library #65
- add phn module that parses .phn files from TIMIT dataset #59
- change types of Annotation attributes annot_file and audio_path from str (string) to pathlib.Path, to fix errors raised when passing in Path objects (because the attribute validator requires a string), and because it's preferable to work with Path objects over strings #52
- change default value for koumura2annot parameter wavpath so that the function will work regardless of current working directory for user, instead of requiring them to be in the parent directory of the .wav files that wavpath refers to #53
- fixed error that koumura2annot function threw when annot_file was a pathlib.Path and not a string #53
- modify functions for .not.mat annotation files (created by evsonganaly GUI) so they do not require other files such as .rec files (created by evTAF data acquisition program)
  - notmat.notmat2annot no longer looks for .rec files, which it used to get the sampling rate and convert onsets and offsets from seconds to Hz
  - the make_notmat for creating .not.mat files from Annotations also now expects onsets and offsets in seconds, not Hz.
    - the idea being that one can go from .not.mat to Annotation and back without doing any extra conversion. If user needs conversion to Hz for some other reason they can do this using the Annotation
- add Annotation class
  - which has 'audio_file' and 'annot_file' attributes, along with 'seq' attribute
- rewrite everything centered around Annotation class
  - meaning Sequence and Segment lose their redundant 'file' attributes and all format modules convert to and from Annotations and so does the csv module
- single-source version
  - now found in an __about__.py file in src/crowsetta that is used by setup.py.
- segments property of a Sequence is a tuple, not a list, so that class is immutable + hashable
- __hash__ implementation for Sequence class
  - convert attributes that are numpy.ndarrays into tuples before hashing
- tests for Sequence
  - no longer assert that calling __hash__ raises NotImplementedError
  - test that segments attribute is a tuple not a list
- implement hashing and equality for Sequence class
  - this makes it possible to use with concurrency, e.g. with the Dask library
- entry point group crowsetta.format to make it possible to 'install' formats
  - removes special casing for built-in formats, they just get added via entry point
  - instead of parsing a config.json file built into the package
- module for working with Praat Textgrid format
- Meta class which represents metadata about a format
  - such as file extension associated with it
  - and the module / functions that a Transcriber instance should use to work with this format
- Each instance of Transcriber has only one vocal annotation format that it handles
  - because it's annoying to type file_format every time you call a method like to_seq
  - instead you just make an instance of Transcriber for each format you want
  - This also works better with crowsetta.format entry points and Meta class; when you instantiate a Transcriber for a given voc_format, the __init__ uses the Meta for that format to figure out which function to use for to_seq, to_csv, etc.
- For this reason bumping to 1.0.0, new Transcriber not backwards compatible
  - although this will be inconvenient for millions of people
- Sequence instances have attributes: labels, onsets_s, offsets_s, onset_inds, offset_inds, and file.
- Explanation of default to_csv function for user formats in howto-user-config.
- Sequence class totally re-written
- no longer attrs-based
- because of somewhat complicated logic for validating arguments that was necessary in __init__ (to prevent user from creating a 'bad' instance.)
- Sequences are immutable. Idea is they are just connectors between annotation and whatever user needs to do with it so you shouldn't need to change any attribute values after loading annotation
- Segment also immutable (by setting frozen=True in call to attr.s decorator)
- Transcriber.__init__ uses config.json instead of config.ini to read defaults
- this makes init logic more readable since we don't have to convert user_config dict to strings and then back again; default config just loads as a dict from the .json file and we add the user_config dicts to it
- data module that downloads small example datasets for each annotation format
  - includes formats function that is imported at package level and prints formats built in to crowsetta
- to_seq_func_to_csv that takes a yourformat2seq function and returns a function that will convert the same format to csv files (just a wrapper around your function and seq2csv)
- for docs, Makefile that generates ./notebooks folder from ./doc/notebooks
- major revamp of docs
- config_dicts for user_config arg of Transcriber.__init__ only require module and to_seq keys; to_csv and to_format are optional, and can be specified as Python None or a string 'None'
- Transcriber raises NotImplemented error when to_csv or to_format are None for a specified format (instead of crashing mysteriously)
- seq2csv and csv2seq can deal with None values for one pair of onsets and offsets
- fix failing tests
- Segment class, attrs-based
  - has asdict method (wrapper around attrs function)
  - has class variable _FIELDS which is used in any place where we need to know how to go from Segment attributes to rows of a csv file, e.g. in src/crowsetta/csv.py and in tests
- Sequence class is now attrs-based, has factory functions, is itself just a list of Segments
  - now has to_dict method
- Crowsetta class is now called Transcriber
- add Crowsetta class with simple interface for converting any annotation to
- add ability to work with user-defined functions
  - user passes an extra_config dict when instantiating Crowsetta
- add docs
- change package name to Crowsetta
- change function names so they are all 'format2seq' or 'format2csv' or 'toformat' for consistency
- Initial version after excising from hvc (https://github.com/NickleDave/hybrid-vocal-classifier/blob/master/hvc/utils/annotation.py)
- Convert tests to Python unittest format (instead of using PyTest library)
- Write README.md with usage