PHP Natural Language Processing Tools
17 Feb 2018
Classification Models
Topic Modeling
LDA is still experimental and quite slow, but it works. See an example.
Clustering
- K-Means
- Hierarchical Agglomerative Clustering
- SingleLink
- CompleteLink
- GroupAverage
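
The three merge strategies listed above differ only in how they score the distance between two clusters. The following is a minimal sketch in plain PHP, not the library's own classes: `clusterDistance` and the 1-D sample points are hypothetical, and "average" is taken here as the mean over all cross-cluster pairs, which is one common definition and may differ from what GroupAverage does internally.

```php
<?php
// Hypothetical helper (not part of the library): distance between two
// clusters of 1-D points under a given linkage strategy.
function clusterDistance(array $a, array $b, string $linkage): float
{
    // Collect the distance of every cross-cluster pair.
    $dists = [];
    foreach ($a as $x) {
        foreach ($b as $y) {
            $dists[] = abs($x - $y);
        }
    }

    switch ($linkage) {
        case 'single':   // closest pair of points
            return (float) min($dists);
        case 'complete': // farthest pair of points
            return (float) max($dists);
        case 'average':  // mean over all cross-cluster pairs
            return array_sum($dists) / count($dists);
    }
    throw new InvalidArgumentException("Unknown linkage: $linkage");
}

$a = [1.0, 2.0];
$b = [4.0, 8.0];
printf("single:   %.1f\n", clusterDistance($a, $b, 'single'));   // 2.0
printf("complete: %.1f\n", clusterDistance($a, $b, 'complete')); // 7.0
printf("average:  %.1f\n", clusterDistance($a, $b, 'average'));  // 4.5
```

At each step, agglomerative clustering merges the two clusters with the smallest such distance, so the choice of strategy changes which clusters get merged first.
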
Tokenizers
- WhitespaceTokenizer
- WhitespaceAndPunctuationTokenizer
- PennTreebankTokenizer
- RegexTokenizer
- ClassifierBasedTokenizer allows building much more complex tokenizers than the ones above.
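
To illustrate the difference between the two simplest tokenizers in the list, here is a rough sketch in plain PHP. The function names and regular expressions are assumptions made for this example and are not the library's implementation, which may treat Unicode and punctuation differently.

```php
<?php
// Hypothetical stand-ins for the two simplest tokenizers; not the library's code.

function whitespaceTokenize(string $text): array
{
    // Split on runs of whitespace and drop empty strings.
    return preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

function whitespaceAndPunctuationTokenize(string $text): array
{
    // A token is either a run of word characters or a run of
    // characters that are neither word characters nor whitespace.
    preg_match_all('/\w+|[^\w\s]+/u', $text, $matches);
    return $matches[0];
}

$text = "Hello, world! It's here.";

// Punctuation stays attached: "Hello,", "world!", "It's", "here."
print_r(whitespaceTokenize($text));

// Punctuation becomes its own token: "Hello", ",", "world", "!", "It", "'", "s", "here", "."
print_r(whitespaceAndPunctuationTokenize($text));
```
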
Documents
- TokensDocument represents a bag-of-words model for a document.
- WordDocument represents a single word within the context of a larger document.
- TrainingDocument represents a document whose class is known.
- TrainingSet represents a collection of TrainingDocuments.
Feature factories
- FunctionFeatures allows the creation of a feature factory from a number of callables.
- DataAsFeatures simply returns the data as features.
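
The idea of building a feature factory from callables can be sketched as follows. This is plain PHP written for illustration only: `makeFeatureFactory`, the callable signature `($class, $tokens)`, and the `"class ^ token"` feature naming are all assumptions, not the FunctionFeatures API.

```php
<?php
// Hypothetical sketch: combine several callables into one feature extractor.
// Each callable receives the class label and the document tokens and returns
// a feature name, a list of feature names, or null for "no feature".
function makeFeatureFactory(array $callables): callable
{
    return function (string $class, array $tokens) use ($callables): array {
        $features = [];
        foreach ($callables as $f) {
            $result = $f($class, $tokens);
            if ($result === null) {
                continue; // this callable produced nothing for the document
            }
            // A single string is treated as a one-element list.
            foreach ((array) $result as $name) {
                $features[] = $name;
            }
        }
        return $features;
    };
}

$factory = makeFeatureFactory([
    // One feature per token, conjoined with the class.
    function ($class, $tokens) {
        return array_map(function ($t) use ($class) {
            return "$class ^ $t";
        }, $tokens);
    },
    // A single feature that fires only for documents with more than two tokens.
    function ($class, $tokens) {
        return count($tokens) > 2 ? "$class ^ long" : null;
    },
]);

// Yields: "spam ^ buy", "spam ^ now", "spam ^ cheap", "spam ^ long"
print_r($factory('spam', ['buy', 'now', 'cheap']));
```

Keeping each feature in its own small callable makes it easy to add or remove features without touching the rest of the pipeline.
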
Similarity
Stemmers
Optimizers (MaxEnt only)
- A gradient descent optimizer written in PHP for educational use. It is a simple implementation for anyone who wants to learn a bit more about gradient descent (GD) or MaxEnt models.
- A fast (faster than nltk-scipy), parallel gradient descent optimizer written in Go. This optimizer lives in a separate repository and is used via the external optimizer. TODO: At least write a readme for the optimizer written in Go.
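
For readers new to gradient descent, the core loop looks roughly like the sketch below. It minimizes a toy quadratic rather than a MaxEnt likelihood, and the function name, learning rate, and stopping rule are assumptions chosen for this example; the library's optimizers are not shown here.

```php
<?php
// Generic 1-D gradient descent (illustrative only, not the library's optimizer).
// $gradient returns the derivative of the objective at a given point.
function gradientDescent(
    callable $gradient,
    float $start,
    float $learningRate = 0.1,
    float $precision = 1e-6,
    int $maxIter = 10000
): float {
    $x = $start;
    for ($i = 0; $i < $maxIter; $i++) {
        // Move against the gradient, scaled by the learning rate.
        $step = $learningRate * $gradient($x);
        $x -= $step;
        // Stop once the updates become negligibly small.
        if (abs($step) < $precision) {
            break;
        }
    }
    return $x;
}

// Toy objective f(x) = (x - 3)^2 with derivative f'(x) = 2 (x - 3);
// its minimum is at x = 3.
$min = gradientDescent(function ($x) {
    return 2 * ($x - 3);
}, 0.0);

printf("%.4f\n", $min);
```

A MaxEnt optimizer follows the same pattern, except that the parameter is a vector of feature weights and the gradient comes from the model's likelihood over the training set.
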
Other
- Idf computes inverse document frequency
- Stop words
- Language based normalizers
- Classifier based transformation for creating flexible preprocessing pipelines
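
As a rough illustration of the inverse document frequency item above, the sketch below computes idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. This is the textbook formula in plain PHP; the function name is hypothetical, and the library's Idf class may apply smoothing or a different default.

```php
<?php
// Hypothetical helper: textbook inverse document frequency over tokenized documents.
function inverseDocumentFrequency(array $documents): array
{
    $n = count($documents);

    // Document frequency: in how many documents does each term occur?
    $df = [];
    foreach ($documents as $tokens) {
        foreach (array_unique($tokens) as $term) {
            $df[$term] = ($df[$term] ?? 0) + 1;
        }
    }

    // idf(t) = log(N / df(t)), using the natural logarithm.
    $idf = [];
    foreach ($df as $term => $count) {
        $idf[$term] = log($n / $count);
    }
    return $idf;
}

$idf = inverseDocumentFrequency([
    ['the', 'cat'],
    ['the', 'dog'],
    ['the', 'the', 'bird'],
]);

// "the" occurs in every document, so its idf is 0;
// "cat", "dog" and "bird" each occur in one of three documents, so their idf is log(3).
print_r($idf);
```

Terms that appear everywhere, such as stop words, receive a weight near zero, which is why idf is often combined with stop word removal in preprocessing.
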