A HIGHLY ADVANCED CONTENT ANALYSIS AND TEXT-MINING SOFTWARE WITH UNMATCHED HANDLING AND ANALYSIS CAPABILITIES
WordStat is flexible, easy-to-use text analysis software – whether you need text mining tools for fast extraction of themes and trends, or careful and precise measurement with state-of-the-art quantitative content analysis tools. WordStat’s seamless integration with SimStat – our statistical data analysis tool – and QDA Miner – our qualitative data analysis software – gives you unprecedented flexibility for analyzing text and relating its content to structured information, including numerical and categorical data.
WHAT IS IT USED FOR?
WordStat can be used by anyone who needs to quickly extract and analyze information from large amounts of documents. Our content analysis and text mining software is used for:
• Content analysis of open-ended responses, interview or focus group transcripts
• Business intelligence and competitive website analysis
• Information extraction and knowledge discovery from incident reports, customer complaints
• Content analysis of news coverage or scientific literature
• Automatic tagging and classification of documents
• Fraud detection, authorship attribution, patent analysis
• Taxonomy development and validation
KEY AND UNIQUE FEATURES
|Powerful CONTENT ANALYSIS AND TEXT MINING SOFTWARE for handling large amounts of unstructured information. WordStat can process up to 20 million words per minute and identify all references to user-defined concepts using categorization dictionaries.|
|Integrated EXPLORATORY TEXT MINING AND VISUALIZATION TOOLS such as clustering, multidimensional scaling, proximity plots, and more, to quickly extract themes and automatically identify patterns.|
|RELATES UNSTRUCTURED TEXT WITH STRUCTURED DATA such as dates, numbers or categorical data for identifying temporal trends or differences between subgroups, or for assessing relationships with ratings or other kinds of categorical or numerical data.|
|Use existing or create your own HIERARCHICAL CONTENT ANALYSIS DICTIONARIES OR TAXONOMIES composed of words, word patterns, phrases as well as proximity rules (such as NEAR, AFTER, BEFORE) for achieving precise measurement of concepts.|
|Truly unique COMPUTER ASSISTANCE FOR DICTIONARY BUILDING with tools for extracting common phrases and technical terms and for quickly identifying misspellings, synonyms, antonyms and related words in your text collection.|
|One-click access to KEYWORD-IN-CONTEXT AND KEYWORD RETRIEVAL TOOLS for easy identification and coding of relevant text segments, validation of content analysis dictionaries, word-sense disambiguation or for drilling down to the source documents.|
|Seamless integration with a state-of-the-art QUALITATIVE CODING TOOL (QDA Miner) allows more precise exploration of data or more in-depth analysis of specific documents or extracted text segments when needed.|
|MACHINE LEARNING FOR AUTOMATIC DOCUMENT CLASSIFICATION using Naive Bayes and K-Nearest Neighbours algorithms with automatic feature selection and validation tools. Classification models may then be saved to disk and reapplied to new data.|
|Easy IMPORTATION of databases, spreadsheets and documents (including PDF and HTML) as well as EXPORTATION of text analysis results to common industry file formats (Excel, SPSS, ASCII, HTML, XML, MS Word) and graphs (PNG, BMP and JPEG).|
LIST OF FEATURES
TEXT PROCESSING CAPABILITIES
- Content analysis on collections of ANSI or RTF documents (several MB each) and short alphanumeric variables (up to 255 characters).
- Dictionary moderated lemmatization and stemming (English, French, Italian, German and Spanish; contact us for other languages).
- Ability to call an external text pre-processing EXE or DLL (a sample English Porter stemmer and an n-grams transformation are included).
- Optional exclusion of pronouns, conjunctions, etc., by the use of user-defined exclusion lists (or stop lists).
- Categorization of words or phrases using existing or user-defined dictionaries.
- Word categorization based on Boolean (AND, OR, NOT) and proximity rules (NEAR, AFTER, BEFORE)
- Word and phrase substitution and scoring using wildcards and weighting.
- Frequency analysis on keywords, phrases, derived categories or concepts, or user-defined codes entered manually within a text.
- Interactive development and easy maintenance of hierarchical dictionaries, taxonomies, or categorization schema.
- Drag-and-drop editor for easy assignment of words and phrases to categories.
- Ability to restrict the analysis to specific portions of a text or to exclude comments and annotations.
- Ability to perform an analysis on a random sample of cases.
- Integrated spell-checking with support for more than 20 languages such as English, French, Spanish, etc.
- Integrated thesaurus to assist the creation of taxonomies and comprehensive categorization schemas (English, French, Spanish, Italian, Portuguese and German).
- Powerful case filtering on any numeric or alphanumeric field and on code occurrence (with AND, OR, and NOT Boolean operators).
- Prints presentation-quality tables.
- Imports ANSI and Unicode text files as well as MS Word, WordPerfect, RTF, HTML and PDF documents.
- Exports any table to Excel, SPSS, ASCII, tab-separated or comma-separated value files, or HTML files.
- Flexible keyword highlighting (the text editor can display all categories using different colors).
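To make the dictionary-based categorization, exclusion list and proximity rules above more concrete, here is a minimal Python sketch of the general technique. It is not WordStat's implementation or rule syntax: the dictionary, the stop list, the `categorize` and `near` helpers and the sample comment are all hypothetical, and wildcard matching is approximated with the standard library's `fnmatch`.

```python
import re
from collections import Counter
from fnmatch import fnmatchcase

# Hypothetical exclusion (stop) list and categorization dictionary.
STOP_WORDS = {"the", "a", "an", "and", "or", "but", "was", "is", "it"}
DICTIONARY = {
    "SERVICE": ["staff", "wait*", "support"],
    "PRICE": ["price*", "cost*", "expensive"],
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def categorize(text):
    """Count dictionary categories, skipping words on the exclusion list."""
    counts = Counter()
    for token in tokenize(text):
        if token in STOP_WORDS:
            continue
        for category, patterns in DICTIONARY.items():
            if any(fnmatchcase(token, p) for p in patterns):
                counts[category] += 1
    return counts

def near(text, first, second, distance=3):
    """True if a match of `first` lies within `distance` words of a match of `second`."""
    tokens = tokenize(text)
    pos_a = [i for i, t in enumerate(tokens) if fnmatchcase(t, first)]
    pos_b = [i for i, t in enumerate(tokens) if fnmatchcase(t, second)]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)

comment = "The staff was friendly but the waiting time and the prices were not good."
print(categorize(comment))                        # category counts
print(near(comment, "not", "good", distance=2))   # simple NEAR-style rule
```

A rule such as "good, unless NOT is nearby" could then be expressed by combining `categorize` with `near`; AFTER and BEFORE variants would compare the sign of `i - j` instead of its absolute value.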
UNIVARIATE KEYWORD FREQUENCY ANALYSIS
- Univariate word frequency analysis (word or category count and record occurrence).
- Word x word co-occurrence matrix.
- Word x case data matrix.
- Integrated multidimensional scaling with 2D and 3D maps.
- Proximity plot.
- Vocabulary finder extracts technical terms, product and company names as well as common misspellings.
- Phrase finder allows one to easily identify recurring phrases and expressions
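The difference between a word count and a record occurrence, and the shape of a word x case matrix, can be illustrated with a short Python sketch. The sample cases and the chosen vocabulary are invented for illustration, and the tokenizer is deliberately simplistic.

```python
import re
from collections import Counter

# Hypothetical cases (records); any short text collection would do.
cases = [
    "Delivery was late and the delivery box was damaged",
    "Late delivery again",
    "Great product, fast shipping",
]

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

word_count = Counter()       # total number of occurrences of each word
case_occurrence = Counter()  # number of cases containing each word
for text in cases:
    tokens = tokenize(text)
    word_count.update(tokens)
    case_occurrence.update(set(tokens))

# Word x case matrix: one row per case, one column per selected word.
vocabulary = ["delivery", "late", "shipping"]
matrix = [[tokenize(text).count(word) for word in vocabulary] for text in cases]

print(word_count["delivery"], case_occurrence["delivery"])
print(matrix)
```

Here "delivery" occurs three times overall but in only two cases, which is why the two measures can rank keywords differently.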
NORM CREATION AND COMPARISON
- Ability to create norm files based on frequency analysis of words or content categories.
- Comparison of obtained frequencies to previously saved norm files.
KEYWORD RETRIEVAL FUNCTION
- A powerful keyword retrieval function allows identification of text units (documents, paragraphs or sentences) containing one keyword or a combination of keywords with optional filtering of cases.
- Ability to attach QDA Miner codes to retrieved segments.
- Retrieved segments may be exported to disk in tabular format (Excel or delimited text files) or as text reports (Rich Text Format).
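As a rough illustration of retrieving sentences that contain a combination of keywords, the following Python sketch uses a naive sentence splitter and simple include/exclude keyword sets. The `retrieve` function, its `all_of` and `none_of` parameters and the sample documents are assumptions made for this example, not part of WordStat.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def retrieve(documents, all_of=(), none_of=()):
    """Return (document index, sentence) pairs matching the keyword combination."""
    hits = []
    for doc_id, text in enumerate(documents):
        for sentence in split_sentences(text):
            words = set(re.findall(r"[a-z]+", sentence.lower()))
            if all(k in words for k in all_of) and not any(k in words for k in none_of):
                hits.append((doc_id, sentence))
    return hits

documents = [
    "The battery died after a week. The screen is fine.",
    "Battery life is great. The battery charger died, though!",
]
hits = retrieve(documents, all_of=["battery", "died"], none_of=["charger"])
print(hits)
```

Because each hit keeps its document index, the retrieved segments can be written out in tabular form or traced back to the source document.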
KEYWORD CO-OCCURRENCE ANALYSIS
- Integrated clustering and dendrogram display of keyword co-occurrence.
- First- and second-order proximity analysis.
- Proximity plot to easily identify all keywords that co-occur with a target keyword.
- 2D and 3D multidimensional scaling on either joint frequency or co-occurrence of words or categories.
- Flexible keyword co-occurrence criteria (within a case, a sentence, a paragraph, a window of n words, a user-defined segment) as well as clustering methods (first- and second-order proximity, choice of similarity measures).
- Easy text retrieval from dendrogram or proximity plots.
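Clustering and proximity plots of this kind start from a measure of how often two keywords appear together. The sketch below counts co-occurrence within a case and derives a Jaccard coefficient from it. Jaccard is used here only as one common similarity measure; the source does not say which measures WordStat offers, and the sample cases are made up.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence(cases):
    """Count, for each pair of words, the number of cases containing both."""
    occurrence = Counter()
    joint = Counter()
    for text in cases:
        words = sorted(set(re.findall(r"[a-z]+", text.lower())))
        occurrence.update(words)
        joint.update(combinations(words, 2))  # pairs are stored in sorted order
    return occurrence, joint

def jaccard(a, b, occurrence, joint):
    """Jaccard similarity: cases with both words / cases with either word."""
    both = joint[tuple(sorted((a, b)))]
    either = occurrence[a] + occurrence[b] - both
    return both / either if either else 0.0

cases = ["late delivery", "late delivery refund", "refund requested", "fast delivery"]
occurrence, joint = cooccurrence(cases)
print(jaccard("late", "delivery", occurrence, joint))
print(jaccard("late", "refund", occurrence, joint))
```

Changing the co-occurrence criterion (sentence, paragraph, window of n words) only changes what is passed in as a "case"; the similarity computation stays the same.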
ANALYSIS OF CASE OR DOCUMENT SIMILARITY
- Hierarchical clustering, multidimensional scaling and proximity plot may be used to explore the similarity between documents or cases.
MULTIPLE RESPONSES AND COMPARISONS
- Can perform univariate frequency analysis and crosstabulation on information stored in several alphanumeric fields (memo or string variables).
- Comparison of keyword occurrence between different fields.
- Computes inter-rater agreement measures (percentage of agreement, Cohen’s Kappa, Scott’s Pi, Krippendorff’s R and r-bar, free marginal) based on codes manually entered in different variables.
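For readers unfamiliar with these agreement measures, the following Python sketch computes three of them (percentage of agreement, Cohen's Kappa and Scott's Pi) from their textbook definitions for two raters. The rater codes are invented, and the functions assume that chance agreement is below 1; otherwise the final division is undefined.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Share of units that both raters coded identically."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement using each rater's own marginal distribution."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

def scott_pi(rater_a, rater_b):
    """Chance-corrected agreement using the pooled marginal distribution."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    pooled = Counter(rater_a) + Counter(rater_b)
    expected = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return (observed - expected) / (1 - expected)

# Hypothetical codes assigned to eight text segments by two raters.
rater_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
rater_b = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg"]

print(percent_agreement(rater_a, rater_b))
print(cohen_kappa(rater_a, rater_b))
print(scott_pi(rater_a, rater_b))
```

With these data the raters agree on 6 of 8 segments (0.75), while chance agreement is 0.5 under both models, so both corrected coefficients come out at 0.5.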
BIVARIATE COMPARISONS BETWEEN SUBGROUPS
- Bivariate comparison between any textual field and any nominal or ordinal variable (such as the sex of the respondent, specific subgroups, years of publication, etc.).
- Choice between 11 different association measures to assess the relationship between word occurrence and nominal or ordinal variables (Chi-square, Likelihood ratio, Tau-a, Tau-b, Tau-c, symmetric Somers’ D, asymmetric Somers’ Dxy and Dyx, Gamma, Pearson’s R, Spearman’s Rho).
- Computation of statistics on either absolute or relative frequencies.
- Ability to sort matrix in alphabetic order of words, by word frequency or word occurrence, on the obtained statistics or on its probability.
- Visually compare items between subgroups using bar charts and line charts.
- Correspondence analysis (statistics, 2D & 3D joint plots). This feature is accessible from the crosstab page and allows one to see graphically the relationship between nominal variables and codes resulting from a content analysis.
- Heatmap plot (with dual-clustering of keywords and variables)
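As an example of one of the association measures listed above, this Python sketch computes the Pearson chi-square statistic for a keyword-by-subgroup contingency table. The counts are hypothetical, and the function assumes every row and column total is greater than zero.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Hypothetical counts: rows = keyword present / absent, columns = two subgroups.
table = [[30, 10],
         [20, 40]]
print(chi_square(table))
```

For this table the expected counts are 20, 20, 30 and 30, giving a statistic of 50/3 (about 16.67) with one degree of freedom.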
AUTOMATED TEXT CLASSIFICATION
- Machine learning algorithms (Naive Bayes and K-Nearest Neighbors) for document classification.
- Flexible feature selection for automatic selection of best subsets of attributes.
- Numerous validation methods (leave-one-out, n-fold cross-validation, split sample).
- Experimentation module allows easy comparison of predictive models and fine-tuning of classification models.
- Classification models may be saved to disk and applied later using either a standalone document classification utility program, a command line program or a programming library. Note: The command line program and the programming library are part of the WordStat Software Developer’s Kit (SDK), which is sold separately.
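The general idea behind Naive Bayes document classification with leave-one-out validation can be sketched in a few lines of Python. This is a textbook multinomial Naive Bayes with add-one smoothing, not WordStat's algorithm or its SDK; the training examples, labels and function names are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        class_docs[label] += 1
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_docs, word_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokenize(text):
            if token in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def leave_one_out_accuracy(examples):
    """Train on all examples but one, test on the held-out one, and average."""
    correct = 0
    for i, (text, label) in enumerate(examples):
        model = train(examples[:i] + examples[i + 1:])
        correct += predict(model, text) == label
    return correct / len(examples)

examples = [
    ("refund my money now", "complaint"),
    ("terrible service want refund", "complaint"),
    ("broken item terrible quality", "complaint"),
    ("great service thank you", "praise"),
    ("love the quality great item", "praise"),
    ("thank you great support", "praise"),
]
model = train(examples)
print(predict(model, "I want a refund for this broken item"))
print(leave_one_out_accuracy(examples))
```

With a corpus this small the leave-one-out estimate is very noisy; the point is only to show how each document is scored by a model that never saw it during training.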
KEYWORD-IN-CONTEXT (KWIC)
- Ability to display a KWIC table to examine the textual context of a word, word pattern, or category.
- Ability to sort the table on any independent (numeric) variable.
- Ability to jump from a KWIC keyword to the textual variable in order to view or edit the original text.
- KWIC list can be saved in data files for further processing.
- Customizable KWIC display (paragraph, sentence or user-defined segment).
- Concordance report (displays all hits as a list of paragraphs, sentences or user-defined segments).
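A keyword-in-context table simply lists every hit of a keyword with a few words of context on either side. The Python sketch below shows the idea with a fixed window of words; the `kwic` function, its `width` parameter and the sample text are assumptions for illustration only.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit of `keyword`."""
    tokens = re.findall(r"\w+", text)
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows

text = "The bank raised rates. We walked along the river bank at dusk."
for left, hit, right in kwic(text, "bank"):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20} | {hit} | {right}")
```

Seeing "bank" next to "raised rates" in one row and "river" in the other is the kind of evidence used for word-sense disambiguation and dictionary validation.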
FULL INTEGRATION WITH STATISTICAL SOFTWARE
- Alphanumeric variables can be stored in the same file as all other numeric variables.
- Variable selection, statistical analysis and content analysis are performed within the same application program.
- Matrix outputs are automatically added to existing statistical outputs.
- New variables representing occurrence of words, keywords or concepts can be added to the existing data file or exported to a new data file in order to be submitted to further statistical analysis (such as cluster analysis on words or cases, principal coordinate analysis, correspondence analysis, multiple regression, etc.).
- Data can be imported from and exported to different file formats including dBase, Paradox, Excel, Quattro Pro, Lotus 1-2-3, SPSS for DOS, SPSS for Windows, comma or tab separated text files, etc.
- Ability to perform numeric and alphanumeric transformation or to apply filters on records of the data file to restrict the analysis to specific subgroups.
- Dictionary building assistant to find related words (synonyms, antonyms, holonyms, meronyms, hypernyms, hyponyms) in a WordNet-based thesaurus (English only; 100,000 synonyms, 120,000 root words).
- WS Document Classifier, a small standalone application to apply previously saved categorization and classification models to external documents.
- Document Conversion Wizard: a utility program to easily import documents. Various file formats may be directly imported, such as plain text (ANSI, Unicode), HTML, RTF, MS Word, WordPerfect and Adobe PDF.
- Optional removal of leading and trailing spaces and hard returns.
- Extraction of numeric, alphanumeric and date variables from structured documents.
- Extraction options may be saved on disk and later retrieved.
- Documents may be stored as plain ANSI text or as RTF documents.