Feature extraction [5]
CountVectorizer implements both tokenization and occurrence counting in a single class.
This model has many parameters; however, the default values are quite reasonable (see the reference documentation for details).
Let’s use it to tokenize and count the word occurrences of a minimalistic corpus of text documents:
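A minimal sketch of this step follows. The corpus is an assumption: it is the four-sentence example from the scikit-learn documentation, chosen because its first row of counts matches the [0, 1, 1, 1, 0, 0, 1, 0, 1] row quoted later in this post. The variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed corpus (taken from the scikit-learn documentation example).
corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# Default settings: lowercase the text and keep tokens of 2+ characters.
vectorizer = CountVectorizer()

# fit_transform learns the vocabulary and returns a sparse count matrix
# with one row per document and one column per term.
X = vectorizer.fit_transform(corpus)

# Convert to a dense array only for inspection; large corpora should stay sparse.
counts = X.toarray()
print(counts)
```

Each row of the printed array is one document, and each column is the number of times one vocabulary term occurs in it.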
Explanation:
The fit function tokenizes the documents and adds each token to the vocabulary dictionary.
fit(raw_documents[, y]) | Learn a vocabulary dictionary of all tokens in the raw documents. |
transform(raw_documents) | Transform documents to document-term matrix. |
fit_transform(raw_documents[, y]) | Learn the vocabulary dictionary and return the document-term matrix. |
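The difference between the three methods can be seen by calling fit and transform separately. The sketch below again assumes the example corpus from the scikit-learn documentation; the new sentence passed to transform is likewise illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed corpus (taken from the scikit-learn documentation example).
corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

vectorizer = CountVectorizer()

# Step 1: fit only learns the vocabulary (term -> column index).
vectorizer.fit(corpus)
vocabulary = vectorizer.vocabulary_

# Step 2: transform counts terms using the vocabulary learned above.
X_same = vectorizer.transform(corpus)

# Words that were not seen during fit are ignored by transform,
# so this row contains only zeros.
X_new = vectorizer.transform(["Something completely new."])

print(vocabulary)
print(X_new.toarray())
```

Calling fit followed by transform on the same documents gives the same counts as a single fit_transform call.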
The default configuration tokenizes the string by extracting words of at least 2 letters. Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. This interpretation of the columns can be retrieved as follows:
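A possible way to inspect the column names and the default tokenizer is sketched here. The corpus is the assumed scikit-learn documentation example. Because get_feature_names() was renamed get_feature_names_out() in later scikit-learn releases, the sketch looks up whichever of the two the installed version provides.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed corpus (taken from the scikit-learn documentation example).
corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# Use get_feature_names_out() when available, else the older get_feature_names().
get_names = getattr(vectorizer, "get_feature_names_out", None) or vectorizer.get_feature_names
feature_names = list(get_names())

# build_analyzer() returns the preprocessing + tokenizing function itself,
# which makes the "at least 2 letters" rule visible: the word "a" is dropped.
analyze = vectorizer.build_analyzer()
tokens = analyze("This is a text document to analyze.")

print(feature_names)
print(tokens)
```

The position of a name in feature_names is the column index of that term in the count matrix.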
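Explanation:
After feature extraction, get_feature_names() (renamed get_feature_names_out() in scikit-learn 1.0 and later) returns the feature names ordered by column index, which can then be matched against the resulting count array.
Calling X.toarray() shows the word counts for each document, for example [0, 1, 1, 1, 0, 0, 1, 0, 1].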
get_feature_names() | Array mapping from feature integer indices to feature name |
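To make the mapping between feature names and counts concrete, here is a rough pure-Python approximation of the default behaviour (lowercasing, tokens of two or more word characters, vocabulary sorted alphabetically). It is not scikit-learn's actual implementation; the helper names tokenize and count_row are hypothetical, and the corpus is the assumed scikit-learn documentation example.

```python
import re
from collections import Counter

# Assumed corpus (taken from the scikit-learn documentation example).
corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# Approximation of the default token pattern: runs of 2+ word characters.
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and return its tokens (hypothetical helper)."""
    return TOKEN.findall(text.lower())


# Assign each distinct term a column index, in alphabetical order.
all_terms = sorted({term for doc in corpus for term in tokenize(doc)})
vocabulary = {term: index for index, term in enumerate(all_terms)}


def count_row(text):
    """Return the count vector for one document (hypothetical helper)."""
    row = [0] * len(vocabulary)
    for term, n in Counter(tokenize(text)).items():
        if term in vocabulary:  # unseen terms are ignored
            row[vocabulary[term]] = n
    return row


first_row = count_row(corpus[0])

# Pair every feature name with its count in the first document.
word_counts = {term: first_row[index] for term, index in vocabulary.items()}
print(first_row)
print(word_counts)
```

Reading the first row against the vocabulary shows that the ones sit in the columns for "document", "first", "is", "the", and "this", which are exactly the words of the first sentence.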
References:
1. Machine Learning Tutorial: The Naive Bayes Text Classifier
2. Naive Bayes
3. Working With Text Data — scikit-learn 0.16.1 documentation
4. Text Classification
5. Feature extraction
Videos:
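Grade 10 Math (second semester) 3-0 Introduction 01: What Is Probability
Grade 10 Math (second semester) 3-3A Concepts 01: The Concept of Conditional Probability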