Monday, April 20, 2015

Python Text Classification using Naive Bayes and scikit-learn


Feature extraction [5]

CountVectorizer implements both tokenization and occurrence counting in a single class:
>>>
>>> from sklearn.feature_extraction.text import CountVectorizer
This model has many parameters, but the default values are quite reasonable (see the reference documentation for details):
>>>
>>> vectorizer = CountVectorizer(min_df=1)
>>> vectorizer                     
CountVectorizer(analyzer=...'word', binary=False, decode_error=...'strict',
        dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=...'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
Let’s use it to tokenize and count the word occurrences of a minimalistic corpus of text documents:
Notes: 
fit() tokenizes the documents and builds the vocabulary dictionary.
fit(raw_documents[, y]): Learn a vocabulary dictionary of all tokens in the raw documents.
transform() counts the occurrences of each token in each document.
transform(raw_documents): Transform documents to a document-term matrix.
fit_transform(raw_documents[, y]): Learn the vocabulary dictionary and return the document-term matrix.
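The fit/transform split described above can be sketched as follows: calling fit() and then transform() separately produces the same count matrix as a single fit_transform() call (a minimal sketch using a two-document toy corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
]

vectorizer = CountVectorizer()
# fit() learns the vocabulary dictionary from the raw documents
vectorizer.fit(corpus)
# transform() counts token occurrences using the learned vocabulary
X = vectorizer.transform(corpus)

# fit_transform() does both steps at once and yields the same matrix
X2 = CountVectorizer().fit_transform(corpus)
print((X.toarray() == X2.toarray()).all())  # True
```

Using fit() and transform() separately matters when you need to vectorize new documents later with the vocabulary learned from the training set.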
>>>
>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X                              
<4x9 matrix="" numpy.int64="" of="" sparse="" type="">'
    with 19 stored elements in Compressed Sparse ... format>
The default configuration tokenizes the string by extracting words of at least 2 letters. Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. This interpretation of the columns can be retrieved as follows:
>>>
>>> vectorizer.get_feature_names() == (
...     ['and', 'document', 'first', 'is', 'one',
...      'second', 'the', 'third', 'this'])
True

>>> X.toarray()           
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)

Notes: 
After feature extraction, get_feature_names() returns the feature strings indexed by column, which line up with the columns of the count array.
Calling X.toarray() shows, for each document, the count of each word; e.g. the row [0, 1, 1, 1, 0, 0, 1, 0, 1] maps as:
'This is the first document.' -> ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
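The mapping just described can also be inspected through the vectorizer's vocabulary_ attribute, which maps each token to its column index. A sketch, assuming the same four-document corpus as above:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps token -> column index; sorting by index
# recovers the feature-name order of the matrix columns
names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(names)

# pair each token with its count in the first document
row0 = X.toarray()[0]
print(dict(zip(names, row0)))
```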


get_feature_names(): Array mapping from feature integer indices to feature names.
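Once documents are vectorized this way, the count matrix can feed the Naive Bayes classifier named in this post's title. A minimal sketch with MultinomialNB; the training documents and sentiment labels here are invented purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data -- documents and labels are made up
train_docs = [
    'good great excellent',
    'happy wonderful good',
    'bad terrible awful',
    'horrible bad sad',
]
train_labels = ['pos', 'pos', 'neg', 'neg']

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

clf = MultinomialNB()
clf.fit(X_train, train_labels)

# New documents must be transformed with the *same* fitted vectorizer;
# words never seen during fit (e.g. 'day') are simply ignored
X_test = vectorizer.transform(['good happy day', 'awful terrible day'])
print(clf.predict(X_test))
```

Note that transform(), not fit_transform(), is used on the test documents, so the columns of X_test line up with the vocabulary the classifier was trained on.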



References:

1. Machine Learning Tutorial: The Naive Bayes Text Classifier

2. Naive Bayes

3. Working With Text Data — scikit-learn 0.16.1 documentation

4. Text Classification

5. Feature extraction

Videos:

Senior High Math 3-0 Introduction 01: What Is Probability?

Senior High Math 3-3A Concept 01: The Concept of Conditional Probability

