Jerry's Space: Python Text Classification using Naive Bayes and scikit-learn

Feature extraction [5]

CountVectorizer implements both tokenization and occurrence counting in a single class:

>>> from sklearn.feature_extraction.text import CountVectorizer

This model has many parameters; however, the default values are quite reasonable (see the reference documentation for details):

>>> vectorizer = CountVectorizer(min_df=1)
>>> vectorizer                     
CountVectorizer(analyzer=...'word', binary=False, decode_error=...'strict',
        dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=...'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
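
The token_pattern shown in the output above can be tried on its own with Python's standard re module. This is only a minimal sketch of the idea, assuming text is lowercased before matching (as lowercase=True suggests); tokenize is a hypothetical helper written for illustration, not a scikit-learn function:

```python
import re

# Default token_pattern from the CountVectorizer output above:
# two or more word characters between word boundaries.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Hypothetical helper: lowercase the text, then extract tokens."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("This is a first document.")
print(tokens)  # ['this', 'is', 'first', 'document']
```

Note that the single-character word "a" is dropped and punctuation is treated as a separator.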

Let’s use the vectorizer to tokenize and count the word occurrences of a minimalistic corpus of text documents:

Notes:

The fit function tokenizes the documents and adds the tokens to the vocabulary dictionary.

fit(raw_documents[, y]) Learn a vocabulary dictionary of all tokens in the raw documents.

The transform function counts the occurrences of each word.

transform(raw_documents) Transform documents to document-term matrix.

fit_transform(raw_documents[, y]) Learn the vocabulary dictionary and return the document-term matrix.
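
To make the split between fit and transform concrete, here is a rough pure-Python sketch of the two steps. It is not scikit-learn's implementation: the function names mirror the methods above, but tokenize is a hypothetical helper, the vocabulary is assumed to be sorted alphabetically, and the result is a plain list of lists rather than a sparse matrix:

```python
import re

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Hypothetical helper: lowercase, then keep tokens of 2+ characters."""
    return TOKEN_PATTERN.findall(text.lower())

def fit(raw_documents):
    """Learn a vocabulary: token -> column index (assumed alphabetical order)."""
    tokens = sorted({tok for doc in raw_documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(raw_documents, vocabulary):
    """Count vocabulary tokens per document; one row per document."""
    rows = []
    for doc in raw_documents:
        row = [0] * len(vocabulary)
        for tok in tokenize(doc):
            if tok in vocabulary:  # tokens unseen during fit are ignored
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows

docs = ["the cat sat", "the cat saw the dog"]
vocab = fit(docs)
counts = transform(docs, vocab)
print(vocab)   # {'cat': 0, 'dog': 1, 'sat': 2, 'saw': 3, 'the': 4}
print(counts)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

In this sketch, fit_transform would simply be transform(docs, fit(docs)).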

>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X                              
<4x9 sparse matrix of type '<... 'numpy.int64'>'
    with 19 stored elements in Compressed Sparse ... format>

The default configuration tokenizes the string by extracting words of at least 2 letters. Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. This interpretation of the columns can be retrieved as follows:

>>> vectorizer.get_feature_names() == (
...     ['and', 'document', 'first', 'is', 'one',
...      'second', 'the', 'third', 'this'])
True

>>> X.toarray()           
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)
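
The dense array also explains the earlier summary of X as a 4x9 matrix with 19 stored elements: a sparse matrix only stores the non-zero counts. A small check in plain Python, with the array copied by hand from the output above (the variable names are just for illustration):

```python
# Dense counts copied from the X.toarray() output above.
dense = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 2, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]

n_rows, n_cols = len(dense), len(dense[0])
# A sparse format keeps only the non-zero entries.
stored = sum(1 for row in dense for value in row if value != 0)

print(n_rows, n_cols, stored)  # 4 9 19
```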

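Notes:
After feature extraction, get_feature_names() returns the feature names ordered by column index, which can then be mapped onto the resulting count array.
Calling X.toarray() shows the word counts for each document. For example, the first document maps to the row [0, 1, 1, 1, 0, 0, 1, 0, 1], where each position counts the feature name in the same position:

'This is the first document.' -> [0, 1, 1, 1, 0, 0, 1, 0, 1] over ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']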

get_feature_names() Array mapping from feature integer indices to feature names.
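
The mapping between feature names and a count row can be spelled out by pairing them up with zip. A minimal sketch, with the names and the first two rows copied by hand from the outputs above; counts_for_row is a hypothetical helper, not part of scikit-learn:

```python
feature_names = ['and', 'document', 'first', 'is', 'one',
                 'second', 'the', 'third', 'this']

def counts_for_row(row):
    """Hypothetical helper: pair each name with its count, keep non-zero ones."""
    return {name: count for name, count in zip(feature_names, row) if count > 0}

first = counts_for_row([0, 1, 1, 1, 0, 0, 1, 0, 1])   # 'This is the first document.'
second = counts_for_row([0, 1, 0, 1, 0, 2, 1, 0, 1])  # 'This is the second second document.'

print(first)   # {'document': 1, 'first': 1, 'is': 1, 'the': 1, 'this': 1}
print(second)  # {'document': 1, 'is': 1, 'second': 2, 'the': 1, 'this': 1}
```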

References:

1. Machine Learning Tutorial: The Naive Bayes Text Classifier

2. Naive Bayes

3. Working With Text Data — scikit-learn 0.16.1 documentation

4. Text Classification

5. Feature extraction

Videos:

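Senior High Year 1 (Second Semester) Math 3-0 Introduction 01: What Is Probability (高一下數學3-0引言01什麼是機率)

Senior High Year 1 (Second Semester) Math 3-3A Concepts 01: The Concept of Conditional Probability (高一下數學3-3A觀念01條件機率的概念)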


Monday, April 20, 2015
