Financial Sentiment Analysis¶
By: Jordy Alfaro Brenes
Date: April 2025
This notebook analyzes sentiment in financial texts using a combination of natural language processing and machine learning techniques. The dataset contains financial sentences labeled with sentiment (positive, negative, or neutral).
Dataset Information¶
The dataset combines data from FiQA and Financial PhraseBank, providing financial sentences with sentiment labels. It's intended for advancing financial sentiment analysis research.
Citation: Malo, Pekka, et al. "Good debt or bad debt: Detecting semantic orientations in economic texts." Journal of the Association for Information Science and Technology 65.4 (2014): 782-796.
# 1. Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import warnings
warnings.filterwarnings('ignore')
# Download the NLTK resources used for tokenization, stopwords, and lemmatization
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
# Set visualization style
plt.style.use('fivethirtyeight')
sns.set(style='whitegrid', palette='muted', font_scale=1.2)
# 2. Load and Explore the Data
# Load the dataset
df = pd.read_csv('data.csv')
# Display basic information
print(f"Dataset shape: {df.shape}")
print("\nFirst 5 rows:")
display(df.head())
# Check columns and data types
print("\nData types:")
df.info()
# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())
# Check class distribution
print("\nClass distribution:")
print(df['Sentiment'].value_counts())
print(df['Sentiment'].value_counts(normalize=True) * 100)
# Display examples from each class
print("\nExample of positive sentence:")
print(df[df['Sentiment'] == 'positive']['Sentence'].iloc[0])
print("\nExample of negative sentence:")
print(df[df['Sentiment'] == 'negative']['Sentence'].iloc[0])
if 'neutral' in df['Sentiment'].unique():
    print("\nExample of neutral sentence:")
    print(df[df['Sentiment'] == 'neutral']['Sentence'].iloc[0])
Dataset shape: (5842, 2)

First 5 rows:

|   | Sentence | Sentiment |
|---|---|---|
| 0 | The GeoSolutions technology will leverage Bene... | positive |
| 1 | $ESI on lows, down $1.50 to $2.50 BK a real po... | negative |
| 2 | For the last quarter of 2010 , Componenta 's n... | positive |
| 3 | According to the Finnish-Russian Chamber of Co... | neutral |
| 4 | The Swedish buyout firm has sold its remaining... | neutral |

Data types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5842 entries, 0 to 5841
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Sentence   5842 non-null   object
 1   Sentiment  5842 non-null   object
dtypes: object(2)
memory usage: 91.4+ KB

Missing values:
Sentence     0
Sentiment    0
dtype: int64

Class distribution:
Sentiment
neutral     3130
positive    1852
negative     860
Name: count, dtype: int64
Sentiment
neutral     53.577542
positive    31.701472
negative    14.720986
Name: proportion, dtype: float64

Example of positive sentence:
The GeoSolutions technology will leverage Benefon 's GPS solutions by providing Location Based Search Technology , a Communities Platform , location relevant multimedia content and a new and powerful commercial model .

Example of negative sentence:
$ESI on lows, down $1.50 to $2.50 BK a real possibility

Example of neutral sentence:
According to the Finnish-Russian Chamber of Commerce , all the major construction companies of Finland are operating in Russia .
# 3.1 Sentiment Distribution Visualization
plt.figure(figsize=(10, 6))
ax = sns.countplot(x='Sentiment', data=df, palette='viridis')
plt.title('Distribution of Financial Sentiment', fontsize=16)
plt.xlabel('Sentiment', fontsize=14)
plt.ylabel('Count', fontsize=14)
# Add count labels on top of bars
for p in ax.patches:
    ax.annotate(f'{p.get_height()}',
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='bottom',
                fontsize=12)
plt.tight_layout()
plt.show()
Sentiment Distribution Analysis¶
The chart shows an imbalanced distribution of sentiments in the financial dataset:
- Neutral: 3,130 examples (53.6%)
- Positive: 1,852 examples (31.7%)
- Negative: 860 examples (14.7%)

This imbalance is significant and typical in financial data, where:

- Neutral predominance: Most financial communications tend to be objective and factual, avoiding overtly positive or negative language.
- Positive bias: There are more than twice as many positive examples as negative ones, potentially reflecting a general optimism in financial communications or a tendency to present information favorably.
- Negative minority: Texts with negative sentiment represent only 14.7% of the total, which presents a challenge for model training, since there are fewer examples of this sentiment to learn from.
Implications for Modeling¶
This class imbalance will have several important implications for our classification model:
Risk of bias: The model might become biased toward majority classes (neutral and positive) and struggle to correctly detect negative sentiment.
Performance evaluation: Overall accuracy might not be the best metric due to the imbalance. We'll need to analyze precision, recall, and F1-score for each class.
Strategies to consider (a brief code sketch follows this list):
- Undersampling or oversampling techniques to balance classes
- Class weights in algorithms that support them
- Threshold adjustment for decision boundaries
- Focus on F1-score as an evaluation metric
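As a concrete starting point for the class-weight strategy, the sketch below (an illustrative addition, not part of the original pipeline) derives balanced weights from the label distribution; the resulting dictionary can be passed to any scikit-learn model that accepts a class_weight argument.

# Sketch: compute balanced class weights from the label distribution.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.array(['negative', 'neutral', 'positive'])
weights = compute_class_weight('balanced', classes=classes, y=df['Sentiment'])
print(dict(zip(classes, weights.round(2))))
# The same effect is available built in, e.g. LogisticRegression(class_weight='balanced').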
# 3.2 Text Length Analysis
df['text_length'] = df['Sentence'].apply(len)
plt.figure(figsize=(12, 6))
sns.histplot(data=df, x='text_length', hue='Sentiment', bins=50, kde=True, palette='viridis')
plt.title('Distribution of Text Length by Sentiment', fontsize=16)
plt.xlabel('Text Length (characters)', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.xlim(0, df['text_length'].quantile(0.99)) # Remove outliers for better visualization
plt.tight_layout()
plt.show()
Text Length Distribution Analysis by Sentiment¶
The histogram shows how text length varies for each sentiment category:
Key Observations¶
General distribution: All sentiment categories show unimodal, right-skewed distributions: most texts are of moderate length, with a tail of much longer ones.
Differences by sentiment:
- Neutral: Shows the greatest variability in length and tends to have longer texts on average, with a peak around 90-110 characters.
- Positive: Concentrated mainly in the 50-100 character range, with a peak near 60 characters.
- Negative: Tends to have the shortest texts, with a peak around 70 characters, and fewer long texts.
Analytical implications:
- More concise negative texts might reflect direct statements about problems or losses.
- Longer neutral texts likely present detailed factual information or technical descriptions.
- The intermediate length of positive texts might indicate a balance between delivering good news and maintaining a professional tone.
Modeling Considerations¶
- The variation in length could be an informative feature for the model, so we might consider including text length as an additional feature.
- We could explore length normalization for some vectorization techniques.
- For very long or very short texts, it may be useful to examine whether the model makes more errors, suggesting the need for additional techniques to handle these extreme cases.
# 3.3 Word Count Analysis
df['word_count'] = df['Sentence'].apply(lambda x: len(str(x).split()))
plt.figure(figsize=(12, 6))
sns.boxplot(x='Sentiment', y='word_count', data=df, palette='viridis')
plt.title('Word Count Distribution by Sentiment', fontsize=16)
plt.xlabel('Sentiment', fontsize=14)
plt.ylabel('Word Count', fontsize=14)
plt.ylim(0, df['word_count'].quantile(0.99)) # Remove outliers for better visualization
plt.tight_layout()
plt.show()
Word Count Distribution Analysis by Sentiment¶
The boxplot displays the distribution of word counts in texts, segmented by sentiment:
Main Observations¶
Median words by sentiment:
- Neutral: Approximately 21 words (highest median)
- Positive: Around 18 words
- Negative: About 17 words (lowest median)
Variability:
- Neutral: Greater dispersion, with a wider interquartile range (approximately 15-28 words)
- Positive and Negative: Show similar variability, but less than neutral texts
Outliers:
- All sentiments have outliers at the upper extreme (exceptionally long texts)
- The threshold for considering a text as unusually long appears to be around 45-50 words
Interpretation¶
Neutral texts tend to be longer, possibly because they contain more technical details, explanations, or contextual information about financial matters.
Negative texts are generally more concise, which could indicate that bad news or problems are communicated more directly and specifically.
The similarity between positive and negative text distributions suggests that both sentiment types are expressed with similar levels of conciseness, but neutral content requires more elaboration.
Processing Implications¶
Text length, measured in word count, could be a useful indicator to distinguish neutral content from emotional content (positive or negative).
Preprocessing will need to adequately handle both very short texts (2-3 words) and exceptionally long ones (over 45 words), as both extremes are present across all sentiments.
For modeling, we might consider length normalization or techniques that account for these differences, especially if we want to distinguish between positive and negative texts, which are more similar to each other in terms of length.
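As one illustrative way to act on this (a sketch, not part of the original pipeline), word counts can be scaled and appended to the TF-IDF matrix built in section 6; X_train and X_train_tfidf are assumed to exist as defined there.

# Sketch: append a scaled word-count column to the TF-IDF features.
# Assumes X_train (raw text Series) and X_train_tfidf from section 6.
from scipy.sparse import hstack
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
train_lengths = scaler.fit_transform(
    X_train.str.split().str.len().to_numpy().reshape(-1, 1)  # word counts, scaled to [0, 1]
)
X_train_combined = hstack([X_train_tfidf, train_lengths])  # TF-IDF plus one extra column
print(X_train_combined.shape)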
# 4.1 Text Preprocessing Function
def preprocess_text(text):
    """
    Clean and normalize a raw financial sentence.
    """
    # Convert to lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    # Remove mentions and hashtags
    text = re.sub(r'@\w+|\#', '', text)
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    stop_words.update(['s', 't', 've', 'll', 'd', 'm'])  # contraction fragments left after punctuation removal
    tokens = [word for word in tokens if word not in stop_words and len(word) > 1]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Join tokens back to text
    return ' '.join(tokens)
# 4.2 Apply Preprocessing
print("Preprocessing text data...")
df['cleaned_text'] = df['Sentence'].apply(preprocess_text)
print("Preprocessing complete!")
# Display examples of preprocessed text
print("\nOriginal vs Cleaned Text Examples:")
for i in range(3):
    print(f"\nOriginal: {df['Sentence'].iloc[i]}")
    print(f"Cleaned: {df['cleaned_text'].iloc[i]}")
Preprocessing text data...
Preprocessing complete!

Original vs Cleaned Text Examples:

Original: The GeoSolutions technology will leverage Benefon 's GPS solutions by providing Location Based Search Technology , a Communities Platform , location relevant multimedia content and a new and powerful commercial model .
Cleaned: geosolutions technology leverage benefon gps solution providing location based search technology community platform location relevant multimedia content new powerful commercial model

Original: $ESI on lows, down $1.50 to $2.50 BK a real possibility
Cleaned: esi low bk real possibility

Original: For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .
Cleaned: last quarter componenta net sale doubled eurm eurm period year earlier moved zero pretax profit pretax loss eurm
# 5.1 Word Frequency Analysis
def get_top_n_words(corpus, n=20):
    """
    Get the top n most frequent words in a corpus of text.
    """
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]
# Get top words for each sentiment class
positive_words = get_top_n_words(df[df['Sentiment'] == 'positive']['cleaned_text'], 20)
negative_words = get_top_n_words(df[df['Sentiment'] == 'negative']['cleaned_text'], 20)
if 'neutral' in df['Sentiment'].unique():
    neutral_words = get_top_n_words(df[df['Sentiment'] == 'neutral']['cleaned_text'], 20)
    has_neutral = True
else:
    has_neutral = False
# 5.2 Visualize Top Words
# Create dataframes for visualization
positive_df = pd.DataFrame(positive_words, columns=['word', 'count'])
negative_df = pd.DataFrame(negative_words, columns=['word', 'count'])
if has_neutral:
    neutral_df = pd.DataFrame(neutral_words, columns=['word', 'count'])
# Create bar charts for top words
if has_neutral:
    fig, axes = plt.subplots(1, 3, figsize=(24, 8))
else:
    fig, axes = plt.subplots(1, 2, figsize=(18, 8))
sns.barplot(x='count', y='word', data=positive_df, ax=axes[0], palette=['#55a868'])
axes[0].set_title('Top Words in Positive Sentences', fontsize=16)
sns.barplot(x='count', y='word', data=negative_df, ax=axes[1], palette=['#c44e52'])
axes[1].set_title('Top Words in Negative Sentences', fontsize=16)
if has_neutral:
    sns.barplot(x='count', y='word', data=neutral_df, ax=axes[2], palette=['#4c72b0'])
    axes[2].set_title('Top Words in Neutral Sentences', fontsize=16)
plt.tight_layout()
plt.show()
Word Frequency Analysis by Sentiment Category¶
The charts show the 20 most frequent words for each sentiment category after preprocessing (stopword removal, lemmatization, etc.).
Common Terms Across Categories¶
Interestingly, several words appear with high frequency across all three categories:
- "eur" (euro)
- "mln" (million)
- "company"
- "sale"
- "finnish" (possibly reflecting a focus on Finnish companies)
- "profit"
This suggests the dataset contains many financial texts related to European (particularly Finnish) companies and their financial results.
Distinctive Patterns by Sentiment¶
- Unique Words in Positive Texts:
- "increased"
- "rose"
These words clearly indicate growth and favorable trends.
- Unique Words in Negative Texts:
- "loss"
- "decreased"
- "compared" - possibly in negative comparison contexts
These words are associated with unfavorable financial results.
- Unique Words in Neutral Texts:
- "service"
- "business"
- "market"
- "group"
These terms tend to be more descriptive and less emotionally charged.
Model Implications¶
Context importance: Since many words appear across all categories, the model will need to learn to interpret these words in their proper context. For example, "profit" could appear in "profit increased" (positive) or "profit decreased" (negative).
Discriminative terms: Words like "loss", "increased" and "decreased" will likely be important features for the model when differentiating between sentiments.
N-gram relevance: The presence of similar words in different categories suggests that bigrams or trigrams (e.g., "profit increased" vs. "profit decreased") might be more informative than individual words.
Domain specialization: The high frequency of specific financial terms ("mln", "eur", "quarter") confirms we're working with a highly specialized dataset in the financial domain.
Potential Improvements¶
- Incorporate n-grams: Add bigrams and trigrams to our vectorization to better capture context (see the sketch after this list).
- Collocation analysis: Examine which words tend to appear together in each sentiment category.
- Custom weighting: Consider assigning specific weights to key discriminative terms during vectorization.
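To illustrate the first improvement, the sketch below rebuilds the TF-IDF features with unigrams and bigrams; it is a starting point rather than a tuned configuration.

# Sketch: TF-IDF with unigrams and bigrams, so that contrasts such as
# "profit increased" vs. "profit decreased" become distinct features.
from sklearn.feature_extraction.text import TfidfVectorizer

ngram_vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_ngrams = ngram_vectorizer.fit_transform(df['cleaned_text'])
print(X_ngrams.shape)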
# 6. Feature Engineering and Model Training
# Convert sentiment labels to numeric
sentiment_map = {'negative': 0, 'neutral': 1, 'positive': 2} if has_neutral else {'negative': 0, 'positive': 1}
df['sentiment_code'] = df['Sentiment'].map(sentiment_map)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    df['cleaned_text'],
    df['sentiment_code'],
    test_size=0.2,
    random_state=42,
    stratify=df['sentiment_code']
)
# Feature extraction with TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
# Display feature dimensions
print(f"Training features shape: {X_train_tfidf.shape}")
print(f"Testing features shape: {X_test_tfidf.shape}")
Training features shape: (4673, 5000)
Testing features shape: (1169, 5000)
# 7.1 Model Evaluation Function
def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    """
    Train a model, print accuracy, confusion matrix, and classification
    report, and plot the confusion matrix as a heatmap.
    """
    # Train model
    model.fit(X_train, y_train)
    # Predictions
    y_pred = model.predict(X_test)
    # Evaluation metrics
    accuracy = accuracy_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)
    print(f"--- {model_name} Results ---")
    print(f"Accuracy: {accuracy:.4f}")
    print("\nConfusion Matrix:")
    print(conf_matrix)
    print("\nClassification Report:")
    print(class_report)
    # Plot confusion matrix
    plt.figure(figsize=(8, 6))
    sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
                xticklabels=sentiment_map.keys(),
                yticklabels=sentiment_map.keys())
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.title(f'Confusion Matrix - {model_name}')
    plt.tight_layout()
    plt.show()
    return model, accuracy
# 7.2 Model 1: Multinomial Naive Bayes
nb_model = MultinomialNB()
nb_model, nb_accuracy = evaluate_model(
    nb_model, X_train_tfidf, X_test_tfidf, y_train, y_test, "Naive Bayes"
)
--- Naive Bayes Results ---
Accuracy: 0.6835
Confusion Matrix:
[[ 7 119 46]
[ 3 598 25]
[ 1 176 194]]
Classification Report:
precision recall f1-score support
0 0.64 0.04 0.08 172
1 0.67 0.96 0.79 626
2 0.73 0.52 0.61 371
accuracy 0.68 1169
macro avg 0.68 0.51 0.49 1169
weighted avg 0.68 0.68 0.63 1169
Naive Bayes Model Analysis¶
The confusion matrix and evaluation metrics for Naive Bayes reveal several important aspects about the model's performance:
Performance Analysis¶
- Overall Accuracy: 68.35%
- Moderate performance that exceeds random classification, but with significant room for improvement.
- Performance by class:
Negative Class (0):
- Precision: 64% - When predicting negative, it's correct 64% of the time
- Recall: 4% - Only identifies 4% of all actual negative texts
- F1-score: 8% - Extremely low performance for this class
Neutral Class (1):
- Precision: 67% - Moderate precision for neutral predictions
- Recall: 96% - Excellent ability to identify neutral texts
- F1-score: 79% - Good overall performance for this class
Positive Class (2):
- Precision: 73% - Best precision among the three classes
- Recall: 52% - Identifies just over half of positive texts
- F1-score: 61% - Moderate performance
- Key observations from confusion matrix:
- The model classifies the vast majority of texts as neutral (893 of 1169)
- It only predicts the negative class 11 times (less than 1% of predictions)
- The biggest confusion occurs with positive texts classified as neutral (176 cases)
Identified Problems¶
Extreme bias toward neutral class: The model has a strong tendency to classify texts as neutral, explaining the high recall (96%) but moderate precision (67%) for this class.
Underrepresentation of negative class: The model practically ignores the negative class, correctly classifying only 7 of 172 negative texts.
Performance imbalance: There's a dramatic difference in recall between classes (4% vs 96% vs 52%).
Possible Improvements¶
- Prior probability adjustment (class_prior):
- Modify prior probabilities to counteract class imbalance
- For example, `MultinomialNB(class_prior=[0.33, 0.33, 0.33])` assigns equal prior probability to each class
- Decision threshold adjustment:
- Implement a classifier with custom thresholds based on Naive Bayes probabilities
- Lower the threshold for the negative class to increase its recall
- Resampling techniques:
- Oversampling the minority class (negative) using techniques like SMOTE
- Undersampling the majority class (neutral)
- Feature engineering:
- Incorporate bigrams and trigrams to better capture context (e.g., "not good" vs "good")
- Experiment with weighting schemes like TF-IDF with different parameters
- Add features based on financial sentiment lexicons
- Consider alternative models:
- While Naive Bayes is efficient, other models like SVM or Random Forest might better handle the imbalance
- Ensemble of multiple classifiers to improve overall performance
- Naive Bayes-specific techniques:
- Adjust the alpha parameter (Laplace smoothing) to find the optimal value
- Try different Naive Bayes variants (Bernoulli NB instead of Multinomial)
The main challenge is improving the detection of the negative class without sacrificing too much performance in the other classes. A combined approach of parameter tuning and class rebalancing will likely yield the best results.
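As a quick starting point for the prior-probability and alpha adjustments suggested above, a sketch like the following could be tried; the grid values are placeholders, not tuned results.

# Sketch: uniform class priors plus a small grid over the smoothing
# parameter alpha, scored with macro F1 so the negative class counts equally.
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

nb_grid = GridSearchCV(
    MultinomialNB(class_prior=[1/3, 1/3, 1/3]),  # equal priors to counter the imbalance
    param_grid={'alpha': [0.01, 0.1, 0.5, 1.0, 2.0]},  # Laplace/Lidstone smoothing values
    scoring='f1_macro',
    cv=5,
)
nb_grid.fit(X_train_tfidf, y_train)
print(nb_grid.best_params_, round(nb_grid.best_score_, 4))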
# 7.3 Model 2: Logistic Regression
lr_model = LogisticRegression(max_iter=1000, C=1.0, solver='lbfgs', n_jobs=-1)
lr_model, lr_accuracy = evaluate_model(
    lr_model, X_train_tfidf, X_test_tfidf, y_train, y_test, "Logistic Regression"
)
--- Logistic Regression Results ---
Accuracy: 0.6929
Confusion Matrix:
[[ 23 116 33]
[ 30 556 40]
[ 8 132 231]]
Classification Report:
precision recall f1-score support
0 0.38 0.13 0.20 172
1 0.69 0.89 0.78 626
2 0.76 0.62 0.68 371
accuracy 0.69 1169
macro avg 0.61 0.55 0.55 1169
weighted avg 0.67 0.69 0.66 1169
Logistic Regression Model Analysis¶
The logistic regression model shows improvement over Naive Bayes, with more balanced performance across classes. Let's analyze the results in detail:
Overall Evaluation¶
- Overall Accuracy: 69.29%
- Slightly higher than Naive Bayes (68.35%)
- Performance by class:
Negative Class (0):
- Precision: 38% - Lower than Naive Bayes (64%)
- Recall: 13% - Significant improvement over Naive Bayes (4%)
- F1-score: 20% - Better balance than the previous model (8%)
Neutral Class (1):
- Precision: 69% - Slightly better than Naive Bayes
- Recall: 89% - Good, though lower than Naive Bayes (96%)
- F1-score: 78% - Similar to the previous model
Positive Class (2):
- Precision: 76% - Slightly higher than the previous model
- Recall: 62% - Significant improvement over Naive Bayes (52%)
- F1-score: 68% - Better than the previous model (61%)
- Confusion matrix analysis:
- Less extreme bias toward the neutral class (804 predictions vs 893 in NB)
- Better balance in negative predictions (61 vs 11 in NB)
- Improved identification of positive texts (231 correct vs 194 in NB)
Observed Improvements¶
Better class balance: Logistic regression distributes its predictions better among the three classes, though still showing a tendency toward the majority class (neutral).
Improvement in negative class: Though still the weakest point, detection of negative sentiment improved significantly (from 4% to 13% recall).
Better overall metrics: Both overall accuracy and class-specific metrics improved, especially the weighted F1-score (0.66 vs 0.63).
Additional Potential Improvements¶
- Decision threshold adjustment:
- Calibrate the logistic regression probabilities to further improve class balance
- Implement a custom decision scheme that favors the minority class (negative)
- Advanced feature engineering:
- Incorporate specific features for detecting negative sentiment
- Explore longer n-grams to better capture negative expressions
- Include features based on financial sentiment lexicons
- Hyperparameter tuning (a sketch follows this list):
- Optimize the regularization parameter C
- Try different solvers (liblinear, saga, etc.)
- Implement more aggressive class weighting with `class_weight='balanced'` or custom values
- Resampling techniques:
- Apply techniques like SMOTE or ADASYN specifically for the minority class
- Consider undersampling the majority class combined with oversampling the minority class
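A hedged sketch of the hyperparameter-tuning idea above; the grid values are illustrative starting points, not tuned choices.

# Sketch: grid search over C, solver, and class weighting for
# Logistic Regression, scored with macro F1.
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

lr_grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={
        'C': [0.1, 1.0, 10.0],                    # regularization strength
        'solver': ['lbfgs', 'liblinear', 'saga'],  # solvers mentioned above
        'class_weight': [None, 'balanced'],        # up-weight the minority class
    },
    scoring='f1_macro',
    cv=5,
    n_jobs=-1,
)
lr_grid.fit(X_train_tfidf, y_train)
print(lr_grid.best_params_)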
Conclusion¶
Logistic regression shows better overall balance than Naive Bayes, especially in its ability to detect the minority class (negative texts). While there's still room for improvement, this model represents a significant advancement in terms of balanced metrics across classes.
It would be advisable to optimize this model through calibration techniques and hyperparameter tuning before exploring more complex models like Random Forest or SVM.
# 7.4 Model 3: Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_model, rf_accuracy = evaluate_model(
    rf_model, X_train_tfidf, X_test_tfidf, y_train, y_test, "Random Forest"
)
--- Random Forest Results ---
Accuracy: 0.6330
Confusion Matrix:
[[ 19 122 31]
[ 84 501 41]
[ 11 140 220]]
Classification Report:
precision recall f1-score support
0 0.17 0.11 0.13 172
1 0.66 0.80 0.72 626
2 0.75 0.59 0.66 371
accuracy 0.63 1169
macro avg 0.53 0.50 0.51 1169
weighted avg 0.62 0.63 0.62 1169
Random Forest Model Analysis¶
The Random Forest model shows different behavior compared to previous models, with notable advantages and disadvantages:
Overall Evaluation¶
- Overall Accuracy: 63.30%
- Lower than previous models (NB: 68.35%, LR: 69.29%)
- Performance by class:
Negative Class (0):
- Precision: 17% - Significantly lower than Naive Bayes (64%) and Logistic Regression (38%)
- Recall: 11% - Better than Naive Bayes (4%) but inferior to Logistic Regression (13%)
- F1-score: 13% - Worse than Logistic Regression (20%) but better than Naive Bayes (8%)
Neutral Class (1):
- Precision: 66% - Lowest of the three models
- Recall: 80% - Lower than Naive Bayes (96%) and Logistic Regression (89%)
- F1-score: 72% - Lowest of all three models for this class
Positive Class (2):
- Precision: 75% - Comparable to Logistic Regression (76%)
- Recall: 59% - Between Naive Bayes (52%) and Logistic Regression (62%)
- F1-score: 66% - Slightly lower than Logistic Regression (68%)
- Confusion matrix analysis:
- Higher number of negative predictions (114 vs 61 in LR and 11 in NB)
- Less tendency to predict the neutral class (763 vs 804 in LR and 893 in NB)
- Significant confusion between neutral and negative classes (84 neutral texts classified as negative)
Key Observations¶
More even class distribution: Random Forest spreads its predictions more across the three classes, which reduces its overall accuracy but potentially offers a more nuanced view.
Weakness in precision for negative class: While it identifies more texts as negative, many are false positives (84 neutral + 11 positive), explaining the low precision (17%).
More conservative with neutral class: Less tendency to predict neutral compared to other models, which reduces recall but potentially offers higher specificity.
Why is Performance Lower?¶
Possible overfitting: Random Forest may be overfitting to specific features of the training set that don't generalize well.
Noisy features: The model might be giving importance to words or patterns that aren't truly discriminative for sentiment.
Class imbalance: Despite its ability to handle imbalanced data, this Random Forest appears affected by the imbalance, especially in the negative class.
Potential Improvements¶
- Hyperparameter tuning (a sketch follows this list):
- Modify `max_depth` to control overfitting
- Increase `n_estimators` to improve generalization
- Adjust `min_samples_leaf` to prevent overly specific splits
- Class weighting:
- Implement `class_weight='balanced'` or custom weights
- Experiment with built-in undersampling techniques
- Feature selection:
- Use Random Forest feature importance to eliminate noisy features
- Implement recursive feature elimination
- Ensemble with other models:
- Combine Random Forest predictions with Logistic Regression through voting or stacking
- Use Random Forest to generate features for a final Logistic Regression model
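A hedged sketch combining the hyperparameter and class-weighting suggestions above; the parameter values are illustrative, not tuned.

# Sketch: a Random Forest with depth control and built-in class weighting.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rf_tuned = RandomForestClassifier(
    n_estimators=300,                   # more trees for better generalization
    max_depth=40,                       # cap depth to limit overfitting
    min_samples_leaf=2,                 # avoid overly specific splits
    class_weight='balanced_subsample',  # reweight classes within each bootstrap
    random_state=42,
    n_jobs=-1,
)
rf_tuned.fit(X_train_tfidf, y_train)
print(classification_report(y_test, rf_tuned.predict(X_test_tfidf)))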
Conclusion¶
Random Forest shows interesting behavior in attempting to balance predictions more across classes, but its overall performance is inferior to Logistic Regression. However, it could be valuable as part of a model ensemble or after more extensive hyperparameter tuning.
Based on current results, Logistic Regression remains the most effective model among the three evaluated, offering the best balance between overall accuracy and performance for each class.
# 8. Model Comparison
model_comparison = pd.DataFrame({
    'Model': ['Naive Bayes', 'Logistic Regression', 'Random Forest'],
    'Accuracy': [nb_accuracy, lr_accuracy, rf_accuracy]
})
plt.figure(figsize=(10, 6))
ax = sns.barplot(x='Model', y='Accuracy', data=model_comparison, palette='viridis')
plt.title('Model Accuracy Comparison', fontsize=16)
plt.ylim(0, 1.0)
# Add accuracy values on top of bars
for p in ax.patches:
    ax.annotate(f'{p.get_height():.4f}',
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='bottom',
                fontsize=12)
plt.tight_layout()
plt.show()
# 9. Feature Importance (for the best model)
if lr_accuracy >= nb_accuracy and lr_accuracy >= rf_accuracy:
    best_model = "Logistic Regression"
    # Get feature importance from Logistic Regression coefficients.
    # Note: averaging signed coefficients across classes is a rough proxy,
    # since positive and negative contributions can partially cancel.
    feature_importance = pd.DataFrame({
        'feature': tfidf_vectorizer.get_feature_names_out(),
        'importance': lr_model.coef_[0] if not has_neutral else lr_model.coef_.mean(axis=0)
    })
elif rf_accuracy >= nb_accuracy and rf_accuracy >= lr_accuracy:
    best_model = "Random Forest"
    # Get feature importance from Random Forest
    feature_importance = pd.DataFrame({
        'feature': tfidf_vectorizer.get_feature_names_out(),
        'importance': rf_model.feature_importances_
    })
else:
    best_model = "Naive Bayes"
    # For Naive Bayes, use the absolute gap in per-class log probabilities
    # as a rough importance proxy
    feature_importance = pd.DataFrame({
        'feature': tfidf_vectorizer.get_feature_names_out(),
        'importance': np.abs(nb_model.feature_log_prob_[1] - nb_model.feature_log_prob_[0])
    })
# Sort features by importance
feature_importance = feature_importance.sort_values('importance', ascending=False)
# Plot top 20 important features
plt.figure(figsize=(12, 8))
sns.barplot(x='importance', y='feature', data=feature_importance.head(20))
plt.title(f'Top 20 Important Features ({best_model})', fontsize=16)
plt.tight_layout()
plt.show()
# 10.1 Sample Prediction Function
def predict_sentiment(text, model, vectorizer, sentiment_map_inv):
    """
    Preprocess, vectorize, and classify a single text; return the
    predicted label and the class probabilities.
    """
    # Preprocess the text
    cleaned = preprocess_text(text)
    # Vectorize
    text_vectorized = vectorizer.transform([cleaned])
    # Predict
    prediction = model.predict(text_vectorized)[0]
    proba = model.predict_proba(text_vectorized)[0]
    # Map back to sentiment label
    sentiment = sentiment_map_inv[prediction]
    return sentiment, proba
# Create inverse mapping
sentiment_map_inv = {v: k for k, v in sentiment_map.items()}
# Find the best model
if lr_accuracy >= nb_accuracy and lr_accuracy >= rf_accuracy:
    best_model_obj = lr_model
    best_model_name = "Logistic Regression"
elif rf_accuracy >= nb_accuracy and rf_accuracy >= lr_accuracy:
    best_model_obj = rf_model
    best_model_name = "Random Forest"
else:
    best_model_obj = nb_model
    best_model_name = "Naive Bayes"
# 10.2 Sample Predictions
sample_texts = [
    "The company reported strong earnings, beating analyst expectations with record revenue.",
    "The stock plummeted after the company announced significant losses in the last quarter.",
    "The market remained stable today with minor fluctuations across major indices."
]
print("\nSample Predictions:")
for text in sample_texts:
    sentiment, proba = predict_sentiment(text, best_model_obj, tfidf_vectorizer, sentiment_map_inv)
    print(f"\nText: {text}")
    print(f"Predicted Sentiment: {sentiment}")
    print("Class Probabilities:")
    for i, label in sentiment_map_inv.items():
        print(f"  {label}: {proba[i]:.4f}")
Sample Predictions:

Text: The company reported strong earnings, beating analyst expectations with record revenue.
Predicted Sentiment: positive
Class Probabilities:
  negative: 0.0516
  neutral: 0.1593
  positive: 0.7891

Text: The stock plummeted after the company announced significant losses in the last quarter.
Predicted Sentiment: positive
Class Probabilities:
  negative: 0.1384
  neutral: 0.3055
  positive: 0.5561

Text: The market remained stable today with minor fluctuations across major indices.
Predicted Sentiment: neutral
Class Probabilities:
  negative: 0.1749
  neutral: 0.5133
  positive: 0.3118
Feature Importance and Prediction Analysis¶
Most Influential Features (Logistic Regression)¶
The feature importance chart reveals significant patterns about which words have the greatest impact on sentiment classification:
Trend indicators:
- Words like "long", "increased", "grew" and "expands" have high importance, suggesting that terms related to growth are strong predictors of positive sentiment.
- "Lower" and "short" also appear as highly influential, likely as indicators of negative sentiment.
Corporate activity terms:
- "Signed", "buy", "acquisition" and "option" have significant importance, indicating that corporate activities like acquisitions and agreements carry weight in sentiment determination.
Financial state descriptors:
- "Strong" appears as an important feature, clearly indicating positive sentiment.
- "Total" and "price" also considerably influence predictions.
Sample Prediction Analysis¶
The sample predictions reveal both strengths and limitations of the model:
Correct positive case:
- "The company reported strong earnings, beating analyst expectations with record revenue."
- Prediction: Positive (78.9% confidence)
- The model correctly identifies positive terms like "strong earnings" and "beating expectations".
Misclassified negative case:
- "The stock plummeted after the company announced significant losses in the last quarter."
- Incorrect prediction: Positive (55.6% confidence)
- Problem detected: The model doesn't adequately capture the negative impact of words like "plummeted" and "losses", possibly because these terms didn't appear frequently enough in the training data.
Correctly classified neutral case:
- "The market remained stable today with minor fluctuations across major indices."
- Prediction: Neutral (51.3% confidence)
- The model correctly recognizes neutral descriptive language without strong positive or negative connotations.
These examples confirm that the model works reasonably well with positive and neutral texts but has significant difficulties with negative texts, which aligns with the observations from the confusion matrices.
Conclusions¶
Logistic Regression was the best-performing model, achieving an accuracy of 0.6929 on the test set, outperforming both Naive Bayes (0.6835) and Random Forest (0.6330).
Key Findings¶
Class imbalance: The dataset exhibits a strong imbalance (53.6% neutral, 31.7% positive, 14.7% negative), which significantly impacts model performance, especially for the minority class (negative).
Linguistic differences: We found distinct linguistic patterns between sentiment categories:
- Negative texts: Shorter and more direct, but difficult to identify due to their lower representation.
- Neutral texts: Longer, with descriptive and technical language.
- Positive texts: Moderate length, with specific terms related to growth and success.
Performance by class: All models showed uneven behavior:
- Excellent detection of neutral texts (recall ~80-96%)
- Good identification of positive texts (recall ~52-62%)
- Poor recognition of negative texts (recall ~4-13%)
Predictive terms: Words related to directionality ("long", "lower", "increased"), corporate activity ("signed", "buy", "acquisition") and financial valuation ("strong") proved to be the most powerful predictors.
Practical Applications¶
This sentiment analysis model can be valuable for:
- Market monitoring: Automated analysis of financial news to detect sentiment trends.
- Competitive intelligence: Tracking sentiment around competitors or specific sectors.
- Risk management: Early detection of sentiment changes that might indicate emerging risks.
- Investment assistance: Complement to fundamental and technical analyses with market sentiment data.
- Investor relations: Evaluation of corporate communication reception.
Limitations and Considerations¶
- Weakness in negative sentiment: The model has significant difficulties identifying negative texts, which could result in missing important alerts.
- Context and subtlety: The word-based approach may miss important nuances that require contextual understanding.
- Temporal bias: The model is trained on historical data and might not adapt to changes in financial language.
Next Steps¶
Model Improvements:
- Implement class balancing techniques (SMOTE, class_weight)
- Incorporate n-gram features to better capture context
- Experiment with more advanced models like BERT or FinBERT, a finance-specific model (see the sketch after this list)
- Develop an ensemble that leverages the strengths of different models
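As a taste of the FinBERT route, a minimal sketch using the Hugging Face transformers pipeline; this assumes transformers and torch are installed and uses the public ProsusAI/finbert checkpoint, neither of which is part of this notebook's environment.

# Sketch: sentiment with a pretrained FinBERT model.
# Assumes `pip install transformers torch`; labels are positive/negative/neutral.
from transformers import pipeline

finbert = pipeline('sentiment-analysis', model='ProsusAI/finbert')
print(finbert("The stock plummeted after the company announced significant losses."))
# Expected form: [{'label': 'negative', 'score': ...}]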
Application Development:
- Create an API for real-time sentiment analysis
- Develop a dashboard for trend monitoring
- Implement automated alerts for significant sentiment changes
Additional Analyses:
- Examine sentiment trends by sector or industry
- Correlate sentiment with subsequent market movements
- Analyze geographic variations in financial sentiment
- Study the temporal evolution of sentiment during economic events