Getting Started with Natural Language Processing (NLP) in Python

news 2026/7/1 8:36:52

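Natural Language Processing (NLP) is an important branch of artificial intelligence concerned with enabling computers to understand and process human language. Thanks to its concise syntax and rich library ecosystem, Python has become one of the preferred languages for NLP. This article walks through the basics of NLP in Python to get you started.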

I. Applications of NLP

Before diving in, it helps to look at where NLP is used in practice, which shows why the field matters.

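Machine translation: automatically translating text from one language into another, as Google Translate does.
Sentiment analysis: determining whether a text expresses a positive, negative, or neutral attitude; commonly used for social media monitoring and market analysis.
Speech recognition: converting speech to text, as in voice assistants (Siri, Alexa).
Text generation: producing text automatically, as in chatbots and article generators.
Information extraction: pulling key information such as person names, place names, and dates out of large volumes of text.
Question answering: answering user questions automatically, as in customer-service bots.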

II. Common Python NLP Libraries

Python has several mature libraries for natural language processing tasks. Some of the most commonly used are:

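NLTK (Natural Language Toolkit): a leading platform for building Python programs that work with human language data.
spaCy: an open-source NLP library for advanced tasks such as named entity recognition and dependency parsing.
TextBlob: a simple, easy-to-use library for processing text data, with features such as sentiment analysis.
Gensim: mainly used for topic modeling and document similarity analysis.
Transformers: developed by Hugging Face; provides pretrained models such as BERT and GPT for a wide range of NLP tasks.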

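III. Installing the Required Libraries

Before starting, make sure these libraries are installed. The text classification example later in this article also uses scikit-learn, so it is included in the command below:

pip install nltk spacy textblob gensim transformers scikit-learn

For spaCy, you also need to download a language model: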

python -m spacy download en_core_web_sm
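
Before moving on, it can help to confirm that the packages are actually importable. The following is a minimal sketch that uses only the standard library; the helper name `check_packages` is a hypothetical name introduced here, and the list contains import names, which can differ from the names used with pip.

```python
from importlib.util import find_spec

# Import names of the libraries introduced in this article.
PACKAGES = ["nltk", "spacy", "textblob", "gensim", "transformers"]


def check_packages(names):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, ok in check_packages(PACKAGES).items():
        print(f"{name}: {'installed' if ok else 'MISSING'}")
```

Any package reported as MISSING can be installed with pip before running the examples that follow.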

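IV. NLP Basics: Core Tasks

(1) Text Preprocessing

Text preprocessing is an important step in NLP. It includes operations such as removing noise and normalizing text.

1. Tokenization

Tokenization splits text into words or sentences.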

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download the NLTK tokenizer data (newer NLTK releases look for 'punkt_tab')
nltk.download('punkt')
nltk.download('punkt_tab')

text = "Hello, world! This is a test. Natural language processing is fun."

# Word tokenization
words = word_tokenize(text)
print(words)  # Output: ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.', 'Natural', 'language', 'processing', 'is', 'fun', '.']

# Sentence tokenization
sentences = sent_tokenize(text)
print(sentences)  # Output: ['Hello, world!', 'This is a test.', 'Natural language processing is fun.']
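
To make it clearer what a tokenizer does, here is a minimal regular-expression sketch that needs no NLTK data. It is only an illustration, not how `word_tokenize` is implemented; the function name `simple_tokenize` is an assumption introduced here.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Hello, world! This is a test.")
print(tokens)
# ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']
```

A rule this simple breaks abbreviations apart (for example, "U.K." becomes four tokens), which is one reason to prefer a library tokenizer for real text.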

2. Stop Words Removal

Stop words are words that occur frequently in text but contribute little to its meaning, such as "the", "is", and "a" in English.

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

# Reuses `words` from the tokenization example; the comparison is case-insensitive
filtered_words = [word for word in words if word.lower() not in stop_words]
print(filtered_words)  # Output: ['Hello', ',', 'world', '!', 'test', '.', 'Natural', 'language', 'processing', 'fun', '.']
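
The filtered list above still contains punctuation tokens, because a stop word list only covers words. If a later step needs words only, an extra check can be added. The sketch below uses a small hand-picked stop list as a stand-in for `stopwords.words('english')` so that it runs without NLTK data; the list contents are an assumption made for illustration.

```python
# A small hand-picked stop list, standing in for stopwords.words('english').
STOP_WORDS = {"this", "is", "a", "the", "and", "of"}

tokens = ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']

# Keep tokens that are purely alphabetic and not stop words (case-insensitive).
content_words = [t for t in tokens if t.isalpha() and t.lower() not in STOP_WORDS]
print(content_words)  # ['Hello', 'world', 'test']
```

Note that `str.isalpha()` also drops numbers and hyphenated words, which may or may not be what a given task needs.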

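3. Stemming and Lemmatization

Stemming strips affixes by rule to reduce a word to its stem, which may not be a real word (for example, "language" becomes "languag"). Lemmatization maps a word to its dictionary form, known as the lemma.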

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Reuses `filtered_words` from the stop words example.
# NLTK's PorterStemmer lowercases its output by default.
stemmed_words = [stemmer.stem(word) for word in filtered_words]
print(stemmed_words)  # Output: ['hello', ',', 'world', '!', 'test', '.', 'natur', 'languag', 'process', 'fun', '.']

# Without a part-of-speech argument, lemmatize() treats every word as a noun
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
print(lemmatized_words)  # Output: ['Hello', ',', 'world', '!', 'test', '.', 'Natural', 'language', 'processing', 'fun', '.']
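
The difference between the two techniques can be seen in a toy sketch: a rule-based suffix stripper versus a dictionary lookup. This is not the Porter algorithm and not WordNet; the suffix list, the lemma table, and the function names are all assumptions made for illustration.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

# Tiny hand-written lemma table; a real lemmatizer consults a full dictionary.
LEMMAS = {"studies": "study", "running": "run", "better": "good"}


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def naive_lemma(word):
    """Look the word up in the table; fall back to the lowercased word."""
    return LEMMAS.get(word.lower(), word.lower())


for w in ["studies", "running", "better"]:
    print(w, naive_stem(w), naive_lemma(w))
# studies studi study
# running runn run
# better better good
```

The stems "studi" and "runn" are not real words, while the lookup returns proper dictionary forms but only for words it knows about.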

(2) Sentiment Analysis

Sentiment analysis is an important NLP application that determines the emotional leaning of a text.

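from textblob import TextBlob

text = "I love this product! It is amazing."
blob = TextBlob(text)

# Get the sentiment analysis result
sentiment = blob.sentiment
# Prints a Sentiment(polarity=..., subjectivity=...) named tuple.
# Polarity ranges from -1 (negative) to 1 (positive); subjectivity from 0 to 1.
# A clearly positive text like this one should yield a polarity above 0.
print(sentiment)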
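
TextBlob's default analyzer is lexicon-based: words carry predefined scores that are combined into an overall polarity. The sketch below illustrates that general idea in a heavily simplified form. It is not TextBlob's implementation, and the word scores and the `polarity` function are hypothetical values and names chosen for this example.

```python
import re

# Hypothetical word scores in [-1, 1]; a real lexicon has thousands of entries.
LEXICON = {"love": 0.5, "amazing": 0.6, "bad": -0.7, "sad": -0.5}


def polarity(text):
    """Average the scores of the words found in the lexicon (0.0 if none)."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


print(polarity("I love this product! It is amazing."))  # a positive value
print(polarity("This is a bad product."))  # a negative value
```

A plain average like this ignores negation ("not bad") and intensifiers ("very"), which real analyzers handle with additional rules.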

(3) Named Entity Recognition (NER)

Named entity recognition identifies named entities in text, such as person names, place names, and dates.

import spacy

# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

text = "Apple is looking at buying U.K. startup for $1 billion."
doc = nlp(text)

# Extract named entities, one per line
for ent in doc.ents:
    print(ent.text, ent.label_)
# Typical output (may vary with the model version):
# Apple ORG
# U.K. GPE
# $1 billion MONEY
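
Once entities are extracted, a common next step is to group them by type. The sketch below works on plain (text, label) pairs so that it runs without a spaCy model; the sample pairs mirror the entities shown above, and `group_by_label` is a hypothetical helper name.

```python
from collections import defaultdict


def group_by_label(entities):
    """Group (text, label) pairs into a dict of label -> list of texts."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


# Sample pairs in the shape of [(ent.text, ent.label_) for ent in doc.ents].
pairs = [("Apple", "ORG"), ("U.K.", "GPE"), ("$1 billion", "MONEY")]
print(group_by_label(pairs))
# {'ORG': ['Apple'], 'GPE': ['U.K.'], 'MONEY': ['$1 billion']}
```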

(4) Text Classification

Text classification assigns a text to one of a set of predefined categories.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Prepare the data
texts = ["I love this product!", "This is a bad product.", "I am very happy.", "I am sad."]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Vectorize the text (bag-of-words counts)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Train the model
model = MultinomialNB()
model.fit(X, labels)

# Predict on new text
new_text = ["I am very excited."]
new_X = vectorizer.transform(new_text)
prediction = model.predict(new_X)
print(prediction)  # Output: [1]
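
To show what the classifier is doing under the hood, here is a from-scratch sketch of multinomial Naive Bayes with Laplace smoothing, using only the standard library. It is a simplified illustration, not scikit-learn's implementation; the function names and the tokenization rule (lowercased words of two or more characters) are assumptions chosen to roughly mirror the example above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased words of two or more characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def train(texts, labels):
    """Count words per class and documents per class."""
    word_counts = {label: Counter() for label in set(labels)}
    doc_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def predict(text, word_counts, doc_counts, vocab, alpha=1.0):
    """Return the class with the highest log posterior (Laplace smoothing)."""
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                score += math.log(
                    (counts[word] + alpha) / (total_words + alpha * len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)


texts = ["I love this product!", "This is a bad product.", "I am very happy.", "I am sad."]
labels = [1, 0, 1, 0]
nb_model = train(texts, labels)
print(predict("I am very excited.", *nb_model))  # 1
```

The word "very" appears only in a positive training sentence, which is what tips the prediction toward class 1; "excited" was never seen in training and is ignored.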