
P.4 Text Statistics Tool

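I. Features

Read a specified text file and report the character count (with and without spaces and punctuation), the word count, the line count, and the ten most frequent words (Top 10).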

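II. Training focus

Reading large files with ifstream; traversing and processing string; counting word frequencies with map<string, int>; sorting a map by value; STL algorithms (count/replace); filtering stop words (is, a, an, the, and so on); supporting statistics over multiple files; writing the results to a new file.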
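
As a small illustration of the two STL algorithms named above, the sketch below uses std::count to count the spaces in a line and std::replace to turn one chosen punctuation character into a space. The helper names count_spaces and blank_out are made up for this example and do not appear in the tool's full code.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: number of space characters in one line.
int count_spaces(const std::string& line) {
    return static_cast<int>(std::count(line.begin(), line.end(), ' '));
}

// Hypothetical helper: copy of `line` with every `mark` replaced by a space.
std::string blank_out(std::string line, char mark) {
    std::replace(line.begin(), line.end(), mark, ' ');
    return line;
}
```

Replacing punctuation with spaces first is one possible way to simplify word splitting; the tool itself skips delimiters while scanning instead.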

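III. My approach

The task is split into three parts: (1) read the text; (2) sort the words by frequency from high to low; (3) output the statistics. The driver function run executes them in the same order.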

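1) Reading the text

  1. Read the file line by line with getline, and analyze each line text as soon as it is read.
  2. Convert every uppercase letter in the line to lowercase, then walk the string with an index idx:
  • if text[idx] is a space or a punctuation mark, skip it with idx++;
  • otherwise text[idx] starts a word, which is read with the read_word function:
    • if the word is on the filter list (see the code for the list), ignore it;
    • otherwise increment that word's occurrence count.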
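
The scanning loop described above can be sketched in isolation as follows. This is a minimal version under stated assumptions, not the tool's own code: it treats every character that std::isalnum rejects as a delimiter (so non-ASCII bytes count as delimiters here), it counts each character exactly once, it counts stop words toward the word total, and the names LineStats and scan_line are invented for the example.

```cpp
#include <cctype>
#include <map>
#include <string>
#include <unordered_set>

// Hypothetical container for the per-line results.
struct LineStats {
    int chars_total = 0;    // every character in the line
    int chars_in_words = 0; // characters that belong to a word
    int words = 0;          // all words, stop words included
};

// Scan one line: skip delimiters, lower-case each word,
// and record it in `freq` unless it is a stop word.
LineStats scan_line(const std::string& line,
                    const std::unordered_set<std::string>& stop_words,
                    std::map<std::string, int>& freq) {
    LineStats stats;
    stats.chars_total = static_cast<int>(line.size());
    std::size_t i = 0;
    while (i < line.size()) {
        if (!std::isalnum(static_cast<unsigned char>(line[i]))) {
            ++i; // space or punctuation: skip it
            continue;
        }
        std::string word;
        while (i < line.size() &&
               std::isalnum(static_cast<unsigned char>(line[i]))) {
            word += static_cast<char>(
                std::tolower(static_cast<unsigned char>(line[i])));
            ++i;
        }
        stats.chars_in_words += static_cast<int>(word.size());
        ++stats.words;
        if (stop_words.count(word) == 0) ++freq[word];
    }
    return stats;
}
```

Calling scan_line once per getline result and summing the returned LineStats would give whole-file totals.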

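2) Sorting the words by frequency from high to low

For this step I simply use sort from the STL, after copying the map entries into a vector (a map cannot be reordered by value).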
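
Because std::map is ordered by key, sorting by count means copying the entries into a sequence first. A minimal sketch follows; the function name top_words and the alphabetical tie-break are assumptions made for the example. The tool's own comparator looks at the count only, so the relative order of words with equal counts is unspecified there.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: the `limit` most frequent words, highest count first.
std::vector<std::pair<std::string, int>>
top_words(const std::map<std::string, int>& freq, std::size_t limit) {
    std::vector<std::pair<std::string, int>> items(freq.begin(), freq.end());
    std::sort(items.begin(), items.end(),
              [](const auto& a, const auto& b) {
                  if (a.second != b.second) return a.second > b.second; // higher count first
                  return a.first < b.first;                             // ties: alphabetical
              });
    if (items.size() > limit) items.resize(limit); // keep only the top entries
    return items;
}
```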

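3) Outputting the statistics

The output has 5 + n lines, where n = min(num, 10) and num is the number of distinct words counted. The first four lines are the character count (including spaces and punctuation), the character count (excluding spaces and punctuation), the word count, and the line count. The fifth line is the heading High-frequency words(Top 10), and the following n lines list the most frequent words (fewer than 10 if the text does not contain that many).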
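
The layout above can be built as a string first, which makes it easy to check before writing it to a file. A sketch, assuming the word list is already sorted by frequency; the function name format_report and its parameter list are invented for the example, while the labels are copied from the tool's output.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: build the 5 + n line report, n = min(number of words, 10).
std::string format_report(int chars_with, int chars_without, int words, int lines,
                          const std::vector<std::pair<std::string, int>>& sorted_words) {
    std::ostringstream out;
    out << "character count including spaces: " << chars_with << '\n'
        << "character count excluding spaces: " << chars_without << '\n'
        << "Totally words: " << words << '\n'
        << "Totally lines: " << lines << '\n'
        << "High-frequency words(Top 10)\n";
    const std::size_t n = std::min<std::size_t>(sorted_words.size(), 10);
    for (std::size_t i = 0; i < n; ++i) {
        out << sorted_words[i].first << ": " << sorted_words[i].second << '\n';
    }
    return out.str();
}
```

The returned string can then be streamed into an ofstream in one statement.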

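IV. Notes

  1. The code requires C++20 or later (it uses the <ranges> header and range operations);
  2. Only all-English text is supported; the input must not contain any Chinese (I wanted to add statistics for Chinese text, but it is beyond my current ability);
  3. No line may end with an incomplete word: a word that does not fit on a line must not be split with a hyphen and continued on the next line, otherwise the words read are incomplete;
  4. Compound words are not recognized;
  5. The code may contain unknown bugs and room for optimization. If you find any, please let the author know: the author is a university student who wants to improve.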
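
One possible way around the limitation in note 3 is to merge such lines before counting words. The sketch below is an assumption-laden illustration and not part of the tool: the name join_hyphenated is invented, every line-final '-' is treated as a split word (so a genuine hyphenated compound broken at a line end loses its hyphen), and merging changes the number of lines, so the line count would have to be taken from the original input.

```cpp
#include <string>
#include <vector>

// Hypothetical pre-processing step: if a line ends with '-', drop the hyphen
// and prepend that line's text to the following line.
std::vector<std::string> join_hyphenated(const std::vector<std::string>& lines) {
    std::vector<std::string> result;
    std::string carry; // text carried over from a line that ended with '-'
    for (const std::string& line : lines) {
        std::string current = carry + line;
        carry.clear();
        if (!current.empty() && current.back() == '-') {
            current.pop_back(); // remove the hyphen and wait for the next line
            carry = current;
        } else {
            result.push_back(current);
        }
    }
    if (!carry.empty()) result.push_back(carry); // the last line ended with '-'
    return result;
}
```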

V. Demo

Article (for some reason, each paragraph is displayed on a single line):

Food and Health: The Foundation of a Good Life
"You are what you eat" is an old saying that holds more truth today than ever before. Our diet is not just about satisfying hunger—it directly shapes our physical health, energy levels, mental clarity, and even our long-term lifespan. In a world dominated by fast food and ultra-processed snacks, understanding the connection between what we eat and how we feel has become essential.
A balanced diet is the cornerstone of good health. It does not mean strict restrictions or giving up all the foods we love. Instead, it means eating a variety of nutrient-dense foods in the right proportions. This includes plenty of colorful fruits and vegetables, which are packed with vitamins, minerals, and antioxidants that protect our bodies from diseases. Whole grains like brown rice and oats provide sustained energy, while lean proteins such as fish, chicken, and beans help build and repair our muscles. Healthy fats from nuts, avocados, and olive oil are crucial for brain function and heart health.
Unfortunately, modern lifestyles have led many people to rely heavily on processed foods. These convenient options are often loaded with added sugars, salt, and unhealthy trans fats, but lack essential nutrients. Regular consumption of these foods can lead to a range of health problems, including obesity, heart disease, type 2 diabetes, and high blood pressure. They can also cause energy crashes, leaving us feeling tired and sluggish throughout the day.
What many people do not realize is that diet also has a profound impact on our mental health. Research has shown that a diet rich in whole foods can reduce the risk of depression and anxiety. The gut is often called our "second brain," and the food we eat affects the production of neurotransmitters like serotonin, which regulates our mood. On the contrary, a diet high in sugar and processed foods can disrupt this balance and worsen mood swings.
Making small, sustainable changes to your eating habits can have a huge impact on your overall health. Start by adding one extra serving of vegetables to your meals each day. Swap sugary drinks for water or herbal tea. Try cooking at home more often, so you can control the ingredients in your food. Remember, healthy eating is a journey, not a destination. It is okay to enjoy your favorite treats occasionally, as long as you maintain balance most of the time.
In conclusion, the food we choose to eat is one of the most powerful tools we have for taking care of our health. By nourishing our bodies with wholesome foods, we can increase our energy, improve our mood, and reduce the risk of chronic diseases. A healthy diet is not just about living longer—it is about living better.

Output statistics:

character count including spaces: 3193
character count excluding spaces: 2212
Totally words: 151
Totally lines: 7
High-frequency words(Top 10)
our: 12
health: 8
foods: 7
diet: 6
food: 5
not: 5
your: 5
energy: 4
eat: 4
healthy: 3

VI. Code

#include <fstream>
#include <map>
#include <string>
#include <vector>
#include <algorithm>
#include <cctype>
#include <iostream>
#include <stdexcept>
#include <unordered_set>
#include <ranges>
using namespace std;

class Text_Statistics {
private: // member variables
    int chars_with_spaces, chars_without_spaces;    // character counts (with / without spaces and punctuation)
    map<string, int> words_frequency;               // word frequencies
    vector<pair<string, int>> high_frequency_words; // words sorted by frequency
    int word_count, line_count;                     // number of words, number of lines
    unordered_set<char> punctuation;                // space and punctuation characters

private: // helper functions
    /*----- read one word -----*/
    void read_word(const string& text, int& i)
    {
        const int SIZE = text.size();
        string word;
        while (i < SIZE)
        {
            chars_with_spaces++; // note: the delimiter that ends the word is counted here and again by the caller
            if (punctuation.count(text[i])) break; // hit a space or punctuation mark
            word += text[i];
            i++;
            chars_without_spaces++;
        }
        if (filter(word)) words_frequency[word]++;     // record the word
        if (words_frequency.count(word)) word_count++; // only words present in the map are counted
    }

    /*----- filter -----*/
    bool filter(const string& word)
    {
        // reject empty strings
        if (word.empty()) return false;
        // reject common function words
        static const unordered_set<string> stop_words = {
            "a", "an", "the", "am", "is", "are", "was", "were",
            "be", "been", "being", "have", "has", "had", "do", "does", "did",
            "will", "would", "shall", "should", "may", "might", "can", "could",
            "of", "in", "on", "at", "to", "for", "with", "by", "about", "as",
            "but", "and", "or", "so", "if", "because", "when", "where", "which",
            "that", "this", "these", "those", "he", "she", "it", "we", "you", "they",
            "i", "me"
        };
        return !stop_words.count(word);
    }

    /*----- sort by frequency -----*/
    void sort_by_value()
    {
        high_frequency_words.assign(words_frequency.begin(), words_frequency.end());
        sort(high_frequency_words.begin(), high_frequency_words.end(),
            [](const pair<string, int>& a, const pair<string, int>& b)
            {
                return a.second > b.second;
            });
    }

private: // core functions
    /*----- load the text -----*/
    void load_from_local()
    {
        ios::sync_with_stdio(false);
        cin.tie(nullptr);
        ifstream ifs;
        ifs.open("Text.txt", ios::in);
        if (!ifs.is_open()) throw runtime_error("Error: Failed to open file 'Text.txt'!");
        string buf;
        while (getline(ifs, buf))
        {
            line_count++;
            int idx = 0;
            const int SIZE = buf.size();
            // convert all uppercase letters to lowercase
            auto lowerView = buf | views::transform([](unsigned char c)
                {
                    return tolower(c);
                });
            string lowerStr(lowerView.begin(), lowerView.end());
            // read the words
            while (idx < SIZE)
            {
                const char character = buf[idx];
                // hit a space or punctuation mark
                if (punctuation.count(character))
                {
                    idx++;
                    chars_with_spaces++;
                }
                else read_word(lowerStr, idx);
            }
        }
        ifs.close();
    }

    /*----- save the statistics -----*/
    void save_to_local()
    {
        ofstream ofs;
        ofs.open("Result.txt", ios::out);
        ofs << "character count including spaces: " << chars_with_spaces << endl;    // characters (with spaces and punctuation)
        ofs << "character count excluding spaces: " << chars_without_spaces << endl; // characters (without spaces and punctuation)
        ofs << "Totally words: " << word_count << endl; // word count (words removed by the filter are not counted)
        ofs << "Totally lines: " << line_count << endl; // line count
        ofs << "High-frequency words(Top 10)" << endl;  // most frequent words (top 10)
        for (const auto& p : high_frequency_words | views::take(10))
        {
            ofs << p.first << ": " << p.second << endl;
        }
        ofs.close();
    }

public:
    /*----- constructor -----*/
    Text_Statistics()
    {
        chars_with_spaces = chars_without_spaces = 0;
        word_count = line_count = 0;
        // register the space and the ASCII punctuation characters
        for (int i = 32; i <= 47; i++) punctuation.insert(i);
        for (int i = 58; i <= 64; i++) punctuation.insert(i);
        for (int i = 91; i <= 96; i++) punctuation.insert(i);
        for (int i = 123; i <= 126; i++) punctuation.insert(i);
    }

    /*----- interface -----*/
    void run()
    {
        load_from_local(); // load first
        sort_by_value();   // then sort
        save_to_local();   // then output
    }
};

int main()
{
    try
    {
        Text_Statistics text_statistics;
        text_statistics.run();
        cout << "Statistics completed successfully!" << endl << "Results saved to Result.txt!" << endl;
    }
    catch (const runtime_error& e)
    {
        cerr << e.what() << endl;
        return 1;
    }
    return 0;
}