The complete path from data collection to visualization

Starting from the actual needs of users, the book dissects the entire process of data collection, cleaning and processing, and visualization, covering common methods of data collection, high-efficiency processing techniques, and recommended visualization tools. This helps users quickly realize the value of their data and present it to others.

Why is a complete path needed?

When many people first begin to use data analysis, they often feel that "collecting data is troublesome," "processing it is a headache," or "I can't make a nice graph." As long as you follow through with the process and hit the right steps at the right time, it's not nearly as complicated as it seems. Today we're going to break it down and tell you how to turn data into intuitive charts, and what pitfalls to avoid along the way.

The three main techniques for collecting data.

Manual entry is suitable for small amounts of data.

For example, collecting questionnaires from users or compiling sales records in Excel spreadsheets. It may take a bit more effort, but the advantage is that it is flexible and controllable, which makes it particularly suitable for beginners' small projects.

Scraping tools save time and effort.

Octoparse and WebScraper are tools that allow you to scrape web data without writing code. Recently I helped a friend extract pricing data from an e-commerce platform. It took me half an hour to do a job that used to take me an entire day.

APIs are the high-end option.

For example, you can use the Requests library in Python to access public APIs, which lets you fetch weather forecasts, stock quotes, and other dynamic data at set times. I remember last year, when I was visualizing epidemic data, I relied on this for real-time updates on confirmed cases in different areas.
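Here is a minimal sketch of that pattern, using the free Open-Meteo forecast API; the endpoint, parameters, and response fields are assumptions based on its public docs and may change:

```python
import requests

# Query a public weather API (Open-Meteo, no API key needed)
# for current conditions at a given latitude/longitude.
URL = "https://api.open-meteo.com/v1/forecast"
params = {
    "latitude": 31.23,        # e.g. Shanghai
    "longitude": 121.47,
    "current_weather": "true",
}

resp = requests.get(URL, params=params, timeout=10)
resp.raise_for_status()                     # fail loudly on HTTP errors
weather = resp.json()["current_weather"]    # field name per Open-Meteo docs
print(weather["temperature"], weather["windspeed"])
```

To pull data at set times, run a script like this under cron or any task scheduler.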

Data cleansing: The critical step.

Deduplication: we can't afford to miss a single one.

Have you ever encountered a situation where the same mobile phone number appeared ten or more times in a customer list? With Excel's Remove Duplicates feature, or Pandas' drop_duplicates() method, you can solve the problem in a minute or two.
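In Pandas that looks something like this (the column names and values here are invented for illustration):

```python
import pandas as pd

# A customer list where the same phone number appears more than once.
customers = pd.DataFrame({
    "name":  ["Alice", "Alice", "Bob", "Carol"],
    "phone": ["555-0101", "555-0101", "555-0199", "555-0142"],
})

# Keep the first occurrence of each phone number and drop the rest.
deduped = customers.drop_duplicates(subset="phone", keep="first")
print(deduped)
```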

Standardization is very important.

Some write the date as 2023-08-01, and some write it as 8/1/23. This kind of inconsistent formatting can make subsequent analysis a mess. Use Python's datetime module to convert dates into a standard format, or use Excel's Text to Columns feature.
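A minimal sketch with the datetime module, assuming you know the handful of formats that appear in your data:

```python
from datetime import datetime

raw_dates = ["2023-08-01", "8/1/23"]
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%y"]   # the formats we expect to see

def standardize(value: str) -> str:
    """Try each known format and return an ISO 8601 date string."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

print([standardize(d) for d in raw_dates])   # ['2023-08-01', '2023-08-01']
```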

Dealing with outliers is a tricky business.

Last week, when we were analyzing the ages of our subscribers, we found that one person was listed as 200 years old. Obviously, that was a data-entry error. In cases like this, don't just delete the value right away. First check with the business department to see whether it is an error or a special flag.
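One way to do that in Pandas is to flag suspicious values for review instead of deleting them; the plausible age range here is an assumption you would confirm with the business side:

```python
import pandas as pd

subscribers = pd.DataFrame({
    "user_id": [101, 102, 103, 104],
    "age":     [34, 27, 200, 45],    # 200 is almost certainly an entry error
})

# Flag values outside a plausible range rather than dropping them,
# so someone can confirm whether each is an error or a special flag.
subscribers["age_suspect"] = ~subscribers["age"].between(0, 120)
print(subscribers[subscribers["age_suspect"]])
```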

Let the Data Speak for Itself: Practical Techniques of Data Visualization

Choosing the right chart is half the battle.

Bar graphs are used for comparisons, line graphs for trends, and pie charts for percentages. Recently I used Power BI to compare sales regions, and the dynamic map feature let the boss see the market distribution at a glance.
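If you work in Python rather than Power BI, a quick matplotlib sketch shows the first two pairings; the sales figures are invented:

```python
import matplotlib.pyplot as plt

regions = ["North", "South", "East", "West"]
sales = [120, 95, 140, 80]            # comparison across categories -> bar
months = ["Jan", "Feb", "Mar", "Apr"]
trend = [100, 110, 125, 140]          # change over time -> line

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
ax1.bar(regions, sales)
ax1.set_title("Sales by region")
ax2.plot(months, trend, marker="o")
ax2.set_title("Monthly trend")
plt.tight_layout()
plt.show()
```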

Color schemes are a minefield.

Have you ever seen a disaster scene caused by the use of a red-green color scheme in a financial chart? I recommend using the color-blind-friendly color schemes included in Tableau, or going to Adobe Color to find professionally designed color schemes.
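If you chart in Python, matplotlib ships a similar safeguard: the built-in tableau-colorblind10 style. A quick sketch with made-up figures:

```python
import matplotlib.pyplot as plt

# A palette bundled with matplotlib that stays readable
# for viewers with red-green color blindness.
plt.style.use("tableau-colorblind10")

plt.plot([1, 2, 3, 4], [3, 1, 2, 4], label="Revenue")
plt.plot([1, 2, 3, 4], [2, 3, 1, 3], label="Cost")
plt.legend()
plt.show()
```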

Interaction design enhances the experience.

Interactive reports made with ECharts let users drill down into the data themselves, making them much more useful than static reports. The last time I made a dashboard for the operations team, adding a time axis to filter the data doubled their analysis efficiency.
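One way to get that time-axis filter from Python is pyecharts, a Python wrapper for ECharts; this is a sketch with invented numbers, assuming the pyecharts v1 API:

```python
from pyecharts import options as opts
from pyecharts.charts import Line

dates = ["2023-08-01", "2023-08-02", "2023-08-03", "2023-08-04"]
dau = [1520, 1610, 1480, 1750]   # invented daily-active-user counts

chart = (
    Line()
    .add_xaxis(dates)
    .add_yaxis("Daily active users", dau)
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Operations dashboard"),
        # A slider lets viewers zoom into any date range themselves.
        datazoom_opts=[opts.DataZoomOpts(type_="slider")],
    )
)
chart.render("dashboard.html")   # open the generated HTML in a browser
```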

Answers to common questions.

Recently I have received a lot of private messages asking, "With all these tools, how do I choose?" Based on my own experience, a beginner should start with Excel + Power BI, then switch to Python + Tableau when the amount of data grows. The key is to get the entire process running first, and then optimize the details. A friend asked me whether you need to learn programming. Many tools can be used with zero code, but knowing a bit of Python really does make data processing more flexible.