A Guide to Writing Python Batch-Processing Scripts

This guide shows developers who need to process large amounts of data how to write efficient batch-labeling scripts in Python. It covers practical techniques such as data pre-processing, parallel computation, and memory optimization, so that users can handle more than 100,000 data points per day while improving efficiency and reducing resource consumption.

Why choose Python for batch labeling tasks?

If the thought of mountains of data waiting to be annotated makes you anxious, Python is the solution. Compared with other languages, it has a rich ecosystem of third-party libraries (such as Pandas and NumPy), and its simple syntax makes batch operations as easy as snapping together building blocks. More importantly, Python adapts flexibly to all kinds of data formats in common annotation scenarios such as text and images, which is especially friendly to projects that need frequent adjustments to their labeling rules.

Key steps in designing a script for real-world tasks.

Read data in chunks to avoid a memory blow-up.

If you try to read 100,000 lines of data into memory at once, you'll be stuck for minutes wondering what went wrong. Instead, use Pandas' chunksize parameter to load the data in chunks, or use a generator to process it line by line. With CSV files, for instance, the program can process each chunk as it is read, so memory usage stays stable.
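
Here is a minimal sketch of chunked reading; the file name reviews.csv and the label_chunk function are placeholders for your own data and logic:

```python
import pandas as pd

def label_chunk(chunk: pd.DataFrame) -> None:
    """Placeholder: apply your labeling logic to one chunk here."""
    print(f"processed {len(chunk)} rows")

# Read the CSV in 10,000-row chunks so only one chunk is in memory at a time.
for chunk in pd.read_csv("reviews.csv", chunksize=10_000):
    label_chunk(chunk)
```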

Multithreading / multiprocessing acceleration techniques.

A single-threaded process is like drinking milk tea through a thin straw: too slow! Opening a pool of workers with the concurrent.futures module can multiply your speed. But note that I/O-bound tasks should use multithreading, while CPU-bound tasks should use multiprocessing; choosing the wrong mode will actually make things slower.
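
A minimal sketch of both modes; fetch_label and score_text are stand-ins for your own I/O-bound and CPU-bound work:

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def fetch_label(item: int) -> str:
    """Stand-in for an I/O-bound call, e.g. an HTTP request."""
    time.sleep(0.1)  # simulates waiting on the network
    return f"label-{item}"

def score_text(n: int) -> int:
    """Stand-in for a CPU-bound computation."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    items = list(range(20))

    # I/O-bound: threads overlap the waiting time.
    with ThreadPoolExecutor(max_workers=8) as pool:
        labels = list(pool.map(fetch_label, items))

    # CPU-bound: separate processes sidestep the GIL.
    with ProcessPoolExecutor(max_workers=4) as pool:
        scores = list(pool.map(score_text, [100_000] * 8))

    print(labels[:3], scores[:3])
```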

Exception handling and logging.

The worst thing that can happen when a script runs overnight is an error that nobody recorded. Wrap the risky code in try-except to catch errors, and use the logging module to record the line number and the offending data. I also suggest having the program save its progress every 1,000 lines, so that after an interruption it can resume from where it left off.
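
A minimal sketch of this pattern, assuming a hypothetical label_line rule, an input file data.txt, and a JSON checkpoint file:

```python
import json
import logging

logging.basicConfig(
    filename="batch.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

CHECKPOINT = "progress.json"

def label_line(line: str) -> str:
    """Hypothetical labeling rule; replace with your own."""
    return "quality" if "quality" in line else "other"

def run(path: str, start: int = 0) -> None:
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f):
            if lineno < start:  # skip lines finished before a crash
                continue
            try:
                label_line(line)
            except Exception:
                # Record the line number and the offending content, then move on.
                logging.exception("failed at line %d: %r", lineno, line)
            if lineno % 1000 == 0:
                # Persist progress so a restart can resume from here.
                with open(CHECKPOINT, "w") as cp:
                    json.dump({"next_line": lineno}, cp)

if __name__ == "__main__":
    run("data.txt")
```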

Hidden techniques for optimizing performance.

Replace iteration with vectorization.

If you can use Pandas' apply, don't write a for loop! For example, when adding a label to a column of data, df['tag'] = df['content'].apply(label_rule) can be more than five times faster than processing rows one by one.
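
For a concrete toy version, assume a hypothetical label_rule that maps a comment to one tag:

```python
import pandas as pd

df = pd.DataFrame(
    {"content": ["fast shipping", "broken on arrival", "friendly support"]}
)

def label_rule(text: str) -> str:
    """Hypothetical rule mapping a comment to one tag."""
    if "shipping" in text:
        return "logistics"
    if "support" in text:
        return "service"
    return "quality"

# One column-wise pass instead of an explicit Python-level row loop.
df["tag"] = df["content"].apply(label_rule)
print(df)
```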

Use caching to avoid recomputation.

When you have to query an external database repeatedly, use functools.lru_cache to cache the query results. In one test, caching improved processing speed by 40%.
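
A minimal sketch, with a stub standing in for the real database query:

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def lookup_category(sku: str) -> str:
    """Hypothetical lookup; imagine an expensive database query here."""
    return "electronics" if sku.startswith("E") else "other"

print(lookup_category("E123"))  # computed on the first call
print(lookup_category("E123"))  # served from the in-memory cache
print(lookup_category.cache_info())  # shows hit/miss counts
```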

Choose an appropriate data structure.

Lists and dictionaries aren't a cure-all! If your labeling involves frequent membership lookups, try storing the label rules in a set, or use nested dictionaries for hierarchical rules.
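
A small illustration of both ideas; the keyword lists and rule layout are made up for the example:

```python
# Membership tests: a set is O(1) on average, a list is O(n).
spam_words = {"refund", "fake", "scam", "counterfeit"}
print("scam" in spam_words)  # fast hash lookup

# Nested dictionaries for hierarchical label rules.
rules = {
    "logistics": {"keywords": {"shipping", "delivery"}, "priority": 1},
    "service": {"keywords": {"support", "refund"}, "priority": 2},
}
print(rules["logistics"]["keywords"])
```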

Case study: labeling 150,000 e-commerce reviews.

Recently I helped a friend process 150,000 online reviews for an e-commerce site, labeling them into three categories: “quality,” “logistics,” and “service.” After loading the data into Pandas, the comments were tokenized and matched against a keyword pool with regular expressions, and multiprocessing was used to run four parallel worker processes. A job originally estimated at eight hours finished in just two, with memory use never exceeding 2 GB.
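
The following is a simplified reconstruction, not the original script: it assumes a reviews.csv file with a comment column, uses illustrative keyword pools, and substitutes a plain regex search for the tokenization step:

```python
import re
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

# Illustrative keyword pools; the real project used much larger lists.
PATTERNS = {
    "quality": re.compile(r"broken|defect|durable|material"),
    "logistics": re.compile(r"shipping|delivery|package|courier"),
    "service": re.compile(r"support|refund|reply|attitude"),
}

def label_review(text: str) -> str:
    for tag, pattern in PATTERNS.items():
        if pattern.search(text):
            return tag
    return "other"

def label_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    chunk["tag"] = chunk["comment"].astype(str).apply(label_review)
    return chunk

if __name__ == "__main__":
    # Stream the file in chunks and label them across four processes.
    chunks = pd.read_csv("reviews.csv", chunksize=10_000)
    with ProcessPoolExecutor(max_workers=4) as pool:
        labeled = pd.concat(pool.map(label_chunk, chunks))
    labeled.to_csv("labeled_reviews.csv", index=False)
```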

(P.S. If you're not on the latest hardware, test the script on 10,000 records first, tune the parameters, and only then run the full dataset.)