Many developers say that since Python/Pandas, Excel has not been used much, and it is very fast to process and visualize tables.
Let me give you a few examples.
1. Delete duplicate and empty lines
We use dict.fromkeys directly to convert the current data into a dictionary. The default value is None, because it is not used, so it doesn't matter. Then we use list to convert the result directly to list.
In [135]: for row in rows4: print(row) ('name', 'address') ('tom li', 'beijing') ('tom li', 'beijing') ('',) ('mary wang', 'shandong') ('mary wang', 'shandong') ('',) ('de8ug', 'guangzhou') In [148]: dict.fromkeys(rows4) Out[148]: {('name', 'address'): None, ('tom li', 'beijing'): None, ('',): None, ('mary wang', 'shandong'): None, ('de8ug', 'guangzhou'): None} In [137]: list(dict.fromkeys(rows4)) Out[137]: [('name', 'address'), ('tom li', 'beijing'), ('',), ('mary wang', 'shandong'), ('de8ug', 'guangzhou')]
At this time, duplicate data is removed directly. Note that our dict here is a new version of Python 3, so the order is not affected. If you are still using Python 2 or Python 3.5 or below, it is recommended to upgrade the python version.
Next, the empty data is processed. Observation (',) is a tuple, the first position of the data is an empty string, then the overall length of 1, can be directly removed through the loop. Here we can use the grammatical sugar in Python to write the loop directly in one line. Finally, we add a judgment that only the length is greater than 1. Finally, we use list to convert to list.
In [179]: list(x for x in dict.fromkeys(rows4) if len(x[0])>1) Out[179]: [('name', 'address'), ('tom li', 'beijing'), ('mary wang', 'shandong'), ('de8ug', 'guangzhou')]
The above research is done, and the results of the research are directly put into the function to solve the problem of duplicate lines and blank lines.
Note that at this point we're dealing with row data, so we don't loop by column anymore. Moreover, when processed in the current sheet, the contents of each line are changed or deleted. So we use old_rows= [x for X in sheet. values] to fetch the data for each old row. Note that the sheet here fetches the data directly with values, not the cell object. Here old_rows is a list, so you can use the previous research to delete duplicate and empty data directly.
Next, use sheet.delete_rows(1, sheet.max_row)
Delete all rows. The first parameter starts at the first row and the second parameter is the maximum number of rows. Finally, the new data is written to the current sheet by cycling the new row data.
In [189]: def handle_duplicate(wb, sheetname): """ //Remove duplicate lines, blank lines //Take out each line first, empty the sheet, and write back after processing. """ print(f'Start processing the worksheet:{sheetname}'.center(18, '-')) sheet = wb[sheetname] old_rows = [x for x in sheet.values] print('Before amendment:', old_rows) new_rows = list(x for x in dict.fromkeys(old_rows) if len(x[0])>1) print('After modification->>', new_rows) # Delete all rows sheet.delete_rows(1, sheet.max_row) # Write new data for row in new_rows: sheet.append(row)
Run the test to see the results. Again, remember to test! If there are any errors, check the code according to the error prompt, debug repeatedly and remove the bugs.
In [190]: wb = load_data() handle_duplicate(wb, 'Repeated rows') save_as(wb)
2. Delete spaces
String functions are also needed to delete spaces, so here's a brief study. If we want to remove the space in the middle of the string, we can split it by default, and then connect the results of the segmentation with the'. join method. Notice that the string before the join is empty. There's no need to strip out the spaces at both ends, because split splits a list of only a few last strings.
In [192]: a="a b c " In [194]: a.strip() Out[194]: 'a b c' In [195]: a.split() Out[195]: ['a', 'b', 'c'] In [196]: ''.join(a.split()) Out[196]: 'abc' In [ ]:
After successful research, write the function. This time it's called handle_blank.
In [197]: def handle_blank(wb, sheetname): """ //Cycle by column to confirm the target by parameters """ print(f'Start processing the worksheet:{sheetname}'.center(18, '-')) sheet = wb[sheetname] for col in sheet.iter_cols(): # Loop all columns without parameters for cell in col: print('Before amendment:', cell.value, end='') cell.value = ''.join(cell.value.split()) print('After modification->>',cell.value) In [198]: handle_blank(wb, 'Blank space')
3. Modify date and time format
Sometimes, we need to modify the format of the time-related cells in the table. Here we need to use the time module datetime in Python. After splicing the format we need, we use strftime to convert it.
Suppose we want to change the previous simple format of 1/November day to that of year-month-day, with a separator/or-, we need to use "% x" or "% Y-%m-%d" to operate. Note that the% plus letters here are officially defined formats, so we can use them to stitch together and pass them to the function.
More specific stitching formats are as follows:
In [199]: import datetime In [209]: d=datetime.datetime(2019,1,11) In [203]: d.strftime("%x") Out[203]: '01/11/19' In [205]: d.strftime("%Y-%m-%d") Out[205]: '2019-01-11'
After the study is completed, we write the function.
First, we need to use m, d = cell.value.split('/') to divide the previous simple date, get m, representing month and date, then use datetime to transform, generate time-related object day, note that the parameters are numbers, so use int conversion, and finally format the day output. When you write a function, you must remember to test it.
In [218]: def handle_time(wb, sheetname): """ //Cycle by column to confirm the target by parameters """ print(f'Start processing the worksheet:{sheetname}'.center(18, '-')) sheet = wb[sheetname] for col in sheet.iter_cols(max_col=1, min_row=2): # Find the column of time, the first column, starting with the second row for cell in col: print('Before amendment:', cell.value, end='') m, d = cell.value.split('/') day = datetime.datetime(2019, int(m), int(d)) cell.value = day.strftime("%Y-%m-%d") print('After modification->>',cell.value) In [220]: wb = load_data() handle_time(wb, 'time') save_as(wb)
4. Repair numbers and symbols
Next, the operations related to numbers and symbols are processed. Before joining us, many of the prices have decimal points. At this time, we want to save two decimal points and add the RMB symbol as the prefix. A new wave of research is needed.
There are decimal points. One is to ensure the number of digits. We need two digits here. The other is to round the redundant digits. There are two ways to do this, one is Decimal and the other is round. The difference is that when Decimal("0.00") specifies a number of digits, it will automatically fill in 0, and round will automatically drop when it encounters 0. And round is a little special in rounding calculation. Specifically, you can see the official documents.
Here we use Decimal to complete the relevant operations within the function. Remember the test!
In [227]: from decimal import Decimal In [240]: a = 3.1 b=Decimal(a).quantize(Decimal("0.00")) print(b) 3.10 In [244]: round(a,2) # Number automatic ellipsis 0 Out[244]: 3.1 In [247]: def handle_num(wb, sheetname): """ //Cycle by column to confirm the target by parameters """ print(f'Start processing the worksheet:{sheetname}'.center(18, '-')) sheet = wb[sheetname] for col in sheet.iter_cols(min_col=3, max_col=3, min_row=2): # Find the column of time, the first column, starting with the second row for cell in col: print('Before amendment:', cell.value, end='') # cell.value = round(float(cell.value), 3) cell.value = '¥' + str(Decimal(cell.value).quantize(Decimal("0.00"))) print('After modification->>',cell.value) In [249]: wb = load_data() handle_num(wb, 'Digit symbol') save_as(wb)
9 yuan micro column, take you Playing excel with Python
There are also group answering services.