TensorFlow 2.0 learning notes

Posted by kevdotbadger on Fri, 18 Feb 2022 03:21:43 +0100

1 tf.data

tf.data is the module used to create datasets in TensorFlow 2.0. It contains many dataset classes, among which Dataset and TFRecordDataset are the ones commonly used to encapsulate data. Dataset wraps existing in-memory data, while TFRecord is a special TensorFlow data format that can speed up reading and writing data.

1.1 tf.data.Dataset

1.1.1 basic API of dataset

  1. from_tensor_slices: this method creates a Dataset. The argument can be a list, tuple, dictionary, or NumPy array, but the size of the first dimension of every element must be equal. The operation slices the argument along its first dimension and wraps the slices into a Dataset.
# Create a new Dataset
import os
import numpy as np
import tensorflow as tf

# 1. Create Dataset from list
dataset = tf.data.Dataset.from_tensor_slices([1,2,3])
print("View dataset element values, shapes, and types")
for element in dataset:
    print(element)
print("View element values only")
for element in dataset.as_numpy_iterator():
    print(element)
# 2. Create Dataset from numpy array
arr = np.asarray(([1,2],[3,4],[5,6]))
dataset = tf.data.Dataset.from_tensor_slices(arr)
for element in dataset.as_numpy_iterator():
    print(element)
# 3. Create Dataset from tuple
dataset = tf.data.Dataset.from_tensor_slices(([[1, 2],[1,2]], [3, 4], [5, 6]))
for element in dataset.as_numpy_iterator():
    print(element)

# The size of the first dimension of each component must be the same
dataset = tf.data.Dataset.from_tensor_slices(([1,2,3], [3, 4], [5, 6])) 
# ValueError: Dimensions 3 and 2 are not compatible

# 4. Create Dataset from dictionary
dataset = tf.data.Dataset.from_tensor_slices({"a": [1, 2], "b": [3, 4]})
for element in dataset.as_numpy_iterator():
    print(element)
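
The most common use of the first-dimension rule is pairing a feature array with a label array: from_tensor_slices slices both along the first dimension and yields (feature, label) pairs. A minimal sketch (the arrays here are made up for illustration):

# 5. Pair features with labels (sketch; arrays are illustrative)
features = np.arange(12).reshape(4, 3)  # 4 samples, 3 features each
labels = np.array([0, 1, 0, 1])         # 4 labels; first dimension matches
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
for x, y in dataset.as_numpy_iterator():
    print(x, y)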
 

  2. repeat/batch: repeat duplicates a dataset and batch groups its elements into batches; the two can be chained.

dataset = tf.data.Dataset.from_tensor_slices([1,2,3])
dataset1 = dataset.repeat(4)
print(list(dataset1.as_numpy_iterator()))
# => [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
dataset2 = dataset1.batch(5)
print(list(dataset2.as_numpy_iterator()))
# => batches of [1,2,3,1,2], [3,1,2,3,1], [2,3]
# If you want every batch to have the same shape, set drop_remainder=True
# to discard the final incomplete batch
dataset3 = dataset.repeat(4).batch(5, drop_remainder=True)
print(list(dataset3.as_numpy_iterator()))
# => batches of [1,2,3,1,2], [3,1,2,3,1]
  3. interleave: used to transform a dataset and merge the results, and can process multiple datasets in parallel. It works as follows: map_func is applied to cycle_length input elements to produce nested Datasets; these are then cycled over, taking block_length consecutive elements from each in turn until they are exhausted.

interleave(
    map_func, cycle_length=None, block_length=None,
    num_parallel_calls=None, deterministic=None, name=None
)

dataset = tf.data.Dataset.range(1, 6)  # ==> [ 1, 2, 3, 4, 5 ]
dataset = dataset.interleave(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(6),
    cycle_length=3, block_length=4)
list(dataset.as_numpy_iterator())
# Blocks of 4 are taken from 3 datasets in turn, then the leftovers:
# => [1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1, 2,2, 3,3, 4,4,4,4, 5,5,5,5, 4,4, 5,5]

interleave can also be used to create a Dataset from CSV files. The procedure: 1. first create a filename dataset that contains the file names (see 1.1.2); 2. use the interleave method to read the files in the filename dataset and merge the contents of multiple files into one complete Dataset.

1.1.2 create Dataset from csv file

1. list_files: list_files matches the files under a path pattern and generates a filename dataset. If the list of matching files has already been collected, you can use from_tensor_slices directly to create the filename dataset.

# 1. Use from_tensor_slices to create a filename dataset
## Read the file names from the directory of csv files; the generate_csv
## directory contains train, test, and valid csv files. The train files
## are picked out and stored in a list.
filename_list = os.listdir('D:\\Projects_File\\Jupyter projects\\tensorflow2.0_course\\chapter_4\\generate_csv')
train_filename = []
for filename in filename_list:
    if filename.startswith('train'):
        train_filename.append(filename)
filename_dataset = tf.data.Dataset.from_tensor_slices(train_filename)
for filename in filename_dataset:
    print(filename)
# 2. Use list_files to create a filename dataset, with a wildcard to match the train files
filename_dataset = tf.data.Dataset.list_files(".\\tensorflow2.0_course\\chapter_4\\generate_csv\\train*.csv",shuffle=False)
for idx, filename in enumerate(filename_dataset):
    print(idx, filename)
# Use interleave to merge multiple files into one complete dataset;
# n_readers is the number of files read in parallel (cycle_length)
n_readers = 5
dataset = filename_dataset.interleave(
    lambda filename: tf.data.TextLineDataset(filename).skip(1),  # skip(1) drops the header row
    cycle_length = n_readers,
    block_length = 2)
for line in dataset.take(15):
    print(line.numpy())

1.1.3 parsing csv files

CSV is a common and relatively simple comma-separated values file format; it is a plain text file used to store data. Plain text means a CSV file is just a sequence of characters, so it must be parsed into numerical data before it can be used for deep learning.

  1. tf.io.decode_csv: converts CSV records into tensors, one tensor per column. record_defaults specifies the default value, and therefore the data type, of each column; the length of the list must match the number of columns in the CSV file, otherwise an error is raised.
def parse_csv_line(line, n_fields = 9):
    # Parse every column as float32 (the np.nan constant implies a float32 default)
    defs = [tf.constant(np.nan)] * n_fields
    parsed_fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(parsed_fields[0:-1])  # first 8 columns are the features
    y = tf.stack(parsed_fields[-1:])   # last column is the label
    return x, y

parse_csv_line(b'-0.9868720801669367,0.832863080552588,-0.18684708416901633,-0.14888949288707784,-0.4532302419670616,-0.11504995754593579,1.6730974284189664,-0.7465496877362412,1.138',
               n_fields=9)
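
Putting the pieces together, here is a minimal sketch of a complete CSV reading pipeline; the helper name csv_reader_dataset, the shuffle buffer, and the batch size are my own choices, and the file pattern assumes the generate_csv layout used above:

def csv_reader_dataset(file_pattern, n_readers=5, batch_size=32):
    # filename dataset -> merged line dataset -> parsed (x, y) dataset
    filename_dataset = tf.data.Dataset.list_files(file_pattern, shuffle=False)
    dataset = filename_dataset.interleave(
        lambda filename: tf.data.TextLineDataset(filename).skip(1),
        cycle_length=n_readers)
    dataset = dataset.map(parse_csv_line)
    return dataset.shuffle(1000).repeat().batch(batch_size)

train_set = csv_reader_dataset(".\\tensorflow2.0_course\\chapter_4\\generate_csv\\train*.csv")
for x_batch, y_batch in train_set.take(1):
    print(x_batch.shape, y_batch.shape)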

1.2 tf.data.TFRecordDataset

1.2.1 create TFRecord file

Introduction to TFRecord and tf.train.Example: to read data efficiently, it helps to serialize the data and store it in a set of files (100-200MB each) that can be read linearly. This is especially useful when the data is streamed over a network, and also for caching the results of any data preprocessing. See the official TensorFlow documentation for details.

The structure of tf.train.Example is as follows:
-> tf.train.Example: what a TFRecord file stores are serialized Examples
----> tf.train.Features: an Example contains multiple features in the format of a dict {"key": tf.train.Feature}
--------> tf.train.Feature: the value of a feature has a specific type: tf.train.BytesList / FloatList / Int64List

favorite_books = [name.encode('utf-8')
                  for name in ["machine learning", "cc150"]]
favorite_books_bytelist = tf.train.BytesList(value = favorite_books)
print(favorite_books_bytelist)

hours_floatlist = tf.train.FloatList(value = [15.5, 9.5, 7.0, 8.0])
print(hours_floatlist)

age_int64list = tf.train.Int64List(value = [42])
print(age_int64list)

# Build Features in the format of {"key": value}
features = tf.train.Features(
    feature = {
        "favorite_books": tf.train.Feature(
            bytes_list = favorite_books_bytelist),
        "hours": tf.train.Feature(
            float_list = hours_floatlist),
        "age": tf.train.Feature(int64_list = age_int64list),
    }
)
print(features)

# Serialize the example, then write it to a tfrecord file
example = tf.train.Example(features=features)
print(example)

serialized_example = example.SerializeToString()
print(serialized_example)

# Save tfrecord file
output_dir = 'tfrecord_basic'
if not os.path.exists(output_dir):
    os.mkdir(output_dir)
filename = "test.tfrecords"
filename_fullpath = os.path.join(output_dir, filename)
with tf.io.TFRecordWriter(filename_fullpath) as writer:
    for i in range(3):
        writer.write(serialized_example)

1.2.2 create TFRecordDataset from TFRecord file

dataset = tf.data.TFRecordDataset([filename_fullpath])
for serialized_example_tensor in dataset:
    print(serialized_example_tensor)

1.2.3 parsing TFRecord files

Usage: tf.io.parse_single_example. Parsing a TFRecord file is similar to parsing a CSV file: you must first define the types of the features to be parsed.

expected_features = {
    "favorite_books": tf.io.VarLenFeature(dtype = tf.string),
    "hours": tf.io.VarLenFeature(dtype = tf.float32),
    "age": tf.io.FixedLenFeature([], dtype = tf.int64),
}
dataset = tf.data.TFRecordDataset([filename_fullpath])
for serialized_example_tensor in dataset:
    example = tf.io.parse_single_example(
        serialized_example_tensor,
        expected_features)
    books = tf.sparse.to_dense(example["favorite_books"],
                               default_value=b"")
    for book in books:
        print(book.numpy().decode("UTF-8"))
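
VarLenFeature entries come back as SparseTensors, which is why favorite_books goes through tf.sparse.to_dense above; the hours feature can be densified the same way, while the FixedLenFeature age is already dense. A minimal sketch wrapping the parsing into a map-based pipeline (the helper name parse_example is my own):

def parse_example(serialized_example):
    # Parse one serialized Example into dense tensors
    example = tf.io.parse_single_example(serialized_example, expected_features)
    books = tf.sparse.to_dense(example["favorite_books"], default_value=b"")
    hours = tf.sparse.to_dense(example["hours"])  # numeric default fill is 0
    age = example["age"]
    return books, hours, age

dataset = tf.data.TFRecordDataset([filename_fullpath]).map(parse_example)
for books, hours, age in dataset:
    print(books.numpy(), hours.numpy(), age.numpy())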

Topics: Python TensorFlow