Fix and augment check-for-inclusive-language CI check (#29549)
* Fix and augment `check-for-inclusive-language` CI check

Related: #15994 #23090

This PR addresses a few items related to inclusive language use and the CI check:

- There are several occurrences of "dummy" throughout the documentation; however, the current CI check for preventing non-inclusive language doesn't inspect `docs/` files. Ideally, the docs should also use inclusive language.
- Even after removing `docs/` from the exclusion list, non-inclusive language was still escaping pygrep. Upon inspection, the `(?x)` inline modifier was missing from the regex (although it was intended in #23090). Adding this modifier revealed these "dummy" instances and other related non-inclusive occurrences that were previously uncaught (see the sketch after this list).
- The exclusion list seemed too broad in places. There are still instances in which directories are excluded as a whole, but the list is now more tailored to non-inclusive occurrences that are beyond Airflow's purview, historical content, dev/test files, etc.

* Update links in get_pandas_df() of BigQueryHook

* dummy-command -> placeholder-command in _KubernetesDecoratedOperator
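
As an aside on the `(?ix)` fix: pygrep patterns are evaluated with Python's `re` module, and without the verbose flag the literal newlines and indentation of a multi-line pattern become part of every alternative, so the check silently matches almost nothing. A minimal sketch, using a shortened stand-in word list rather than the real entry:

```python
import re

# Shortened stand-in for the pre-commit "entry" pattern (not the full list).
pattern_without_x = r"""(?i)
    (black|white)[_-]?list|
    \bdummy\b
"""
pattern_with_x = r"""(?ix)
    (black|white)[_-]?list|
    \bdummy\b
"""

line = "uses a dummy value for the whitelist"

# Without (?x), each alternative still contains the literal newline and
# indentation, so nothing matches; with (?x), pattern whitespace is ignored.
print(bool(re.search(pattern_without_x, line)))  # False
print(bool(re.search(pattern_with_x, line)))     # True
```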
josh-fell committed Feb 22, 2023
1 parent 47ebe99 commit dba390e
Showing 24 changed files with 76 additions and 48 deletions.
62 changes: 45 additions & 17 deletions .pre-commit-config.yaml
@@ -438,7 +438,7 @@ repos:
name: Check for language that we do not accept as community
description: Please use more appropriate words for community documentation.
entry: >
(?i)
(?ix)
(black|white)[_-]?list|
\bshe\b|
\bhe\b|
@@ -451,25 +451,53 @@
pass_filenames: true
exclude: >
(?x)
^airflow/api_connexion/openapi/v1\.yaml$|
^airflow/cli/commands/webserver_command\.py$|
^airflow/config_templates/config\.yml$|
^airflow/config_templates/default_airflow\.cfg$|
^airflow/providers/|
^airflow/www/fab_security/manager\.py$|
^airflow/www/static/|
^docs/.*$|
^docs/apache-airflow-providers-apache-cassandra/connections/cassandra\.rst$|
^docs/apache-airflow-providers-apache-hive/commits\.rst$|
^tests/cli/commands/test_internal_api_command\.py$|
^tests/cli/commands/test_webserver_command\.py$|
^tests/integration/providers/apache/cassandra/hooks/test_cassandra\.py$|
^tests/providers/|
^tests/providers/apache/cassandra/hooks/test_cassandra\.py$|
^tests/system/providers/apache/spark/example_spark_dag\.py$|
^airflow/api_connexion/openapi/v1.yaml$|
^airflow/cli/commands/webserver_command.py$|
^airflow/config_templates/|
^airflow/models/baseoperator.py$|
^airflow/operators/__init__.py$|
^airflow/providers/amazon/aws/hooks/emr.py$|
^airflow/providers/amazon/aws/operators/emr.py$|
^airflow/providers/apache/cassandra/hooks/cassandra.py$|
^airflow/providers/apache/hive/operators/hive_stats.py$|
^airflow/providers/apache/hive/transfers/vertica_to_hive.py$|
^airflow/providers/apache/spark/hooks/|
^airflow/providers/apache/spark/operators/|
^airflow/providers/exasol/hooks/exasol.py$|
^airflow/providers/google/cloud/hooks/bigquery.py$|
^airflow/providers/google/cloud/operators/cloud_build.py$|
^airflow/providers/google/cloud/operators/dataproc.py$|
^airflow/providers/google/cloud/operators/mlengine.py$|
^airflow/providers/microsoft/azure/hooks/cosmos.py$|
^airflow/providers/microsoft/winrm/hooks/winrm.py$|
^airflow/www/fab_security/manager.py$|
^docs/.*commits.rst$|
^docs/apache-airflow/administration-and-deployment/security/webserver.rst$|
^docs/apache-airflow-providers-apache-cassandra/connections/cassandra.rst$|
^airflow/providers/microsoft/winrm/operators/winrm.py$|
^airflow/providers/opsgenie/hooks/opsgenie.py$|
^airflow/providers/redis/provider.yaml$|
^airflow/serialization/serialized_objects.py$|
^airflow/utils/db.py$|
^airflow/utils/trigger_rule.py$|
^airflow/www/static/css/bootstrap-theme.css$|
^airflow/www/static/js/types/api-generated.ts$|
^airflow/www/templates/appbuilder/flash.html$|
^dev/|
^docs/README.rst$|
^docs/apache-airflow-providers-amazon/secrets-backends/aws-ssm-parameter-store.rst$|
^docs/apache-airflow-providers-apache-hdfs/connections.rst$|
^docs/apache-airflow-providers-google/operators/cloud/kubernetes_engine.rst$|
^docs/apache-airflow-providers-microsoft-azure/connections/azure_cosmos.rst$|
^docs/conf.py$|
^docs/exts/removemarktransform.py$|
^scripts/ci/pre_commit/pre_commit_vendor_k8s_json_schema.py$|
^tests/|
^.pre-commit-config\.yaml$|
^.*CHANGELOG\.(rst|txt)$|
^.*RELEASE_NOTES\.rst$|
^CONTRIBUTORS_QUICK_START.rst$|
^.*\.(png|gif|jp[e]?g|tgz|lock)$|
git
- id: check-base-operator-partial-arguments
name: Check BaseOperator and partial() arguments
8 changes: 4 additions & 4 deletions airflow/api_connexion/endpoints/connection_endpoint.py
@@ -178,16 +178,16 @@ def test_connection() -> APIResponse:
"""
Test an API connection.
This method first creates an in-memory dummy conn_id & exports that to an
This method first creates an in-memory transient conn_id & exports that to an
env var, as some hook classes try to find the connection from their __init__ method and error out
if it is not found. It also deletes the conn id env variable after the test.
"""
body = request.json
dummy_conn_id = get_random_string()
conn_env_var = f"{CONN_ENV_PREFIX}{dummy_conn_id.upper()}"
transient_conn_id = get_random_string()
conn_env_var = f"{CONN_ENV_PREFIX}{transient_conn_id.upper()}"
try:
data = connection_schema.load(body)
data["conn_id"] = dummy_conn_id
data["conn_id"] = transient_conn_id
conn = Connection(**data)
os.environ[conn_env_var] = conn.get_uri()
status, message = conn.test_connection()
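
A rough sketch of the environment-variable connection pattern this endpoint relies on (the connection id and host below are made up): Airflow resolves connections from variables named `AIRFLOW_CONN_<CONN_ID>`, so exporting a URI under a random id makes the connection discoverable by hooks during the test and easy to clean up afterwards:

```python
import os
from airflow.models.connection import Connection

# Hypothetical connection used only for illustration.
conn = Connection(conn_id="transient_test_conn", conn_type="http", host="example.com")

env_var = f"AIRFLOW_CONN_{conn.conn_id.upper()}"
os.environ[env_var] = conn.get_uri()  # make the conn discoverable by hooks
try:
    status, message = conn.test_connection()
finally:
    # Always remove the temporary environment variable.
    os.environ.pop(env_var, None)
```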
@@ -22,7 +22,7 @@
apiVersion: v1
kind: Pod
metadata:
name: dummy-name
name: placeholder-name
spec:
containers:
- env:
@@ -22,7 +22,7 @@
apiVersion: v1
kind: Pod
metadata:
name: dummy-name
name: placeholder-name
spec:
containers:
- env:
6 changes: 3 additions & 3 deletions airflow/kubernetes_executor_templates/basic_template.yaml
@@ -18,14 +18,14 @@
kind: Pod
apiVersion: v1
metadata:
name: dummy-name-dont-delete
namespace: dummy-name-dont-delete
name: placeholder-name-dont-delete
namespace: placeholder-name-dont-delete
labels:
mylabel: foo
spec:
containers:
- name: base
image: dummy-name-dont-delete
image: placeholder-name-dont-delete
env:
- name: AIRFLOW__CORE__EXECUTOR
value: LocalExecutor
2 changes: 1 addition & 1 deletion airflow/providers/cncf/kubernetes/decorators/kubernetes.py
@@ -73,7 +73,7 @@ def __init__(self, namespace: str = "default", use_dill: bool = False, **kwargs)
super().__init__(
namespace=namespace,
name=kwargs.pop("name", f"k8s_airflow_pod_{uuid.uuid4().hex}"),
cmds=["dummy-command"],
cmds=["placeholder-command"],
**kwargs,
)

2 changes: 1 addition & 1 deletion airflow/providers/docker/decorators/docker.py
@@ -90,7 +90,7 @@ def __init__(
expect_airflow: bool = True,
**kwargs,
) -> None:
command = "dummy command"
command = "placeholder command"
self.python_command = python_command
self.expect_airflow = expect_airflow
self.pickling_library = dill if use_dill else pickle
4 changes: 2 additions & 2 deletions airflow/providers/google/cloud/hooks/bigquery.py
@@ -240,8 +240,8 @@ def get_pandas_df(
query. The DbApiHook method must be overridden because Pandas
doesn't support PEP 249 connections, except for SQLite. See:
https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L447
https://github.com/pydata/pandas/issues/6900
https://github.com/pandas-dev/pandas/blob/055d008615272a1ceca9720dc365a2abd316f353/pandas/io/sql.py#L415
https://github.com/pandas-dev/pandas/issues/6900
:param sql: The BigQuery SQL to execute.
:param parameters: The parameters to render the SQL query with (not
2 changes: 1 addition & 1 deletion airflow/providers/google/provider.yaml
@@ -434,7 +434,7 @@ integrations:
- /docs/apache-airflow-providers-google/operators/cloud/workflows.rst
tags: [gcp]
- integration-name: Google LevelDB
external-doc-url: https://github.com/google/leveldb/blob/master/doc/index.md
external-doc-url: https://github.com/google/leveldb/blob/main/doc/index.md
how-to-guide:
- /docs/apache-airflow-providers-google/operators/leveldb/leveldb.rst
tags: [google]
4 changes: 2 additions & 2 deletions airflow/utils/dag_edges.py
@@ -26,9 +26,9 @@ def dag_edges(dag: DAG):
Create the list of edges needed to construct the Graph view.
A special case is made if a TaskGroup is immediately upstream/downstream of another
TaskGroup or task. Two dummy nodes named upstream_join_id and downstream_join_id are
TaskGroup or task. Two proxy nodes named upstream_join_id and downstream_join_id are
created for the TaskGroup. Instead of drawing an edge onto every task in the TaskGroup,
all edges are directed onto the dummy nodes. This is to cut down the number of edges on
all edges are directed onto the proxy nodes. This is to cut down the number of edges on
the graph.
For example: A DAG with TaskGroups group1 and group2:
4 changes: 2 additions & 2 deletions airflow/utils/task_group.py
@@ -375,7 +375,7 @@ def child_id(self, label):
@property
def upstream_join_id(self) -> str:
"""
If this TaskGroup has immediate upstream TaskGroups or tasks, a dummy node called
If this TaskGroup has immediate upstream TaskGroups or tasks, a proxy node called
upstream_join_id will be created in Graph view to join the outgoing edges from this
TaskGroup to reduce the total number of edges needed to be displayed.
"""
@@ -384,7 +384,7 @@ def upstream_join_id(self) -> str:
@property
def downstream_join_id(self) -> str:
"""
If this TaskGroup has immediate downstream TaskGroups or tasks, a dummy node called
If this TaskGroup has immediate downstream TaskGroups or tasks, a proxy node called
downstream_join_id will be created in Graph view to join the outgoing edges from this
TaskGroup to reduce the total number of edges needed to be displayed.
"""
2 changes: 1 addition & 1 deletion chart/files/pod-template-file.kubernetes-helm-yaml
@@ -23,7 +23,7 @@
apiVersion: v1
kind: Pod
metadata:
name: dummy-name
name: placeholder-name
labels:
tier: airflow
component: worker
2 changes: 1 addition & 1 deletion chart/values.schema.json
@@ -5242,7 +5242,7 @@
"x-docsSection": "Airflow",
"default": null,
"examples": [
"apiVersion: v1\nkind: Pod\nmetadata:\n name: dummy-name\n labels:\n tier: airflow\n component: worker\n release: {{ .Release.Name }}\nspec:\n priorityClassName: high-priority\n containers:\n - name: base\n ..."
"apiVersion: v1\nkind: Pod\nmetadata:\n name: placeholder-name\n labels:\n tier: airflow\n component: worker\n release: {{ .Release.Name }}\nspec:\n priorityClassName: high-priority\n containers:\n - name: base\n ..."
]
},
"dags": {
2 changes: 1 addition & 1 deletion chart/values.yaml
@@ -1840,7 +1840,7 @@ podTemplate: ~
# apiVersion: v1
# kind: Pod
# metadata:
# name: dummy-name
# name: placeholder-name
# labels:
# tier: airflow
# component: worker
@@ -48,5 +48,5 @@ Reference

For further information, look at:

* `Product Documentation <https://github.com/google/leveldb/blob/master/doc/index.md>`__
* `Product Documentation <https://github.com/google/leveldb/blob/main/doc/index.md>`__
* `Client Library Documentation <https://plyvel.readthedocs.io/en/latest/>`__
2 changes: 1 addition & 1 deletion docs/apache-airflow/authoring-and-scheduling/plugins.rst
@@ -161,7 +161,7 @@ Make sure you restart the webserver and scheduler after making changes to plugin
Example
-------

The code below defines a plugin that injects a set of dummy object
The code below defines a plugin that injects a set of illustrative object
definitions in Airflow.

.. code-block:: python
2 changes: 1 addition & 1 deletion docs/apache-airflow/authoring-and-scheduling/timezone.rst
@@ -96,7 +96,7 @@ words if you have a default time zone setting of ``Europe/Amsterdam`` and create
start_date=pendulum.datetime(2017, 1, 1, tz="UTC"),
default_args={"retries": 3},
)
op = BashOperator(task_id="dummy", bash_command="Hello World!", dag=dag)
op = BashOperator(task_id="hello_world", bash_command="Hello World!", dag=dag)
print(op.retries) # 3
Unfortunately, during DST transitions, some datetimes don't exist or are ambiguous.
2 changes: 1 addition & 1 deletion docs/apache-airflow/best-practices.rst
@@ -690,7 +690,7 @@ A *better* way (though it's a bit more manual) is to use the ``dags pause`` comm
Add "integration test" DAGs
---------------------------

It can be helpful to add a couple "integration test" DAGs that use all the common services in your ecosystem (e.g. S3, Snowflake, Vault) but with dummy resources or "dev" accounts. These test DAGs can be the ones you turn on *first* after an upgrade, because if they fail, it doesn't matter and you can revert to your backup without negative consequences. However, if they succeed, they should prove that your cluster is able to run tasks with the libraries and services that you need to use.
It can be helpful to add a couple "integration test" DAGs that use all the common services in your ecosystem (e.g. S3, Snowflake, Vault) but with test resources or "dev" accounts. These test DAGs can be the ones you turn on *first* after an upgrade, because if they fail, it doesn't matter and you can revert to your backup without negative consequences. However, if they succeed, they should prove that your cluster is able to run tasks with the libraries and services that you need to use.

For example, if you use an external secrets backend, make sure you have a task that retrieves a connection. If you use KubernetesPodOperator, add a task that runs ``sleep 30; echo "hello"``. If you need to write to s3, do so in a test task. And if you need to access a database, add a task that does ``select 1`` from the server.
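
A minimal sketch of such an integration-test DAG, assuming a hypothetical dev Postgres connection id (every name below is illustrative):

```python
import pendulum

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="integration_smoke_test",  # hypothetical DAG id
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,
    tags=["integration-test"],
):
    # Cheap task that only proves the workers can run commands.
    smoke = BashOperator(task_id="smoke", bash_command='sleep 30; echo "hello"')

    # Proves the database connection and provider packages still work.
    db_check = PostgresOperator(
        task_id="db_check",
        postgres_conn_id="my_dev_postgres",  # assumed dev connection id
        sql="SELECT 1",
    )

    smoke >> db_check
```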

2 changes: 1 addition & 1 deletion docs/apache-airflow/core-concepts/dags.rst
@@ -258,7 +258,7 @@ Often, many Operators inside a DAG need the same set of default arguments (such
schedule="@daily",
default_args={"retries": 2},
):
op = BashOperator(task_id="dummy", bash_command="Hello World!")
op = BashOperator(task_id="hello_world", bash_command="Hello World!")
print(op.retries) # 2
2 changes: 1 addition & 1 deletion docs/apache-airflow/core-concepts/params.rst
@@ -139,7 +139,7 @@ JSON Schema Validation
# a required param which can be of multiple types
# a param must have a default value
"dummy": Param(5, type=["null", "number", "string"]),
"multi_type_param": Param(5, type=["null", "number", "string"]),
# an enum param, must be one of three values
"enum_param": Param("foo", enum=["foo", "bar", 42]),
2 changes: 1 addition & 1 deletion docs/apache-airflow/howto/connection.rst
@@ -214,7 +214,7 @@ for description on how to add custom providers.

The custom connection types are defined via Hooks delivered by the providers. The Hooks can implement
methods defined in the protocol class :class:`~airflow.hooks.base_hook.DiscoverableHook`. Note that your
custom Hook should not derive from this class, this class is a dummy example to document expectations
custom Hook should not derive from this class, this class is an example to document expectations
regarding the class fields and methods that your Hook might define. Another good example is
:py:class:`~airflow.providers.jdbc.hooks.jdbc.JdbcHook`.
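
A rough sketch of what such a Hook might look like, with hypothetical names, providing the class-level fields and methods the protocol documents without inheriting from `DiscoverableHook`:

```python
from __future__ import annotations

from typing import Any

from airflow.hooks.base import BaseHook


class MyServiceHook(BaseHook):
    """Hypothetical Hook exposing the members the DiscoverableHook protocol documents."""

    conn_name_attr = "my_service_conn_id"
    default_conn_name = "my_service_default"
    conn_type = "my_service"
    hook_name = "My Service"

    @staticmethod
    def get_ui_field_behaviour() -> dict[str, Any]:
        # Hide or relabel the standard connection form fields in the UI.
        return {
            "hidden_fields": ["schema", "extra"],
            "relabeling": {"login": "API Key"},
        }

    def get_conn(self) -> Any:
        # A real hook would build a client from the stored Connection here.
        raise NotImplementedError
```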

2 changes: 1 addition & 1 deletion docs/apache-airflow/tutorial/fundamentals.rst
@@ -393,7 +393,7 @@ which are used to populate the run schedule with task instances from this DAG.
What's Next?
-------------
That's it! You have written, tested and backfilled your very first Airflow
pipeline. Merging your code into a repository that has a master scheduler
pipeline. Merging your code into a repository that has a Scheduler
running against it should result in it being triggered and run every day.

Here are a few things you might want to do next:
2 changes: 1 addition & 1 deletion docs/helm-chart/customizing-workers.rst
@@ -57,7 +57,7 @@ As an example, let's say you want to set ``priorityClassName`` on your workers:
apiVersion: v1
kind: Pod
metadata:
name: dummy-name
name: placeholder-name
labels:
tier: airflow
component: worker
2 changes: 1 addition & 1 deletion scripts/in_container/check_environment.sh
@@ -116,7 +116,7 @@ function startairflow_if_requested() {
. "$( dirname "${BASH_SOURCE[0]}" )/configure_environment.sh"

airflow db init
airflow users create -u admin -p admin -f Thor -l Adminstra -r Admin -e dummy@dummy.email
airflow users create -u admin -p admin -f Thor -l Adminstra -r Admin -e admin@email.domain

. "$( dirname "${BASH_SOURCE[0]}" )/run_init_script.sh"

